NSLS-II / wishlist

an issue tracker for the big picture

Some thoughts on shared user code and shared derived data #88

Open danielballan opened 9 years ago

danielballan commented 9 years ago

Here I will condense some thoughts from conversations with Mizuki, Rob Petkus, @stuwilkins, and others.

Beamline scientists, postdocs, and users are abusing the beamline user home directories (e.g., /home/xf23id1) and group-writable shared directories (/XF11ID) to store scripts, notebooks, and large files containing derived data. This has several downsides:

  • The scripts and notebooks quickly become a junk heap with no provenance or organization. No one can be sure when it is safe to delete things because ownership and usage are unclear.
  • The files of derived data -- for example, HDF5 files with "corrected" versions of the images from various scans -- put a strain on the NFS that it is not designed to handle.

However, there is presently no viable alternative. To create one, I propose that we:

  1. Do a better job of capturing reusable scientific code in the beamline-specific repos (chxtools, etc.) and, in some cases, in scikit-xray. Some beamlines are already doing this pretty well, but we can do more to facilitate it.
  2. Distribute curated example/template notebooks through jupyterhub (more on this later...).
  3. When data is commonly processed in the same well-defined way (e.g., dark-frame subtraction), process it automatically and re-capture the results in filestore (see the sketch below). This eliminates one reason* for users to stash many large files in their NFS home directories.
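
To make (3) concrete, here is a minimal sketch of what an automated dark-frame correction step could look like. The function names, the `frames` dataset layout, and the idea of writing corrected frames to a new HDF5 file for filestore to reference are all assumptions for illustration; the actual hook into filestore/metadatastore is exactly the work that (3) would require.

```python
# Minimal sketch (not an existing pipeline): automatic dark-frame subtraction
# for a scan, writing the corrected frames to a new HDF5 file. The dataset
# name 'frames' and the file layout are assumptions for illustration only.
import numpy as np
import h5py


def subtract_dark(frames, dark_frames):
    """Subtract the mean dark frame from a stack of raw frames."""
    dark = dark_frames.mean(axis=0)
    corrected = frames - dark
    np.clip(corrected, 0, None, out=corrected)  # clip away negative counts
    return corrected


def process_scan(raw_path, dark_path, out_path):
    """Read raw and dark frames from HDF5 and write a corrected copy."""
    with h5py.File(raw_path, 'r') as raw, h5py.File(dark_path, 'r') as dark:
        corrected = subtract_dark(raw['frames'][:], dark['frames'][:])
    with h5py.File(out_path, 'w') as out:
        out.create_dataset('corrected_frames', data=corrected,
                           compression='gzip')
    # TODO: register out_path with filestore and record provenance
    # (raw/dark scan uids) in metadatastore -- the API work described in (3).
    return out_path
```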

All of these ideas have been kicked around for a while. My point is that they are prerequisites to getting the scientists and users away from doing all their work in one messy, shared directory.

We can push on (1) right away. CHX, with @sameera2004 and @ericdill, has been doing some early work in the direction of (2), and I will be working on that in the next couple of days. Finally, doing (3) involves some major work on metadatastore, filestore, and the databroker.

*Regarding (3): it would not solve the issue universally, of course. We cannot capture every form of derived data the users might want to store. We can also teach users to redo cheap analysis on the fly: in some cases, reading/writing may be more expensive than recomputing. But ultimately, it would be nice if people were allowed to put large files in their home directories.

cowanml commented 9 years ago

this has happened everywhere there are shared accounts since the dawn of time
