NSLS-II / wishlist

an issue tracker for the big picture

Some thoughts on shared user code and shared derived data #88

Open danielballan opened 9 years ago

danielballan commented 9 years ago

Here I will condense some thoughts from conversations with Mizuki, Rob Petkus, @stuwilkins, and others.

Beamline scientists, postdocs, and users are abusing the beamline user home directories (e.g., /home/xf23id1) and group-writable shared directories (/XF11ID) to store scripts, notebooks, and large files containing derived data. This has several downsides:

  • The scripts and notebooks quickly become a junk heap with no provenance or organization. No one can be sure when it is safe to delete things because ownership and usage are unclear.
  • The files of derived data -- for example, HDF5 files with "corrected" versions of the images from various scans -- put a strain on the NFS that it is not designed to handle.

However, there is presently no viable alternative. To create one, I propose that we:

  1. Do a better job of capturing reusable scientific code in the beamline-specific repos (chxtools, etc.) and, in some cases, in scikit-xray. Some beamlines are already doing this pretty well, but we can do more to facilitate it.
  2. Distribute curated example/template notebooks through jupyterhub (more on this later...).
  3. When data is commonly processed in the same well-defined way (e.g., dark-frame subtraction), process it automatically and re-capture the results in filestore (see the sketch below). This eliminates one reason* for users to stash many large files in their NFS home directories.
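
To make (3) concrete, here is a minimal sketch of what an automated dark-frame correction step could look like. The function names, the `frames` dataset layout, and the idea of writing corrected frames to a new HDF5 file for filestore to reference are all assumptions for illustration; the actual hook into filestore/metadatastore is exactly the work that (3) would require.

```python
# Minimal sketch (not an existing pipeline): automatic dark-frame subtraction
# for a scan, writing the corrected frames to a new HDF5 file. The dataset
# name 'frames' and the file layout are assumptions for illustration only.
import numpy as np
import h5py


def subtract_dark(frames, dark_frames):
    """Subtract the mean dark frame from a stack of raw frames."""
    dark = dark_frames.mean(axis=0)
    corrected = frames - dark
    np.clip(corrected, 0, None, out=corrected)  # clip away negative counts
    return corrected


def process_scan(raw_path, dark_path, out_path):
    """Read raw and dark frames from HDF5 and write a corrected copy."""
    with h5py.File(raw_path, 'r') as raw, h5py.File(dark_path, 'r') as dark:
        corrected = subtract_dark(raw['frames'][:], dark['frames'][:])
    with h5py.File(out_path, 'w') as out:
        out.create_dataset('corrected_frames', data=corrected,
                           compression='gzip')
    # TODO: register out_path with filestore and record provenance
    # (raw/dark scan uids) in metadatastore -- the API work described in (3).
    return out_path
```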

All of these ideas have been kicked around for a while. My point is that they are prerequisites to getting the scientists and users away from doing all their work in one messy, shared directory.

We can push on (1) right away. CHX, with @sameera2004 and @ericdill, has been doing some early work in the direction of (2), and I will be working on that in the next couple of days. Finally, doing (3) involves some major work on metadatastore, filestore, and the databroker.

*Regarding (3): it would not solve the issue universally, of course. We cannot capture every form of derived data the users might want to store. We can also teach users to redo cheap analysis on the fly: in some cases, reading/writing may be more expensive than recomputing. But ultimately, it would be nice if people were allowed to put large files in their home directories.

cowanml commented 9 years ago

this has happened everywhere there are shared accounts since the dawn of time
