joezuntz closed this issue 2 years ago
There is a significant overlap in functionality between GCRCatalogs and the proposed new facility, but each also does things the other does not. Where they do overlap (that includes at least registration and look-up) they should be smoothly integrated. I suspect that will entail enhancements to GCRCatalogs and possibly constraints on how the new facility handles these functions.
As @JoanneBogart said, the "catalog register" part of GCRCatalogs has many of the functionalities mentioned in the slides. It uses a set of yaml files to keep track of data sets. In the yaml there are some reserved keywords, but otherwise it's very flexible and can store any metadata. It also allows each data set to have aliases. Joanne and I have implemented some minimal multi-site support as well.
We have talked about splitting out the "catalog register" part of GCRCatalogs as a standalone package/service, but it wasn't a priority so we never got around to it. We can revisit that option.
Maybe Joanne and I can prepare a presentation where we walk through these less-advertised features of GCRCatalogs, so that people can make an informed decision on whether it's something we want to keep using.
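To make the discussion above more concrete, here is a rough sketch of what a yaml-based "catalog register" with reserved keywords, free-form metadata, and aliases looks like. The class and key names below are hypothetical illustrations of the idea, not the actual GCRCatalogs code or config schema:

```python
# Toy sketch of a catalog register: each entry stands in for a parsed yaml
# config file, with a couple of reserved keywords plus arbitrary metadata.
# All names here are illustrative, not the real GCRCatalogs API.

class CatalogRegister:
    def __init__(self):
        self._configs = {}   # catalog name -> config dict
        self._aliases = {}   # alias -> canonical catalog name

    def register(self, name, config):
        """Store a config; the 'alias' key is treated as a reserved keyword."""
        self._configs[name] = config
        for alias in config.get("alias", []):
            self._aliases[alias] = name

    def resolve(self, name):
        """Map an alias to its canonical name (identity if not an alias)."""
        return self._aliases.get(name, name)

    def get_config(self, name):
        return self._configs[self.resolve(name)]

    def available(self):
        return sorted(self._configs)


register = CatalogRegister()
register.register("cosmoDC2_v1.1.4", {
    "reader_class": "CosmoDC2Catalog",   # reserved keyword (illustrative)
    "alias": ["cosmoDC2"],               # reserved keyword (illustrative)
    "site": "nersc",                     # arbitrary, free-form metadata
})

print(register.resolve("cosmoDC2"))           # canonical versioned name
print(register.get_config("cosmoDC2")["site"])
```

The alias mechanism is the key piece: analysis code can ask for "cosmoDC2" and always get the currently blessed versioned data set, without hard-coding the version.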
Hello, since you seem to be looking around, I feel like I have to mention DIRAC (initially developed by LHCb): https://github.com/DIRACGrid/DIRAC
DIRAC is mostly known for its workload management system, but it actually has a pretty good data management system today: in a few words, it's a file catalog with metadata that handles replicas, metadata-based datasets, and automated actions.
@JoanneBogart gave a talk at a computing meeting and proposed starting a new project but using GCR facilities and concepts where possible. This would reuse our code but give us control over the other new things. She identified this work breakdown:
I put together a doc with some example UI behaviour here: https://docs.google.com/document/d/1nwwG_rpsR65kFINMM_ERqlBUub5igC4snYHABoaLBWg/edit?usp=sharing
After learning about the needs from the discussion at the CO telecon on Dec 1, I did some more searching and found a tool called "DataLad", which seems very similar to what we are trying to build.
See this short example of DataLad and its philosophy.
I think it's worth taking a closer look at DataLad. Maybe we can use DataLad to do the prototype test that we discussed. Even if in the end we decide not to use DataLad, we may still want to use git-annex to implement what we want.
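For those unfamiliar with git-annex (which DataLad builds on), the core idea is content-addressed storage: the file's content is stored once under a checksum-derived key, and the file in the working tree becomes a lightweight symlink to it, so large data can be versioned without bloating the git history. The snippet below is a toy illustration of that idea under those assumptions; it is not the actual git-annex object layout or the DataLad API:

```python
# Toy content-addressed store in the spirit of git-annex: content lives in a
# store keyed by its SHA-256, and the original path becomes a symlink.
import hashlib
import os
import tempfile

def annex_add(path, store_dir):
    """Move a file's content into the store (keyed by checksum), symlink it back."""
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    stored = os.path.join(store_dir, key)
    if os.path.exists(stored):
        os.remove(path)           # duplicate content: reuse the stored copy
    else:
        os.replace(path, stored)  # content is stored exactly once
    os.symlink(stored, path)      # working tree keeps only a pointer
    return key

tmp = tempfile.mkdtemp()
store = os.path.join(tmp, "store")
os.mkdir(store)

data = os.path.join(tmp, "catalog.txt")
with open(data, "w") as f:
    f.write("ra,dec\n0.1,0.2\n")

key = annex_add(data, store)
print(os.path.islink(data))       # the file is now a symlink into the store
print(open(data).read())          # reading through the symlink still works
```

git-annex adds the parts this sketch omits: tracking which remotes hold which keys, fetching content on demand (`git annex get`), and recording all of it in git, which is exactly the replica/multi-site bookkeeping being discussed here.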
A prototype version of a library using DataLad is now here: http://github.com/LSSTDESC/desc-data-lad
Many thanks all for the discussion on this RFC - I'll close this now since we've moved to the next phase.
@joezuntz Can you give DESC members read access to the repo?
@yymao sorry it was a typo - fixed now: https://github.com/LSSTDESC/desc-data-lad
Summary
DESC will generate lots of intermediate and output data from its analysis working groups and pipelines. We'd like some kind of system for managing this data.
Description
We have a robust system for managing DESC simulation data, catalogs, and input data from DM. We need a similarly good system for looking after intermediate and output products generated within the collaboration. These include:
Whatever system we adopt needs to let us:
Joe shared some slides describing some straw-man ideas on structure, listing what he would do if he had to start from scratch, but obviously it would be better to use an existing solution or at least start from one, if that's possible.
There are various technologies, created either within the collaboration or outside it, that could be extended to encompass some of the above goals or used outright. Here are a few that have been mentioned.
So what other requirements do we have on a tool to do this, and/or which solutions should we investigate?