LSSTDESC / RequestForComments

[RFC] A system for managing our intermediate and output analysis data #13

Closed: joezuntz closed this 2 years ago

joezuntz commented 3 years ago

Summary

DESC will generate lots of intermediate and output data from its analysis working groups and pipelines. We'd like some kind of system for managing this data.

Description

We have a robust system for managing DESC simulation data, catalogs, and input data from DM. We need a similarly good system for looking after intermediate and output products generated within the collaboration. These include:

Whatever system we adopt needs to let us:

Joe shared some slides describing some straw-man ideas on structure, listing what he would do if he had to start from scratch, but obviously it would be better to use an existing solution or at least start from one, if that's possible.

Various technologies, created either inside or outside the collaboration, could be extended to cover some of the above goals or used outright. Here are a few that have been mentioned.

So what other requirements do we have on a tool to do this, and/or which solutions should we investigate?

JoanneBogart commented 3 years ago

There is a significant overlap in functionality between GCRCatalogs and the proposed new facility, but each also does things the other does not. Where they do overlap (that includes at least registration and look-up) they should be smoothly integrated. I suspect that will entail enhancements to GCRCatalogs and possibly constraints on how the new facility handles these functions.

yymao commented 3 years ago

As @JoanneBogart said, the "catalog register" part of GCRCatalogs has many of the functionalities mentioned in the slides. It uses a set of yaml files to keep track of data sets. In the yaml there are some reserved keywords, but otherwise it's very flexible and can store any metadata. It allows each data set to have aliases too. Joanne and I have also implemented some minimal multi-site support.
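To make the register idea concrete, here is a minimal sketch of an alias-aware look-up. It is an illustration only and is not GCRCatalogs code: the entry names, the `alias` and `path` keys, and the `resolve` and `metadata` helpers are all hypothetical, and a plain dict stands in for the yaml files so the example needs no extra packages.

```python
# Hypothetical in-memory stand-in for a yaml-backed catalog register.
# None of these names are taken from GCRCatalogs itself.

# Assumed reserved keywords, chosen for illustration only.
RESERVED_KEYS = {"alias", "path"}

# Each top-level key plays the role of one yaml file describing one data set.
REGISTER = {
    "shear_catalog_v1": {
        "path": "/data/shear/v1",         # hypothetical location
        "description": "example entry",   # free-form metadata
        "creator": "some_pipeline",       # free-form metadata
    },
    "shear_catalog_latest": {
        "alias": "shear_catalog_v1",      # points at another entry
    },
}


def resolve(name, register=REGISTER, max_depth=10):
    """Follow alias links until an entry without an alias is reached."""
    for _ in range(max_depth):
        entry = register[name]
        if "alias" not in entry:
            return name, entry
        name = entry["alias"]
    raise ValueError(f"alias chain too deep or cyclic at: {name}")


def metadata(name, register=REGISTER):
    """Return only the free-form (non-reserved) metadata of a data set."""
    _, entry = resolve(name, register)
    return {k: v for k, v in entry.items() if k not in RESERVED_KEYS}
```

With this sketch, `resolve("shear_catalog_latest")` follows the alias to the `shear_catalog_v1` entry, and `metadata(...)` returns the free-form keys without the reserved ones.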

We have talked about splitting out the "catalog register" part of GCRCatalogs as a standalone package/service, but it wasn't a priority so we never got to do it. We can revisit that option.

Maybe Joanne and I can prepare a presentation where we walk through these less-advertised features of GCRCatalogs so that people can make an informed decision on whether it's something that we want to keep using.

bregeon commented 3 years ago

Hello, since you seem to be looking around, I feel like I have to mention DIRAC (initially developed by LHCb): https://github.com/DIRACGrid/DIRAC

DIRAC is mostly known for its workload management system but actually has a pretty good data management system today: in a few words, it's a file catalog with metadata that handles replicas, metadata-based datasets and automated actions.
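As a rough picture of what "metadata-based datasets" means, the sketch below defines a dataset as the result of a metadata query over a file catalog, with each file carrying a list of replicas. This is not DIRAC's API: `CatalogEntry`, `select`, the site names and the paths are all hypothetical and only illustrate the concept.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One logical file: a name, its metadata, and where copies live."""
    logical_name: str
    metadata: dict
    # Maps a (hypothetical) site name to the physical path of a replica.
    replicas: dict = field(default_factory=dict)


def select(catalog, **query):
    """Return the entries whose metadata matches every key/value in query.

    In this sketch a "dataset" is simply the result of such a query.
    """
    return [
        entry
        for entry in catalog
        if all(entry.metadata.get(key) == value for key, value in query.items())
    ]


# Hypothetical catalog contents, for illustration only.
CATALOG = [
    CatalogEntry(
        "/desc/wl/maps_001.fits",
        {"group": "WL", "kind": "map"},
        {"site_a": "/site_a/store/maps_001.fits",
         "site_b": "/site_b/store/maps_001.fits"},
    ),
    CatalogEntry(
        "/desc/wl/spectra_001.fits",
        {"group": "WL", "kind": "spectrum"},
        {"site_a": "/site_a/store/spectra_001.fits"},
    ),
    CatalogEntry(
        "/desc/lss/maps_001.fits",
        {"group": "LSS", "kind": "map"},
        {"site_b": "/site_b/store/maps_001.fits"},
    ),
]

wl_maps = select(CATALOG, group="WL", kind="map")
```

Here `wl_maps` contains the single WL map entry, and its `replicas` dict lists the two sites that hold a copy.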

joezuntz commented 2 years ago

@JoanneBogart gave a talk at a computing meeting and proposed starting a new project but using GCR facilities and concepts where possible. This would reuse our code but give us control over the other new things. She identified this work breakdown:

joezuntz commented 2 years ago

I put together a doc with some example UI behaviour here: https://docs.google.com/document/d/1nwwG_rpsR65kFINMM_ERqlBUub5igC4snYHABoaLBWg/edit?usp=sharing

yymao commented 2 years ago

After learning about the needs raised in the discussion at the CO telecon on Dec 1, I did some more searching and found a tool called "DataLad", which seems very similar to what we are trying to do.

See this short example of DataLad and its philosophy.

I think it's worth taking a closer look at DataLad. Maybe we can use DataLad to do the prototype test that we discussed. Even if in the end we decide not to use DataLad, we may still want to use git-annex to implement what we want.

joezuntz commented 2 years ago

A prototype version of a library using DataLad is now here: http://github.com/LSSTDESC/desc-data-lad

Many thanks all for the discussion on this RFC - I'll close this now since we've moved to the next phase.

yymao commented 2 years ago

@joezuntz Can you give DESC members read access to the repo?

joezuntz commented 2 years ago

@yymao sorry it was a typo - fixed now: https://github.com/LSSTDESC/desc-data-lad