joezuntz closed this issue 2 years ago
There is a significant overlap in functionality between GCRCatalogs and the proposed new facility, but each also does things the other does not. Where they do overlap (that includes at least registration and look-up) they should be smoothly integrated. I suspect that will entail enhancements to GCRCatalogs and possibly constraints on how the new facility handles these functions.
As @JoanneBogart said, the "catalog register" part of GCRCatalogs has many of the functionalities mentioned in the slides. It uses a set of yaml files to keep track of data sets. In the yaml there are some reserved keywords, but otherwise it's very flexible and can store any metadata. It also allows each data set to have aliases. Joanne and I have implemented some minimal multi-site support as well.
We have talked about splitting out the "catalog register" part of GCRCatalogs as a standalone package/service, but it wasn't a priority so we never got around to it. We can revisit that option.
Maybe Joanne and I can prepare a presentation where we walk through these less-advertised features of GCRCatalogs, so that people can make an informed decision on whether it's something we want to keep using.
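To make the discussion above more concrete, here is a rough sketch of what a yaml-based "catalog register" with reserved keywords, free-form metadata, and aliases looks like. The class and key names below are hypothetical illustrations of the idea, not the actual GCRCatalogs code or config schema:

```python
# Toy sketch of a catalog register: each entry stands in for a parsed yaml
# config file, with a couple of reserved keywords plus arbitrary metadata.
# All names here are illustrative, not the real GCRCatalogs API.

class CatalogRegister:
    def __init__(self):
        self._configs = {}   # catalog name -> config dict
        self._aliases = {}   # alias -> canonical catalog name

    def register(self, name, config):
        """Store a config; the 'alias' key is treated as a reserved keyword."""
        self._configs[name] = config
        for alias in config.get("alias", []):
            self._aliases[alias] = name

    def resolve(self, name):
        """Map an alias to its canonical name (identity if not an alias)."""
        return self._aliases.get(name, name)

    def get_config(self, name):
        return self._configs[self.resolve(name)]

    def available(self):
        return sorted(self._configs)


register = CatalogRegister()
register.register("cosmoDC2_v1.1.4", {
    "reader_class": "CosmoDC2Catalog",   # reserved keyword (illustrative)
    "alias": ["cosmoDC2"],               # reserved keyword (illustrative)
    "site": "nersc",                     # arbitrary, free-form metadata
})

print(register.resolve("cosmoDC2"))           # canonical versioned name
print(register.get_config("cosmoDC2")["site"])
```

The alias mechanism is the key piece: analysis code can ask for "cosmoDC2" and always get the currently blessed versioned data set, without hard-coding the version.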
Hello, since you seem to be looking around, I feel like I have to mention DIRAC (initially developed by LHCb): https://github.com/DIRACGrid/DIRAC
DIRAC is mostly known for its workload management system, but it actually has a pretty good data management system today: in a few words, it's a file catalog with metadata that handles replicas, metadata-based datasets, and automated actions.
@JoanneBogart gave a talk at a computing meeting and proposed starting a new project but using GCR facilities and concepts where possible. This would reuse our code but give us control over the other new things. She identified this work breakdown:
I put together a doc with some example UI behaviour here: https://docs.google.com/document/d/1nwwG_rpsR65kFINMM_ERqlBUub5igC4snYHABoaLBWg/edit?usp=sharing
After learning about the needs from the discussion at the CO telecon on Dec 1, I did some more searching and found a tool called "DataLad", which seems very similar to what we are trying to build.
See this short example of DataLad and its philosophy.
I think it's worth taking a closer look at DataLad. Maybe we can use DataLad to do the prototype test that we discussed. Even if in the end we decide not to use DataLad, we may still want to use git-annex to implement what we want.
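For those unfamiliar with git-annex (which DataLad builds on), the core idea is content-addressed storage: the file's content is stored once under a checksum-derived key, and the file in the working tree becomes a lightweight symlink to it, so large data can be versioned without bloating the git history. The snippet below is a toy illustration of that idea under those assumptions; it is not the actual git-annex object layout or the DataLad API:

```python
# Toy content-addressed store in the spirit of git-annex: content lives in a
# store keyed by its SHA-256, and the original path becomes a symlink.
import hashlib
import os
import tempfile

def annex_add(path, store_dir):
    """Move a file's content into the store (keyed by checksum), symlink it back."""
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    stored = os.path.join(store_dir, key)
    if os.path.exists(stored):
        os.remove(path)           # duplicate content: reuse the stored copy
    else:
        os.replace(path, stored)  # content is stored exactly once
    os.symlink(stored, path)      # working tree keeps only a pointer
    return key

tmp = tempfile.mkdtemp()
store = os.path.join(tmp, "store")
os.mkdir(store)

data = os.path.join(tmp, "catalog.txt")
with open(data, "w") as f:
    f.write("ra,dec\n0.1,0.2\n")

key = annex_add(data, store)
print(os.path.islink(data))       # the file is now a symlink into the store
print(open(data).read())          # reading through the symlink still works
```

git-annex adds the parts this sketch omits: tracking which remotes hold which keys, fetching content on demand (`git annex get`), and recording all of it in git, which is exactly the replica/multi-site bookkeeping being discussed here.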
A prototype version of a library using DataLad is now here: http://github.com/LSSTDESC/desc-data-lad
Many thanks all for the discussion on this RFC - I'll close this now since we've moved to the next phase.
@joezuntz Can you give DESC members read access to the repo?
@yymao sorry it was a typo - fixed now: https://github.com/LSSTDESC/desc-data-lad
Summary
DESC will generate lots of intermediate and output data from its analysis working groups and pipelines. We'd like some kind of system for managing this data.
Description
We have a robust system for managing DESC simulation data, catalogs, and input data from DM. We need a similarly good system for looking after intermediate and output products generated within the collaboration. These include:
Whatever system we adopt needs to let us:
Joe shared some slides describing some straw-man ideas on structure, listing what he would do if he had to start from scratch, but obviously it would be better to use an existing solution or at least start from one, if that's possible.
There are various technologies, created either within the collaboration or outside it, that could be extended to encompass some of the above goals or used outright. Here are a few that have been mentioned.
So what other requirements do we have on a tool to do this, and/or which solutions should we investigate?