LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Data Release Naming Conventions and Handling of Processing Data Transfers #413

Open heather999 opened 3 years ago

heather999 commented 3 years ago

We need to write up concrete steps to handle the naming and versioning of our data releases, which also includes some recommendations for dealing with the data transfers between CC and NERSC. Looking for comments.

An example of the issue using Run2.2i DR6

The CC processing area is named run2.2i-coadd-wfd-dr6-v1 and at NERSC we reused this name for our DR6 v1 release. Processing continues at CC and updated data must be copied to NERSC, but we do not want to update the dataset that has already been released.

Recommendations

Review of Release Naming Conventions

As discussed on Slack, our release naming conventions should communicate the run (2.2i or 3.1i), the depth (DR1, DR2, etc.), and the cadence (WFD or DDF).

Examples

For the upcoming Run2.2i DR2 v1 release, a snapshot will be taken at NERSC, resulting in butler data directories:

Additional DR6 releases will use a similar naming convention; however, as long as this processing includes both WFD and DDF visits, no cadence will be indicated in the name:

yymao commented 3 years ago

The general principle sounds good to me. The proposed name run2.2i-wfd-dr2-v1 has a different order for wfd and dr2 from what we did in GCRCatalogs (see https://github.com/LSSTDESC/gcr-catalogs/releases/tag/v1.1.0).

Now that I think about it, putting wfd first seems to make sense. But when we made the decision for GCRCatalogs we somehow went with run2.2i_dr2_wfd...

katrinheitmann commented 3 years ago

I think we should stick with what we had before for simplicity. dr2 is more general than wfd, so I think that's why it's first, though I think the order really doesn't matter much. So for historic reasons, I would choose what we had before.

johannct commented 3 years ago

My take on this: I agree with Katrin, let's keep it simple. I do not think that the order brings any added value, so this is historical.

I am not sure I follow the current train of thought on the processing naming convention and how it translates into snapshots, released areas, and their naming convention.

Maybe I misunderstand some of what is written above, in which case sorry for the noise. Maybe there is a sense that we need to keep a one-to-one relationship between processing area and released area. I think that very often this will be naturally guaranteed, as the processing area will be different for any new processing effort. But in the case of 4852 such a one-to-one requirement seems overkill to me. Caveat: I am not sure how gen3 processing is going to modify my seasoned view of how all this plays out.

JoanneBogart commented 3 years ago

I basically agree with Johann but would like some clarification of details.

johannct commented 3 years ago

@JoanneBogart

Is it fair to say "naming convention for processing is arbitrary" means "has no particular connection to naming of releases"? There still will be conventions for processing which are suited to the task at hand.

Most important is indeed that there is no reason to think of it as something understandable by people outside of processing, especially the users of the releases. As for conventions relevant for processing, with gen2 there really is nothing built in, so I just built rerun names out of the blue; it made sense to me, not necessarily to others. With gen3 I would not be surprised if this needs revisiting and more forethought.

If processing areas are exactly mirrored at NERSC, we at least want to avoid conflicts or confusion with releases, which may put restrictions on naming of top-level processing directories

I am not sure I understand your point. I hope that when we speak of releases it is clear that we are not speaking about internal processing directories. But we can make sure that the names are different, in any case.

In the first example concerning 4852, should it read "I fixed 4852 because it had failed, I did not create a new rerun name because it would have been overkill, but that does not mean that at NERSC the copy should not move to a different directory."?

Indeed, tricked by a double negation on the first day back to work. Bummer :)

heather999 commented 3 years ago

Going in order of comments, starting with Yao & Katrin. Ok - we'll go with run2.2i-dr2-wfd-v1 and use something along those lines for other releases. Moving on to Johann & Joanne: Agreed processing areas should not be released. For Run2.2i, we have the unfortunate situation that the rerun area at NERSC contains both the releases and the processings. In the future, that should be avoided, and perhaps I can manage to reorganize the directories at NERSC to create separate processing and release areas. This mixing of directories is part of the reason I would rather the processing directory names not appear to be some release version, but I feel it is also generally confusing. The v1 on run2.2i-coadd-wfd-dr6-v1, which at CC is a processing area, is misleading.
For the specific case of DR6, given that we have made a release, I have been reluctant to rename the released NERSC directory from run2.2i-coadd-wfd-dr6-v1. Upon further consideration, maybe it is time to rename the directories to the form run2.2i-dr6-v1 and update desc-dc2-dm-data; then run2.2i-coadd-wfd-dr6-v1 at NERSC would again be mirrored to the processing area at CC.
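To make the agreed order (run, then data release, then optional cadence, then version) concrete, here is a minimal sketch of a helper that builds and parses release names such as run2.2i-dr2-wfd-v1 and run2.2i-dr6-v1. The names `ReleaseName`, `format_release`, and `parse_release` are hypothetical and not part of any DESC package; the regular expression only reflects the examples quoted in this thread.

```python
import re
from typing import NamedTuple, Optional


class ReleaseName(NamedTuple):
    """Hypothetical container for the pieces of a release name."""
    run: str                 # e.g. "2.2i"
    dr: int                  # data release number, e.g. 2 for DR2
    cadence: Optional[str]   # "wfd", "ddf", or None when both are included
    version: int             # release version, e.g. 1 for v1


# Assumed pattern, inferred from names such as run2.2i-dr2-wfd-v1.
_PATTERN = re.compile(
    r"^run(?P<run>\d+\.\d+[a-z])"
    r"-dr(?P<dr>\d+)"
    r"(?:-(?P<cadence>wfd|ddf))?"
    r"-v(?P<version>\d+)$"
)


def format_release(name: ReleaseName) -> str:
    """Build a release directory name; the cadence is omitted when None."""
    parts = [f"run{name.run}", f"dr{name.dr}"]
    if name.cadence:
        parts.append(name.cadence)
    parts.append(f"v{name.version}")
    return "-".join(parts)


def parse_release(text: str) -> ReleaseName:
    """Parse a release name, rejecting names that do not follow the pattern."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a release name: {text!r}")
    return ReleaseName(
        run=match["run"],
        dr=int(match["dr"]),
        cadence=match["cadence"],
        version=int(match["version"]),
    )
```

With this pattern a processing-area name such as run2.2i-coadd-wfd-dr6-v1 is rejected, which is one way to keep processing names from being mistaken for releases.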

Snapshots created for releases are copies of a subset of the processing area, and here we need to be very careful to define that subset and when it is appropriate to bump to a new version. For the specific case of DR6 tract 4852 patch 1,5, that should result in a separate released version (v2) of the object catalogs. Concerning the butler rerun area, it would be incorrect to simply update the 4852 1,5 files (even if this was initially a processing failure) without marking this as an updated, versioned release. Due to disk space concerns, maybe we store v1 to tape (or even just store the original version of the updated files to tape, so v1 could be recreated if that is ever needed). We are releasing DR2 v1 now without metacal (which is the plan); when a metacal processing becomes available, I still think that will result in a DR2 v2 release. Whatever the reason for a change in the released data, whether it is updating existing files or adding new ones, I think that deserves a bump in version.

johannct commented 3 years ago

OK, so we disagree on several points here. For 4852, imho there is no difference between reprocessing it and rolling back a stream due to computing failure. And I do not think that you advocate bumping the version for each and every random rollback that occurs during processing... At least for the gen2 system, it would have been hell. I do not want to argue forever though, so whatever is ok with the majority is ok with me.

heather999 commented 3 years ago

My thoughts on versioning are strictly in regard to releases. If we release something and name it v1, then update the data in some way later and release that, it is v2. We took a long time to release DR6 initially and during that time there were of course rollbacks in the processing, but none of that mattered from a versioning standpoint, because we had not released anything yet.

yymao commented 3 years ago

I think maybe we need a clearer distinction between pre-releases and releases, given that some of the validation tests require the data to propagate all the way down to GCRCatalogs.

If we distinguish them, then we can say it's ok to update the content in place for pre-releases, but a snapshot copy must be made for releases.

heather999 commented 3 years ago

For the object catalogs, we have pre-release areas, but have not done that for the butler/rerun areas... we could by creating a pre-release snapshot for validation and then renaming when a release is ready. I think that's fine. I don't think we can use the processing area for our pre-release validation, necessarily. For the DR2 release and beyond.. I could create a release area that is separate from the processing area at NERSC. That area could contain pre-releases, which will be updated as needed based on validation testing.

One immediate thing I would like to reach consensus on: renaming the DR6 v1 butler area at NERSC, currently run2.2i-coadd-wfd-dr6-v1, to run2.2i-dr6-v1. It could live under the new release area mentioned above.

yymao commented 3 years ago

For the DR6 object catalogs, previously we have been treating pre-releases like releases, i.e., a new copy is made when a new pre-release is made. I think that's a bit overkill and results in way too many deprecated catalogs in GCRCatalogs in the end.

I am ok with keeping the pre-releases in the release area. The main point is just that if the processing area is updated and we want to propagate the update to the pre-release, we can just overwrite the existing pre-release instead of making a new one. For releases we will never overwrite, of course.
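The overwrite rule can be sketched as a small copy helper: a pre-release snapshot may replace an existing directory, while a release may only be written to a new one. This is an illustration under stated assumptions, not an existing tool; `publish_snapshot` and its `prerelease` flag are hypothetical, and the caller is trusted to say which kind of snapshot is being made.

```python
import shutil
from pathlib import Path


def publish_snapshot(source: Path, target: Path, prerelease: bool) -> None:
    """Copy `source` to `target`.

    Pre-releases may overwrite an existing target in place.
    Releases never overwrite: an existing target means the version
    in the directory name has to be bumped instead.
    """
    if target.exists():
        if not prerelease:
            raise FileExistsError(
                f"{target} already exists; releases are never overwritten, "
                "bump the version instead"
            )
        # Pre-release: discard the old copy before taking a new snapshot.
        shutil.rmtree(target)
    shutil.copytree(source, target)
```

A real snapshot would copy only the agreed subset of the processing area; the sketch copies the whole source tree for brevity.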

I don't really have opinions regarding renaming dr6. What you proposed sounds good to me. But at the very beginning you said that we are using run2.2i-coadd-dr6-processing for syncing with CC. Are you making a new proposal that we make a copy with the name run2.2i-dr6-v1 in the release area, and use run2.2i-coadd-wfd-dr6-v1 for syncing with CC processing? But either is fine with me actually.

johannct commented 3 years ago

If I look at https://github.com/LSSTDESC/desc-dc2-dm-data/blob/master/desc_dc2_dm_data/repos.py the path at NERSC for DR6 is /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1. In my mind, mirroring means that everything below a root path is strictly identical. Here the rootpath is defined as /global/cfs/cdirs/lsst/production/DC2_ImSim/Run2.2i/desc_dm_drp. For CC, the path is /sps/lssttest/dataproducts/desc/DC2/Run2.2i/v19.0.0-v1/rerun/run2.2i-coadd-wfd-dr6-v1 and its rootpath is /sps/lssttest/dataproducts/desc/DC2/Run2.2i. Everything below each rootpath is strictly identical between the two sites, be it directory hierarchy or content (we relax the strict identity for some intermediary products). This seems a sound situation to me.
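As an illustration of this definition of mirroring, the sketch below compares the relative paths found under two rootpaths and reports what exists on only one side. It is an assumption-laden example: `relative_entries` and `mirror_diff` are hypothetical names, only directory and file names are compared (not content), and since NERSC and CC do not share a filesystem, in practice each site would have to produce its own listing for comparison.

```python
import os
from pathlib import Path
from typing import Set, Tuple


def relative_entries(root: Path) -> Set[str]:
    """Return every directory and file below `root`, relative to `root`."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.add(os.path.relpath(full, root))
    return entries


def mirror_diff(root_a: Path, root_b: Path) -> Tuple[Set[str], Set[str]]:
    """Return (entries only under root_a, entries only under root_b).

    Two empty sets mean the hierarchies below the two rootpaths match.
    """
    a = relative_entries(root_a)
    b = relative_entries(root_b)
    return a - b, b - a
```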

heather999 commented 3 years ago

To answer Yao's question, yes, I was making a new proposal... in the interest of reusing the existing naming convention at CC.

Concerning Johann's comment: As far as I recall, desc-dc2-dm-data doesn't have a rootpath like GCRCatalogs, but does utilize SITE, so we have a set of repos for both NERSC and CC. desc-dc2-dm-data is meant to point to release data rather than processing. We could continue to maintain processing and release data side by side in the same rerun area, but then I think we have to be much more clear about how we manage the naming of processing versus releases. The processing areas at CC and NERSC should continue to live under desc_dm_drp and be mirrored and really should have nothing to do with desc-dc2-dm-data. Releases should similarly be mirrors at NERSC and CC but as of this moment, we are just defining this.. and we would update desc-dc2-dm-data accordingly to point to released data. For now, some of that may live under desc_dm_drp, but I think we want to move away from that. We could have something like:

Run2.2i
 |__ releases
       |__ 19.0.0
              |__ rerun

heather999 commented 3 years ago

Chatted briefly with Johann offline, and we have an updated proposal. The "releases", which include butler accessible data, would reside under shared, utilizing names that are identical to the naming convention used for the object & dpdd catalogs (which we should review given all the discussions about WFD, DDF, etc). So we were thinking about introducing a new area at NERSC (and ultimately CC): /global/cfs/cdirs/lsst/shared/DC2-prod/Run2.2i/butler. That would look something like:

19.0.0
|_ CALIB, _mapper, ref_cats, raw, etc (everything the butler needs to make sense of the data - these could be symlinks)
|_ rerun
       |_ run2.2i-dr2-wfd-v1

Asking the Data Access team (@JoanneBogart & @yymao) if they feel it is ok to include butler accessible data in the shared area? Thinking ahead to Gen3.. this might mean including files accessible from Postgres... is that appropriate?

yymao commented 3 years ago

I think that's fine, and a good proposal in fact. The shared area is designed to be mirrored across DESC sites (currently only CC and NERSC, of course), so moving the release area into shared makes sense to me.

Did you have specific concerns regarding including butler/Postgres accessible data in shared? I couldn't think of any immediately.

One note is that all the symlinks in **/butler should be internal (i.e., not linked outside of shared, preferably not linked outside of **/butler)
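A minimal sketch of how that symlink rule could be checked: walk the butler directory and list every symlink whose resolved target falls outside it. The function name `external_symlinks` is hypothetical, and the check uses the butler directory itself as the boundary (the stricter of the two options mentioned above).

```python
import os
from pathlib import Path
from typing import List


def external_symlinks(butler_root: Path) -> List[Path]:
    """Return symlinks under `butler_root` that point outside of it."""
    root = butler_root.resolve()
    offenders = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in `dirnames`, so they are checked here.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                continue
            target = path.resolve()
            if target != root and root not in target.parents:
                offenders.append(path)
    return offenders
```

An empty list means every symlink stays inside the butler directory, so the area can be mirrored to another site without dangling links.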

heather999 commented 3 years ago

Great - that all makes sense. Not today, but in the next couple of days, I will start setting this up at NERSC.

JoanneBogart commented 3 years ago

Putting butler-accessible data under our shared directory sounds good to me. I don't have a problem with including files accessible via Postgres. People just have to understand that to access them in the recommended fashion they have to port the Postgres database as well. (Is that something we should be thinking about? Will they have to re-ingest or is it possible to dump the db and reconstitute it elsewhere?)

heather999 commented 3 years ago

I would imagine it is possible to dump the db and set it up elsewhere. Definitely something we should try to see how it goes. I could imagine other sites may want to mirror and we probably do want to support that. Individual users are another question, where I would assume they may only want access to specific subsets of data... but are they constrained to work at NERSC or CC to access the butler data? Not sure... I guess right now - even without Postgres, only advanced users would do otherwise and extract files to their own machines.