desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

desi_archive_tilenight make checksums #1762

Open sbailey opened 2 years ago

sbailey commented 2 years ago

When desi_archive_tilenight creates each tiles/archive/TILEID/ARCHIVEDATE directory, it should also create checksums for that directory.

@weaverba137 please specify how checksums are created for productions so that we use a consistent method (checksum algorithm, filename, ...)
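Until the convention is specified, here is a minimal sketch of what "create checksums for that directory" could look like. The algorithm (sha256), the output filename `checksums.sha256sum`, and the function names are all placeholders chosen for illustration, not the DESI production convention.

```python
import hashlib
from pathlib import Path


def sha256_of(path, blocksize=1 << 20):
    """Return the sha256 hex digest of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as fx:
        for block in iter(lambda: fx.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()


def write_checksums(dirpath, filename="checksums.sha256sum"):
    """Write one checksum line per regular file under dirpath.

    ASSUMPTIONS: sha256 and the output filename are placeholders;
    the real algorithm and name are what this issue asks to define.
    """
    dirpath = Path(dirpath)
    outfile = dirpath / filename
    lines = []
    for path in sorted(dirpath.rglob("*")):
        # Skip directories, symlinks, and the checksum file itself
        if path.is_symlink() or not path.is_file() or path == outfile:
            continue
        relpath = path.relative_to(dirpath).as_posix()
        # "digest  relative/path", the layout the sha256sum tool emits
        lines.append(f"{sha256_of(path)}  {relpath}\n")
    outfile.write_text("".join(lines))
    return outfile
```

Paths are written relative to the directory itself, so the file should stay valid if the directory is later moved or reached through a symlink.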

Related is #1644 about cross-production tile archiving. Nominally this form of archiving would create a link daily/tiles/archive/TILEID/ARCHIVEDATE -> ../../../../guadalupe/tiles/cumulative/TILEID/LASTNIGHT. Ideally the guadalupe production would already have a checksum file in tiles/cumulative/TILEID/LASTNIGHT in the same form that we would have put into daily/tiles/archive/TILEID/ARCHIVEDATE if it weren't a link. If productions like guadalupe organize where they put the checksum differently, let's define that and discuss options.
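For concreteness, the relative link described above could be built roughly as follows. This is only a sketch: the function name `link_archive_to_prod` and its arguments are hypothetical, and it assumes the productions sit side by side under one root directory.

```python
import os


def link_archive_to_prod(root, tileid, archivedate, lastnight,
                         prod="guadalupe"):
    """Link daily/tiles/archive/TILEID/ARCHIVEDATE to a frozen production.

    ASSUMPTION: `root` contains both daily/ and the production directory;
    this helper is illustrative, not part of desispec.
    """
    target = os.path.join(root, prod, "tiles", "cumulative",
                          str(tileid), str(lastnight))
    link = os.path.join(root, "daily", "tiles", "archive",
                        str(tileid), str(archivedate))
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    if os.path.lexists(link):
        # Never overwrite an existing archive entry
        raise FileExistsError(link)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # The link target is relative to the directory holding the link
    rel = os.path.relpath(target, start=os.path.dirname(link))
    os.symlink(rel, link)
    return rel
```

Using a relative target keeps the link valid if the whole tree is relocated or mirrored.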

weaverba137 commented 2 years ago

I note for the record that currently there are no symlinks in daily/tiles/archive. What is the level of readiness for addressing #1644 versus this issue? Specifically, guadalupe checksums are being created in the immediate future (~days) and therefore:

This also suggests that the layout of the tiles/archive/TILEID/ARCHIVEDATE directory is the same as or very similar to the layout of a SPECPROD/tiles/cumulative/TILEID/LASTNIGHT directory. Is this a reasonably safe assumption?

By "reasonably safe": e.g. there could be differences in the number or types of files in certain cases, but there will not be differences in subdirectories. In this case, there will be a logs/ subdirectory but there shouldn't be any other subdirectories.
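That assumption can be checked directly before checksumming a directory. The sketch below uses a hypothetical helper name and takes the allowed set (`logs` only) from the description above.

```python
import os


def unexpected_subdirs(tiledir, allowed=("logs",)):
    """Return names of subdirectories of tiledir that are not in `allowed`.

    Hypothetical helper: an empty list means the layout matches the
    "only a logs/ subdirectory" assumption.
    """
    with os.scandir(tiledir) as entries:
        return sorted(
            entry.name for entry in entries
            if entry.is_dir(follow_symlinks=False)
            and entry.name not in allowed
        )
```

A checksum script could warn or stop when this returns a non-empty list, rather than silently checksumming an unexpected layout.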

sbailey commented 2 years ago

No one is actively working on #1644 (cross prod archiving), so it can wait for guadalupe checksumming.

We do not want to add checksum files to nights that will be symlinked to guadalupe anyway.

Clarifying: cross production archiving will symlink daily/tiles/archive/TILEID/ARCHIVEDATE to a guadalupe/tiles/cumulative/TILEID/LASTNIGHT directory; we will not be creating new guadalupe/tiles/archive/ directories. In other words, the archiving process is a way of freezing a cumulative/TILEID/LASTNIGHT directory, either by moving it to an archive directory (daily) or by linking to a guaranteed frozen copy (e.g. guadalupe). So I think you can proceed with guadalupe checksums, unless I am misunderstanding the concern.

> This also suggests that the layout of the tiles/archive/TILEID/ARCHIVEDATE directory is the same as or very similar to the layout of a SPECPROD/tiles/cumulative/TILEID/LASTNIGHT directory. Is this a reasonably safe assumption?

Yes, they are identical in structure. In the normal archiving case, tiles/archive/TILEID/ARCHIVEDATE is a moved copy of files that were originally in tiles/cumulative/TILEID/LASTNIGHT, and there is a symlink left behind in tiles/cumulative/TILEID/LASTNIGHT to the new archived location. In the case of cross production archiving, it will link directly to a tiles/cumulative/TILEID/LASTNIGHT directory. So they are by construction the same structure.
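The normal (same-production) case described above, a move followed by a symlink left behind, can be sketched like this. It is not the actual desi_archive_tilenight code; the function name and arguments are made up for illustration.

```python
import os
import shutil


def archive_tile(tilesdir, tileid, lastnight, archivedate):
    """Move cumulative/TILEID/LASTNIGHT to archive/TILEID/ARCHIVEDATE
    and leave a relative symlink at the old location.

    Illustrative sketch only; `tilesdir` is assumed to be SPECPROD/tiles.
    """
    src = os.path.join(tilesdir, "cumulative", str(tileid), str(lastnight))
    dst = os.path.join(tilesdir, "archive", str(tileid), str(archivedate))
    if os.path.islink(src):
        raise ValueError(f"{src} is a symlink; already archived?")
    if os.path.lexists(dst):
        # Archived directories are frozen; never overwrite one
        raise FileExistsError(dst)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)
    # Leave cumulative/TILEID/LASTNIGHT pointing at the archived copy
    os.symlink(os.path.relpath(dst, start=os.path.dirname(src)), src)
    return dst
```

Because the directory is moved rather than rewritten, its contents and subdirectories are unchanged, which is why the two layouts are identical.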

weaverba137 commented 2 years ago

@sbailey, indeed, it's not a concern with regard to creating guadalupe checksums. To expand on the process a bit:

  1. Tile archiving should create checksums for newly-created ARCHIVEDATE directories.
  2. Some other process will need to create checksums for ARCHIVEDATE directories that already exist.
  3. That other process should skip ARCHIVEDATE directories that will be replaced by symlinks into guadalupe.

sbailey commented 2 years ago

Clarifying item 3:

> 3. That other process should skip ARCHIVEDATE directories that will be replaced by symlinks into guadalupe.

When we re-archive a tile linking to guadalupe, that would get a new ARCHIVEDATE so that we don't break the previous archived version that we promised not to change. i.e. we will not replace existing ARCHIVEDATEs with a link to guadalupe instead. They are archived, frozen, and never supposed to change (except getting their checksums added).

Note that ARCHIVEDATE is the date that we decided to promote a particular processing to archival status for MTL decisions; it is not the same as LASTNIGHT (the last night of data included in that particular cumulative coadd).

weaverba137 commented 2 years ago

Ah, OK. In that case the script to create checksums for pre-existing ARCHIVEDATE should just do so for all of them. Much simpler.
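A rough sketch of such a backfill script, under stated assumptions: sha256 and the filename `checksums.sha256sum` are placeholders until the production convention is defined, and the function name is hypothetical. Skipping symlinked ARCHIVEDATE entries and directories that already have a checksum file are also assumptions of this sketch, added so that a rerun is harmless and nothing is written into another production.

```python
import glob
import hashlib
import os


def backfill_checksums(archive_root, filename="checksums.sha256sum"):
    """Add a checksum file to every existing TILEID/ARCHIVEDATE directory.

    ASSUMPTIONS: algorithm and filename are placeholders; symlinked
    entries and already-checksummed directories are skipped.
    """
    written = []
    pattern = os.path.join(archive_root, "*", "*")  # TILEID/ARCHIVEDATE
    for dirpath in sorted(glob.glob(pattern)):
        if os.path.islink(dirpath) or not os.path.isdir(dirpath):
            continue
        outfile = os.path.join(dirpath, filename)
        if os.path.exists(outfile):
            continue
        lines = []
        for root, dirs, files in os.walk(dirpath):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                path = os.path.join(root, name)
                # Reads the whole file at once; fine for a sketch
                with open(path, "rb") as fx:
                    digest = hashlib.sha256(fx.read()).hexdigest()
                relpath = os.path.relpath(path, dirpath)
                lines.append(f"{digest}  {relpath}\n")
        with open(outfile, "w") as fx:
            fx.writelines(lines)
        written.append(outfile)
    return written
```

Called as `backfill_checksums("daily/tiles/archive")`, it returns the list of checksum files it created; a second run returns an empty list.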