dandi / dandisets

755 Dandisets, 815.8 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

Provide support for zarr (.zarr/.ngff) #127

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago

zarr/ refers to a .zarr or .ngff folder asset; such assets are already entering the staging server and will soon appear on the main one. We should be ready (#126 is a stop-gap measure so we do not "pollute" our datalad dandisets)

Having established a workflow for out-of-band backup of regular assets (based on #103), we will approach the backup of zarrs. Some notes:

jwodder commented 2 years ago

@yarikoptic

"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix

I can't tell what you're trying to say here.

yarikoptic commented 2 years ago

yes, 1-to-1 with the prefix (AKA folder) on S3:

dandi@drogon:~$ s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/ | head
                          DIR  s3://dandiarchive/zarr/020f7130-3a59-4140-b01d-ac2180917b05/
                          DIR  s3://dandiarchive/zarr/02499e55-c945-4af9-a9d8-d9072d94959c/
                          DIR  s3://dandiarchive/zarr/0316a531-decb-4401-99b7-5d15e8c3dcec/
                          DIR  s3://dandiarchive/zarr/031bf698-6917-4294-a086-61a2454e0a07/
...

"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix

I can't tell what you're trying to say here.

pretty much what you asked (and I answered) about above ;-)

jwodder commented 2 years ago

@yarikoptic

yarikoptic commented 2 years ago
  • Currently, when committing changes to backup datasets, the commit timestamp is set to the latest creation date of the assets involved in the commit; however, Zarrs can be modified after they are created, so this could produce misleading results. Should the commit timestamp be based on asset modification dates instead?

For the commit which would create the subdataset for the zarr (if committing separately), it would be worthwhile to use the creation time. If you would not be committing in the dandiset's git repo at the moment of creation, then let's use the datetime of the commit to be committed in that .zarr subdataset. And the commit in the .zarr subdataset should have a datetime corresponding to the latest datetime of a file in that .zarr. I didn't check whether we have that information from the API or whether we are doomed to talk to the S3 API.

Do we have any idea how publishing Dandisets with Zarrs is going to work? Will publishing such a Dandiset produce a version with immutable copies of the Zarrs with different Zarr IDs? Will the IDs of the Zarrs in the draft remain the same or be changed afterwards?

I think this all is still to be decided, so AFAIK publishing of dandisets with zarrs is disabled and we should error out if we run into such a situation. BUT meanwhile, in the case of datalad dandisets, while the bucket is still versioned, we just need to make sure to use versioned URLs to S3.

jwodder commented 2 years ago

@yarikoptic

For the commit which would create the subdataset for the zarr (if committing separately), it would be worthwhile to use the creation time.

Do you mean the initial commit(s) created when Dataset.create() is called?

If you would not be committing in the dandiset's git repo at the moment of creation,

I would not.

then let's use the datetime of the commit to be committed in that .zarr subdataset.

I don't know what you mean by this.

And the commit in the .zarr subdataset should have a datetime corresponding to the latest datetime of a file in that .zarr. I didn't check whether we have that information from the API or whether we are doomed to talk to the S3 API.

Zarr entry timestamps can only be retrieved via S3.

yarikoptic commented 2 years ago

Do you mean the initial commit(s) created when Dataset.create() is called?

yes, since that one would do some commits (e.g. to commit .datalad/config)

then let's use the datetime of the commit to be committed in that .zarr subdataset.

I don't know what you mean by this.

In the dandiset's datalad dataset, which is to commit the changes to the .zarr/ subdataset (if already committed separately), commit using the datetime of the last commit in the .zarr/ subdataset, which would represent the datetime of its change. Per below, maybe it could be just a single call the_dandiset.save(zarr_path, recursive=True) after overloading the datetime to correspond to the zarr modification time; that should produce those two commits (in zarr_path and the_dandiset) with the same datetime.
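Something like the following sketch is what I have in mind (assuming the datalad Python API and that datalad's underlying git calls inherit GIT_AUTHOR_DATE / GIT_COMMITTER_DATE from the environment; the paths, zarr_path, and zarr_mtime values are placeholders, not actual backup-script variables):

```python
import os
from contextlib import contextmanager
from datetime import datetime, timezone

from datalad.api import Dataset

@contextmanager
def commit_datetime(dt):
    """Make git commits created inside this block carry `dt` as their timestamp."""
    stamp = dt.strftime("%Y-%m-%dT%H:%M:%S%z")
    saved = {k: os.environ.get(k) for k in ("GIT_AUTHOR_DATE", "GIT_COMMITTER_DATE")}
    os.environ.update(GIT_AUTHOR_DATE=stamp, GIT_COMMITTER_DATE=stamp)
    try:
        yield
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v

# placeholder values -- in the real script these would come from the archive/S3
the_dandiset = Dataset("/mnt/backup/dandi/dandisets/000XXX")
zarr_path = "sub-XXX/sample.zarr"
zarr_mtime = datetime(2022, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

with commit_datetime(zarr_mtime):
    # one recursive save: commits in both the .zarr subdataset and the dandiset,
    # both stamped with the zarr's modification time
    the_dandiset.save(zarr_path, recursive=True)
```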

And the commit in the .zarr subdataset should have a datetime corresponding to the latest datetime of a file in that .zarr. I didn't check whether we have that information from the API or whether we are doomed to talk to the S3 API.

Zarr entry timestamps can only be retrieved via S3.

oh, that sucks... then I guess we should take the modified time for that entire zarr (not the asset it belongs to) -- do we get that timestamp somewhere?

jwodder commented 2 years ago

@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarr repositories would be created under /mnt/backup/dandi/dandizarrs or /mnt/backup/dandi/dandisets/zarrs and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.

oh, that sucks... then I guess we should take the modified time for that entire zarr (not the asset it belongs to) -- do we get that timestamp somewhere?

We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.

yarikoptic commented 2 years ago

@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarr repositories would be created under /mnt/backup/dandi/dandizarrs or /mnt/backup/dandi/dandisets/zarrs and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.

Right, I forgot that aspect of the design -- we do have all of them under /mnt/backup/dandi/dandizarrs ("mirrored" under https://github.com/dandizarrs), and they are only installed/updated/uninstalled in any particular dandiset. So we do updates in their dandizarrs/{zarr_id} location, then just have to datalad update --how ff-only their installation(s) in the corresponding dandiset (where/when we encountered the zarr), and commit the possibly changed state of that zarr subdataset in the corresponding dandiset.
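Roughly, that dance could look like this sketch (datalad Python API, assuming datalad >= 0.16 for the how="ff-only" option; the zarr_id and paths are just illustrative):

```python
from datalad.api import Dataset

zarr_id = "020f7130-3a59-4140-b01d-ac2180917b05"          # illustrative
dandiset = Dataset("/mnt/backup/dandi/dandisets/000XXX")  # illustrative

# 1. the canonical copy under /mnt/backup/dandi/dandizarrs/{zarr_id}
#    gets synced from the archive/S3 first (not shown here)

# 2. fast-forward the installed copy inside the dandiset to the canonical state
zarr_sub = Dataset(dandiset.pathobj / "sub-XXX" / "sample.zarr")
zarr_sub.update(how="ff-only")

# 3. record the possibly changed subdataset state in the dandiset itself
dandiset.save(zarr_sub.path, message="Update zarr subdataset")
```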

We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.

oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as part of the larger https://github.com/dandi/dandi-archive/issues/937)? then maybe we should also ask to have mtime included too while we're at it?

jwodder commented 2 years ago

@yarikoptic Could you write out a pseudocode sketch or something of how you envision adding a Zarr repo to a Dandiset dataset working? Right now, pre-Zarr, it roughly works like this:

In particular, once the syncing of the Zarr to its repository under /mnt/backup/dandi/dandizarrs has completed, should the submodule in the Dandiset dataset be added/updated as part of the commit that the other assets in the same version belong to, or as part of a separate commit?

oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as part of the larger https://github.com/dandi/dandi-archive/issues/937)? then maybe we should also ask to have mtime included too while we're at it?

We would still need to query S3 to get the versioned AWS URL to register for the file in git-annex.
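For reference, that per-entry query could look something like this sketch (boto3; the function name is made up, and the bucket/key layout follows the listing above):

```python
import boto3

def entry_size_and_versioned_url(bucket: str, zarr_id: str, entry_path: str):
    """Return (size, versioned URL) for one Zarr entry with a single S3 request."""
    s3 = boto3.client("s3")
    key = f"zarr/{zarr_id}/{entry_path}"
    info = s3.head_object(Bucket=bucket, Key=key)
    url = f"https://{bucket}.s3.amazonaws.com/{key}?versionId={info['VersionId']}"
    return info["ContentLength"], url
```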

yarikoptic commented 2 years ago
jwodder commented 2 years ago

@yarikoptic

yarikoptic commented 2 years ago
  • The backup script currently does datalad.cfg.set("datalad.repo.backend", "SHA256E", where="override") in order to set the git-annex backend for the Dandiset datasets; how would the backend be set to MD5E without affecting any non-Zarr datasets?

I am a bit lost, but

  • If all the files in a Zarr are deleted, what should the commit timestamp for the Zarr dataset be?

;-) tricky you! you dug it up -- even for non-empty ones we must consider DeleteMarkers' (deleted files/keys) datetimes, so that we have a datetime of modification for a .zarr which only had some files removed.

  • Should the default branch for Zarr datasets be "draft" like for Dandiset datasets or something else?

Let's stick with draft indeed for consistency and because they are updated while added to draft datasets

jwodder commented 2 years ago

@yarikoptic

jwodder commented 2 years ago

@yarikoptic Also:

yarikoptic commented 2 years ago

Good questions!

  • It's my understanding that the datalad.cfg.set(...) line sets the backend used for all Dataset creations throughout the process. Is there a way to set it for just the dataset currently being initialized, or do I have to call datalad.cfg.set(...) with "SHA256E" or "MD5E", as appropriate, before every call to Dataset.create()?

I don't think this is possible (any longer). See https://github.com/datalad/datalad/issues/5155 for a TODO context manager; there is actually an existing one used in the tests, datalad.tests.utils.patch_config. NB: I might ask you to PR a proper context manager within ConfigManager.
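Something like this sketch should scope the backend to just the one dataset being created (the import location is for current datalad releases and may move later; the zarr path is illustrative):

```python
from datalad.api import Dataset
from datalad.tests.utils import patch_config  # location may differ in newer datalad

zarr_ds_path = "/mnt/backup/dandi/dandizarrs/020f7130-3a59-4140-b01d-ac2180917b05"

# the backend override is only in effect inside the context manager,
# so other (SHA256E) datasets created elsewhere are unaffected
with patch_config({"datalad.repo.backend": "MD5E"}):
    zarr_ds = Dataset(zarr_ds_path).create()
```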

  • So you now want the timestamps of DeleteMarkers in Zarr storage to always be taken into account when syncing a Zarr?

Yes, I think we are doomed to do that.

  • Will the contents of a Zarr always be under the prefix https://{bucket}.s3.amazonaws.com/zarr/{zarr_id}/?

Not sure about always, but it is ATM. A similar one is in the staging bucket (we have tests using staging, right?)

  • What if someone interrupts a Zarr upload during the first batch — or simply abuses the API — to produce an empty Zarr with no DeleteMarkers?

indeed, neither files nor DeleteMarkers have to exist... so then we could take that zarr's creation datetime as the commit datetime
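Putting those two answers together, a rough sketch (boto3; the function and argument names are made up) of deriving the zarr modification time could be:

```python
import boto3

def zarr_last_modified(bucket, zarr_id, created):
    """Latest LastModified across object versions *and* DeleteMarkers under the
    zarr's prefix; falls back to the zarr's creation time if nothing exists."""
    s3 = boto3.client("s3")
    latest = None
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"zarr/{zarr_id}/"):
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            ts = entry["LastModified"]
            if latest is None or ts > latest:
                latest = ts
    return latest if latest is not None else created
```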

  • Will the backup remote for the Zarr datasets have the same name as the Dandisets backup remote?

Let's call it dandi-dandizarrs-dropbox (instead of dandi-dandisets-dropbox). I will stay optimistic that we can upload all the keys to the same huge store across all zarrs, as we did across all dandisets.

  • How exactly do I configure a Zarr dataset to store the .zarr-checksum file in git instead of git-annex?

Something like

ds.repo.set_gitattributes([
        ('.zarr-checksum', {'annex.largefiles': 'nothing'})])

should do it (unless some rule added later overrides it)

jwodder commented 2 years ago

@yarikoptic For the ds.repo.set_gitattributes() call, that doesn't commit automatically, does it? Do I need to call ds.save() manually for that, or can it be rolled into the dataset initialization commits somehow?

jwodder commented 2 years ago

@yarikoptic Reminder about question in previous comment.

yarikoptic commented 2 years ago

Yes; ds.repo.* methods do not "auto save" -- it is typically the ds.*-level interfaces that commit. I don't think it is possible to roll it into the dataset initialization commits, and it doesn't really need to be in a single commit.
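So the sequence would look something like this sketch (path is illustrative):

```python
from datalad.api import Dataset

ds = Dataset("/mnt/backup/dandi/dandizarrs/020f7130-3a59-4140-b01d-ac2180917b05")

# ds.repo.* edits .gitattributes in the working tree but does not commit...
ds.repo.set_gitattributes([(".zarr-checksum", {"annex.largefiles": "nothing"})])
# ...so a separate ds.save() records it in its own commit
ds.save(".gitattributes", message="Keep .zarr-checksum in git rather than git-annex")
```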

jwodder commented 2 years ago

@yarikoptic Do datalad clone and datalad update make commits? If so, what do I do about that?

yarikoptic commented 2 years ago

@yarikoptic Do datalad clone and datalad update make commits? If so, what do I do about that?

jwodder commented 2 years ago

@yarikoptic So they do create commits in the outer dataset, and that can't be avoided? Should all non-Zarr assets be committed before cloning/updating Zarr subdatasets?

for clone - ideally we should take the datetime of that asset creation

Take it for what, exactly?

yarikoptic commented 2 years ago

@yarikoptic So they do create commits in the outer dataset, and that can't be avoided?

I don't think we want to avoid commits - they are what would give us an idea of the state of a dandiset at that point in time

Should all non-Zarr assets be committed before cloning/updating Zarr subdatasets?

Not necessarily, as we don't commit per each non-zarr asset change. Ideally there should be nothing special about a Zarr asset/subdataset in that sense.

for clone - ideally we should take the datetime of that asset creation

Take it for what, exactly?

For the commit in the dandiset. Since there could be other assets to be saved, I guess we shouldn't do commit -d ., but leave it to the eventual call to save to save it?

jwodder commented 2 years ago

@yarikoptic I'm confused about exactly what should be committed at what point.

So maybe let's take "latest(zarr_asset_modification, last_commit_in_dandiset+1ms)"?

Git commit timestamps have one-second resolution, so adding a millisecond is not an option.

yarikoptic commented 2 years ago
  • When a Zarr dataset is cloned into a Dandiset dataset, does this cause a commit to be created in the Dandiset dataset? If so, how should this commit be sequenced in relation to the committing of blob assets?

Echoing my thinking above -- Let's not commit right at that point. clone'ing should be identical to just adding a new asset, but not necessarily committing right away. Saving/committing should be sequenced as it would be for any other asset. So something like "add assetX; clone zarr as assetX+1; add assetX+2; ... and so on; datalad save -m 'added/updated X assets'".
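As a rough, non-authoritative sketch of that sequencing (datalad Python API; the paths and the zarr source URL are illustrative, not the actual backup-script code):

```python
from datalad.api import Dataset, clone

dandiset = Dataset("/mnt/backup/dandi/dandisets/000XXX")  # illustrative

# stage blob-asset changes in the working tree without committing each one
# (download/annex-add of regular assets would happen here)

# clone the zarr into place like any other asset; nothing is committed yet
clone(source="https://github.com/dandizarrs/<zarr_id>",   # illustrative source
      path=dandiset.pathobj / "sub-XXX" / "sample.zarr")

# ...more asset changes...

# one save at the end records everything, registering the zarr as a subdataset
dandiset.save(message="added/updated X assets")
```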

Same question as above, but for update

same answer -- treat the zarr subdataset as any other asset, collapsing multiple updates across assets where our logic already says to do that (we moved away from 1-commit-per-asset a while back)

Git commit timestamps have one-second resolution, so adding a millisecond is not an option.

then 1 second? or just remove any increment? your choice -- anything which wouldn't throw off the logic for minting release commits

jwodder commented 2 years ago

@yarikoptic

Let's not commit right at that point

How? When using the Datalad Python API, how exactly do I invoke clone() and update() so that they don't automatically create a commit in the superdataset?

yarikoptic commented 2 years ago
jwodder commented 2 years ago

@yarikoptic I'm trying to write a test in which a Zarr is uploaded & backed up, then a file in the Zarr becomes a directory, a directory becomes a file, and the Zarr is uploaded & backed up again. However, when calling Dataset.save() after updating the modified Zarr dataset, the invocation of git -c diff.ignoreSubmodules=none commit -m '[backups2datalad] 2 files added, 2 files deleted, checksum updated' -- .zarr-checksum changed01 changed02/file.txt changed01/file.txt changed02 (where changed02 was previously a file and is now a directory) fails with 'changed02' does not have a commit checked out. I am unable to create an MVCE that reproduces this problem.

yarikoptic commented 2 years ago

Cool! Just push that test and I will try it out as well to troubleshoot datalad from there?

jwodder commented 2 years ago

@yarikoptic Pushed. You can run just that test with nox -e test -- -k test_backup_zarr_entry_conflicts.

yarikoptic commented 2 years ago

Filed https://github.com/datalad/datalad/issues/6558. Could you disable such a tricky unit test for now? ;)

jwodder commented 2 years ago

@yarikoptic Test disabled.

jwodder commented 2 years ago

@yarikoptic The populate command needs to be updated to support Zarr datasets. Do you have any preference for how its behavior should be adjusted?

jwodder commented 2 years ago

@yarikoptic Ping.

yarikoptic commented 2 years ago

Sorry for the delay... Although I do not like breeding commands, I think for now (or forever) we would be OK with a dedicated populate-zarr or similar command which would go through the present zarr datasets and do a similar dance. It should also be added to the -cron script.

Rationale:

jwodder commented 2 years ago

@yarikoptic

DECIDE: on drogon we will store them either under a dedicated /mnt/backup/dandi/dandizarrs (folder? super dataset?) or possibly as a /mnt/backup/dandi/dandisets/zarrs subdataset

What have you decided about this?

yarikoptic commented 2 years ago

Let's just do /mnt/backup/dandi/dandizarrs folder for now: