Provide support for zarr (.zarr/.ngff)

yarikoptic commented 2 years ago

zarr/ refers to a .zarr or .ngff folder asset which are already entering staging server and soon will emerge in the main one. We should be ready (#126 is a stop gap measure so we do not "pollute" our datalad dandisets)

every zarr/ should be a DataLad subdataset
"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix (look at s3://dandi-api-staging-dandisets/zarr/)
DECIDE: on drogon we will store them either under a dedicated /mnt/backup/dandi/dandizarrs (folder? super dataset?) or can /mnt/backup/dandi/dandisets/zarrs subdataset
they will be pushed to a corresponding repo on github under https://github.com/dandizarrs organization
we do not bother "mirroring" zarr (on drogon or github) until we encounter it in some dandiset
- pros: not wasting bandwidth
- cons: we are not reacting to ongoing uploads of large zarrs, and will need to get them only whenever encountered
- that is why probably no point to make dandizars into a repo/superdataset - since would not reflect entire state on the bucket anyways.
we will use MD5E git-annex backend, all files in zarr should be single-part upload (we need to verify, is that possible?), so we can mint git annex keys from ETag (md5) without downloading any content! (@yarikoptic is yet to review/adopt #103 for out of banding backup of regular assets)
zarr submodule would be installed/updated from the collection of zarrs on github mentioned above, or removed if zarr is removed in a given dandiset/path.
- destiny of zarr-checksum storage in the archive is yet to be really decided https://github.com/dandi/dandi-cli/issues/915#issue-1140496287, but ideally include/populate the overall zarr tree checksum in the datalad dandiset at least for the entire zarr/.
- since we use MD5 backend, we can use annex keys to relatively (how quick?) quickly compute zarr-checksum of the entire zarr/
- all zarr files must be under annex
- if we decide to keep it, .zarr-checksum or whatever that file to contain overall checksum should reside under git not git-annex

Having established workflow for out-of-band backuping regular assets (based on #103) we will approach backup of zarrs. Some notes:

For dropbox backup special remote, lets use a new prefix and UUID for those to not interfere with non-zarr backup, so use smth like
```
                    "prefix=dandi-dandizarrs/annexstore",
                    "uuid=727f466f-60c3-4778-90b2-b2332856c2f9",
```
where I changed prefix and incremented last number in uuid (is it still legit? ;))

jwodder commented 2 years ago

@yarikoptic

What exactly should the names of the repositories in https://github.com/dandizarrs be set to? The Zarr IDs?
Please try to make a decision on #103 before I start work on this

"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix

I can't tell what you're trying to say here.

yarikoptic commented 2 years ago

What exactly should the names of the repositories in https://github.com/dandizarrs be set to? The Zarr IDs?

yes, 1-to-1 to prefix (AKA folder) on S3:

dandi@drogon:~$ s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/ | head
                          DIR  s3://dandiarchive/zarr/020f7130-3a59-4140-b01d-ac2180917b05/
                          DIR  s3://dandiarchive/zarr/02499e55-c945-4af9-a9d8-d9072d94959c/
                          DIR  s3://dandiarchive/zarr/0316a531-decb-4401-99b7-5d15e8c3dcec/
                          DIR  s3://dandiarchive/zarr/031bf698-6917-4294-a086-61a2454e0a07/
...

"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix

I can't tell what you're trying to say here.

pretty much what you asked (and I answered) about above ;-)

jwodder commented 2 years ago

@yarikoptic

Currently, when committing changes to backup datasets, the commit timestamp is set to the latest creation date of the assets involved in the commit; however, Zarrs can be modified after they are created, so this could produce misleading results. Should the commit timestamp be based on asset modification dates instead?
Do we have any idea how publishing Dandisets with Zarrs is going to work? Will publishing such a Dandiset produces a version with immutable copies of the Zarrs with different Zarr IDs? Will the IDs of the Zarrs in the draft remain the same or be changed afterwards?

yarikoptic commented 2 years ago

Currently, when committing changes to backup datasets, the commit timestamp is set to the latest creation date of the assets involved in the commit; however, Zarrs can be modified after they are created, so this could produce misleading results. Should the commit timestamp be based on asset modification dates instead?

for the commit which would create the subdataset for zarr (if committing separately) would be worthwhile using creation time. If you would not be committing in dandiset's git repo at the moment of creation, then let's use the datetime of the commit to be committed in that .zarr subdataset. And commit in the .zarr subdataset should have datetime corresponding to the latest datetime of a file in that .zarr. I didn't check if we have that information from API or are we doomed to talk to S3 API?

Do we have any idea how publishing Dandisets with Zarrs is going to work? Will publishing such a Dandiset produces a version with immutable copies of the Zarrs with different Zarr IDs? Will the IDs of the Zarrs in the draft remain the same or be changed afterwards?

I think this all is still to be decided upon, so AFAIK publishing of dandisets with zarrs is disabled and we should error out if we run into such siutation. BUT meanwhile, in case of datalad dandisets, while bucket is still versioned, we just need to make sure to use versioned URLs to S3.

jwodder commented 2 years ago

@yarikoptic

for the commit which would create the subdataset for zarr (if committing separately) would be worthwhile using creation time.

Do you mean the initial commit(s) created when Dataset.create() is called?

If you would not be committing in dandiset's git repo at the moment of creation,

I would not.

then let's use the datetime of the commit to be committed in that .zarr subdataset.

I don't know what you mean by this.

And commit in the .zarr subdataset should have datetime corresponding to the latest datetime of a file in that .zarr. I didn't check if we have that information from API or are we doomed to talk to S3 API?

Zarr entry timestamps can only be retrieved via S3.

yarikoptic commented 2 years ago

Do you mean the initial commit(s) created when Dataset.create() is called?

yes, since that one would do some commits (e.g. to commit .datalad/config)

then let's use the datetime of the commit to be committed in that .zarr subdataset.

I don't know what you mean by this.

in the dandiset's datalad dataset which is to commit the changes to .zarr/ subdataset (if already committed separately), commit using datetime of the last commit in .zarr/ subdataset which would represent the datetime of its change. Per below, may be it could be just a single call the_dandiset.save(zarr_path, recursive=True) after overloading datetime to correspond to the zarr modification time and that should produce those two commits (in zarr_path and the_dandiset) with the same datetime.

And commit in the .zarr subdataset should have datetime corresponding to the latest datetime of a file in that .zarr. I didn't check if we have that information from API or are we doomed to talk to S3 API?

Zarr entry timestamps can only be retrieved via S3.

oh, that sucks... then I guess we should take modified time for that entire zarr (not asset it belongs to) -- do we get that timestamp somewhere

jwodder commented 2 years ago

@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarrs repositories would be created under /mnt/backup/dandi/dandizarrs or /mnt/backup/dandi/dandisets/zarrs and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.

oh, that sucks... then I guess we should take modified time for that entire zarr (not asset it belongs to) -- do we get that timestamp somewhere

We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.

yarikoptic commented 2 years ago

@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarrs repositories would be created under /mnt/backup/dandi/dandizarrs or /mnt/backup/dandi/dandisets/zarrs and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.

right, I forgot that aspect of the design -- we do have all of them under /mnt/backup/dandi/dandizarrs ("mirrored" under https://github.com/dandizarrs) and only installed/updated/uninstalled in any particular dandiset so we do updates in their dandizarrs/{zarr_id} location, then have to just datalad update --how ff-only their installation(s) in the corresponding dandiset (where/when we encountered zarr), and commit possibly changed state of that zarr subdataset in the corresponding dandiset.

We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.

oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as a part of the larger https://github.com/dandi/dandi-archive/issues/937)? then may be we should also ask to have mtime to be included too while at it?

jwodder commented 2 years ago

@yarikoptic Could you write out a pseudocode sketch or something of how you envision adding a Zarr repo to a Dandiset dataset working? Right now, pre-Zarr, it roughly works like this:

For every Dandiset D:
- Create the dataset, if necessary
- For every version V of D (putting the draft last):
  - For every asset in the draft, in creation order, created between V-1 and V (or the end of time, if V is the draft):
    - Register the asset in the dataset
  - Make a commit, if there were any changes, using the latest creation time of all assets in the version as the commit date
- Delete any files from the dataset that are no longer in the draft; if necessary, make a commit using the draft's modified timestamp as the commit timestamp

In particular, once the syncing of the Zarr to its repository under /mnt/backup/dandi/dandizarrs has completed, should the submodule in the Dandiset dataset be added/updated as part of the commit that the other assets in the same version belong to, or as part of a separate commit?

oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as a part of the larger https://github.com/dandi/dandi-archive/issues/937)? then may be we should also ask to have mtime to be included too while at it?

We would still need to query S3 to get the versioned AWS URL to register for the file in git-annex.

yarikoptic commented 2 years ago

For every Dandiset D:
- Create the dataset, if necessary
- For every version V of D (putting the draft last):
  - For every asset in the draft, in creation order, created between V-1 and V (or the end of time, if V is the draft):
    - If asset is a .zarr
      - check that /mnt/backup/dandi/dandizarrs/{zarr_id} exists, if not -- create
      - go to /mnt/backup/dandi/dandizarrs/{zarr_id} and update that zarr datalad dataset to correspond to the state in dandi-archive (according to checksum) and on S3, with commit datetime to correspond to the latest mtime among keys on S3
      - if checksum is not available -- do what we do for the assets missing sha256
      - verify that checksum corresponds to the expected one
      - push updated state to github (among https://github.com/dandizarrs)
    - Register the asset in the dataset
      - for .zarr asset that entails:
      - if new -- clone -d . https://github.com/dandizarrs/{zarr_id} {target_path}
      - if exists already -- update -d {target_path} --how ff-only -s github and then save -d {dandiset} {target_path}
  - Make a commit, if there were any changes, using the latest creation time of all assets in the version as the commit date
    - special case: for zarr, use the last commit datetime set above
- Delete any files from the dataset that are no longer in the draft; if necessary, make a commit using the draft's modified timestamp as the commit timestamp

jwodder commented 2 years ago

@yarikoptic

The backup script currently does datalad.cfg.set("datalad.repo.backend", "SHA256E", where="override") in order to set the git-annex backend for the Dandiset datasets; how would the backend be set to MD5E without affecting any non-Zarr datasets?
If all the files in a Zarr are deleted, what should the commit timestamp for the Zarr dataset be?
Should the default branch for Zarr datasets be "draft" like for Dandiset datasets or something else?

yarikoptic commented 2 years ago

The backup script currently does datalad.cfg.set("datalad.repo.backend", "SHA256E", where="override") in order to set the git-annex backend for the Dandiset datasets; how would the backend be set to MD5E without affecting any non-Zarr datasets?

I am a bit lost , but

we should keep it as is (SHA256E) for dandisets datasets
for .zarr datasets we need to either not set it, then it would default to MD5E, or better set it explicitly to MD5E "just to be sure"

If all the files in a Zarr are deleted, what should the commit timestamp for the Zarr dataset be?

;-) tricky you! you dug it up -- even for non-empty ones we must consider DeleteMarker's (deleted files/keys) datetimes so we have datetime of modification of a .zarr which only got some files removed.

Should the default branch for Zarr datasets be "draft" like for Dandiset datasets or something else?

Let's stick with draft indeed for consistency and because they are updated while added to draft datasets

jwodder commented 2 years ago

@yarikoptic

It's my understanding that the datalad.cfg.set(...) line sets the backend used for all Dataset creations throughout the process. Is there a way to set it for just the dataset currently being initialized, or do I have to call datalad.cfg.set(...) with "SHA256E" or "MD5E", as appropriate, before every call to Dataset.create()?
So you now want the timestamps of DeleteMarkers in Zarr storage to always be taken into account when syncing a Zarr? Will the contents of a Zarr always be under the prefix https://{bucket}.s3.amazonaws.com/zarr/{zarr_id}/?
What if someone interrupts a Zarr upload during the first batch — or simply abuses the API — to produce an empty Zarr with no DeleteMarkers?

jwodder commented 2 years ago

@yarikoptic Also:

Will the backup remote for the Zarr datasets have the same name as the Dandisets backup remote?
How exactly do I configure a Zarr dataset to store the .zarr-checksum file in git instead of git-annex?

yarikoptic commented 2 years ago

Good questions!

It's my understanding that the datalad.cfg.set(...) line sets the backend used for all Dataset creations throughout the process. Is there a way to set it for just the dataset currently being initialized, or do I have to call datalad.cfg.set(...) with "SHA256E" or "MD5E", as appropriate, before every call to Dataset.create()?

I don't think this is possible (any longer). See https://github.com/datalad/datalad/issues/5155 for a TODO context manager and actually an existing one used in the tests datalad.tests.utils.patch_config. NB: Might ask you to PR for a proper context manager within ConfigManager.

So you now want the timestamps of DeleteMarkers in Zarr storage to always be taken into account when syncing a Zarr?

Yes, I think we are doomed to do that.

Will the contents of a Zarr always be under the prefix https://{bucket}.s3.amazonaws.com/zarr/{zarr_id}/?

Not sure about always, but it is ATM. A similar one is in staging bucket (we have tests using staging, right?)

What if someone interrupts a Zarr upload during the first batch — or simply abuses the API — to produce an empty Zarr with no DeleteMarkers?

indeed, neither files nor DeleteMarkers have to exist... so then we could take that zarr creation datetime as a commit datetime

Will the backup remote for the Zarr datasets have the same name as the Dandisets backup remote?

Let's call it dandi-dandizarrs-dropbox (instead of dandi-dandisets-dropbox) . I will stay optimistic to upload all the keys to the same huge store across all zarr's as we did across all dandisets

How exactly do I configure a Zarr dataset to store the .zarr-checksum file in git instead of git-annex?

Something like

ds.repo.set_gitattributes([
        ('.zarr-checksum', {'annex.largefiles': 'nothing'})])

should do it (unless some rule added later overrides it)

jwodder commented 2 years ago

@yarikoptic For the ds.repo.set_gitattributes() call, that doesn't commit automatically, does it? Do I need to call ds.save() manually for that, or can it be rolled into the dataset initialization commits somehow?

jwodder commented 2 years ago

@yarikoptic Reminder about question in previous comment.

yarikoptic commented 2 years ago

yes, as ds.repo. methods do not "auto save", it is typical for ds. level interfaces to commit. I don't think it is possible to roll into dataset initialization, and not really needed to be in a single commit.

jwodder commented 2 years ago

@yarikoptic Do datalad clone and datalad update make commits? If so, what do I do about that?

yarikoptic commented 2 years ago

@yarikoptic Do datalad clone and datalad update make commits? If so, what do I do about that?

for clone - ideally we should take the datetime of that asset creation I guess since it would be the best thing to correspond (even if datetime of commit it points in the zarr subdataset is later )
for update - since ff-only, in the zarr subdataset there would be no new commit. In the dandiset -- becomes trickier. I would have just taken the asset modification date, but I am afraid that might lead to non-linear history :-/ (some other asset is created/modified after zarr asset mutates internally) So may be lets take "latest(zarr_asset_modification, last_commit_in_dandiset+1ms)"? WDYT @jwodder ? my main worry is to not interfere with other logic which would rely on dates etc

jwodder commented 2 years ago

@yarikoptic So they do create commits in the outer dataset, and that can't be avoided? Should all non-Zarrs assets be committed before cloning/updating Zarr subdatasets?

for clone - ideally we should take the datetime of that asset creation

Take it for what, exactly?

yarikoptic commented 2 years ago

@yarikoptic So they do create commits in the outer dataset, and that can't be avoided?

I don't think we want to avoid commits - they are what would give us idea about the state of a dandiset at that point in time

Should all non-Zarrs assets be committed before cloning/updating Zarr subdatasets?

Not necessarily, as we don't commit per each non zarr asset change. Ideally there should be nothing special about Zarr asset/subdataset in that sense

for clone - ideally we should take the datetime of that asset creation

Take it for what, exactly?

For the commit in dandiset. Since there could be other assets to be saved, I guess we shouldn't do commit -d . , but let to eventual call to save to save it?

jwodder commented 2 years ago

@yarikoptic I'm confused about exactly what should be committed at what point.

When a Zarr dataset is cloned into a Dandiset dataset, does this cause a commit to be created in the Dandiset dataset? If so, how should this commit be sequenced in relation to the committing of blob assets?
Same question as above, but for update

So may be lets take "latest(zarr_asset_modification, last_commit_in_dandiset+1ms)"?

Git commit timestamps have one-second resolution, so adding a millisecond is not an option.

yarikoptic commented 2 years ago

When a Zarr dataset is cloned into a Dandiset dataset, does this cause a commit to be created in the Dandiset dataset? If so, how should this commit be sequenced in relation to the committing of blob assets?

Echoing my thinking above -- Let's not commit right at that point. clone'ing should be identical to just adding a new asset but not necessarily committing right away. Save/commit-ing should be sequenced as it would be for any other asset. So smth like "add assetX; clone zarr as assetX+1; add assetX+2; ... so on; datalad save -m 'added/updated X assets'"

Same question as above, but for update

same answer -- treat zarr subdataset as any other asset, collapsing multiple updates across assets where our logic says to do that already (we moved away from 1-commit-per-asset awhile back)

Git commit timestamps have one-second resolution, so adding a millisecond is not an option.

then 1 second? or just remove any increment? your choice -- anything which wouldn't throw off logic for minting release commits

jwodder commented 2 years ago

@yarikoptic

Let's not commit right at that point

How? When using the Datalad Python API, how exactly do I invoke clone() and update() so that they don't automatically create a commit in the superdataset?

yarikoptic commented 2 years ago

datalad.ui.clone without providing dataset= kwarg
zarr_subdataset.update(...) would trigger update in that zarr_subdataset, but would not commit in (unknown to such call) dandiset dataset which it is a submodule of

jwodder commented 2 years ago

@yarikoptic I'm trying to write a test in which a Zarr is uploaded & backed up, then a file in the Zarr becomes a directory, a directory becomes a file, and the Zarr is uploaded & backed up again. However, when calling Dataset.save() after updating the modified Zarr dataset, the invocation of git -c diff.ignoreSubmodules=none commit -m '[backups2datalad] 2 files added, 2 files deleted, checksum updated' -- .zarr-checksum changed01 changed02/file.txt changed01/file.txt changed02 (where changed02 was previously a file and is now a directory) fails with 'changed02' does not have a commit checked out. I am unable to create an MVCE that reproduces this problem.

yarikoptic commented 2 years ago

Cool! Just push that test and I will try it out as well to troubleshoot datalad from there?

jwodder commented 2 years ago

@yarikoptic Pushed. You can run just that test with nox -e test -- -k test_backup_zarr_entry_conflicts.

yarikoptic commented 2 years ago

filed https://github.com/datalad/datalad/issues/6558 . Could you for now disable such a tricky unit test? ;)

jwodder commented 2 years ago

@yarikoptic Test disabled.

jwodder commented 2 years ago

@yarikoptic The populate command needs to be updated to support Zarr datasets. Do you have any preference for how its behavior should be adjusted?

Should populate operate on the datasets in the Zarr folder or on subdatasets of Dandiset datasets?
Should there be two populate commands, one for Dandiset datasets and one for Zarr datasets?
If there should be only one populate command, should it be called twice (once for Dandisets, once for Zarrs), or should it handle everything in one call?

jwodder commented 2 years ago

@yarikoptic Ping.

yarikoptic commented 2 years ago

sorry for the delay... Although I do not like breeding commands, I think for now or forever we would be ok if there is a dedicated populate-zarr or alike command which would go through present zarr datasets and do similar dance. It should also be added to -cron script.

Rationale:

current tree of a dandiset which used to have zarr might no longer have it, thus we might miss a zarr to be backed up which missed it turn somehow.
it might be convenient/useful to be able to trigger populate for zarrs only to e.g. ensure that we got those covered

jwodder commented 2 years ago

@yarikoptic

DECIDE: on drogon we will store them either under a dedicated /mnt/backup/dandi/dandizarrs (folder? super dataset?) or can /mnt/backup/dandi/dandisets/zarrs subdataset

What have you decided about this?

yarikoptic commented 2 years ago

Let's just do /mnt/backup/dandi/dandizarrs folder for now:

benefit from joining into a superdataset would be minimal IMHO -- the "state" of the dandiarchive for a date would be accessible through dandisets into those zarrs (as exposed on github)
saveing modified state in datalad dataset could be expensive since those zarrs would be monsters in number of files

dandi / dandisets

Provide support for zarr (.zarr/.ngff) #127