@yarikoptic
"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix
I can't tell what you're trying to say here.
- What exactly should the names of the repositories in https://github.com/dandizarrs be set to? The Zarr IDs?
yes, 1-to-1 to prefix (AKA folder) on S3:
```
dandi@drogon:~$ s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/ | head
   DIR  s3://dandiarchive/zarr/020f7130-3a59-4140-b01d-ac2180917b05/
   DIR  s3://dandiarchive/zarr/02499e55-c945-4af9-a9d8-d9072d94959c/
   DIR  s3://dandiarchive/zarr/0316a531-decb-4401-99b7-5d15e8c3dcec/
   DIR  s3://dandiarchive/zarr/031bf698-6917-4294-a086-61a2454e0a07/
...
```
"original" zarr datasets will correspond to their zarr_ids as the ones used on s3 as prefixes under zarr/ prefix
I can't tell what you're trying to say here.
pretty much what you asked (and I answered) about above ;-)
@yarikoptic
- Currently, when committing changes to backup datasets, the commit timestamp is set to the latest creation date of the assets involved in the commit; however, Zarrs can be modified after they are created, so this could produce misleading results. Should the commit timestamp be based on asset modification dates instead?
- Do we have any idea how publishing Dandisets with Zarrs is going to work? Will publishing such a Dandiset produce a version with immutable copies of the Zarrs with different Zarr IDs? Will the IDs of the Zarrs in the draft remain the same or be changed afterwards?
- Currently, when committing changes to backup datasets, the commit timestamp is set to the latest creation date of the assets involved in the commit; however, Zarrs can be modified after they are created, so this could produce misleading results. Should the commit timestamp be based on asset modification dates instead?
for the commit which would create the subdataset for zarr (if committing separately) it would be worthwhile using creation time. If you would not be committing in the dandiset's git repo at the moment of creation, then let's use the datetime of the commit to be committed in that `.zarr` subdataset. And the commit in the `.zarr` subdataset should have a datetime corresponding to the latest datetime of a file in that `.zarr`. I didn't check if we have that information from the API or are we doomed to talk to the S3 API?
Do we have any idea how publishing Dandisets with Zarrs is going to work? Will publishing such a Dandiset produces a version with immutable copies of the Zarrs with different Zarr IDs? Will the IDs of the Zarrs in the draft remain the same or be changed afterwards?
I think this all is still to be decided upon, so AFAIK publishing of dandisets with zarrs is disabled and we should error out if we run into such a situation. BUT meanwhile, in case of datalad dandisets, while the bucket is still versioned, we just need to make sure to use versioned URLs to S3.
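For illustration only, a minimal sketch (not the actual backup code) of what registering such a versioned S3 URL for an annexed file could look like with DataLad's `AnnexRepo` API; the bucket, key, version ID, and paths below are made-up placeholders:

```python
from datalad.support.annexrepo import AnnexRepo

# Hypothetical values for illustration only
bucket = "dandiarchive"
key = "zarr/020f7130-3a59-4140-b01d-ac2180917b05/0/0/0"
version_id = "some-S3-version-id"

# A versioned S3 URL pins the exact object version, so the annexed file
# stays retrievable even after the key is overwritten or deleted.
url = f"https://{bucket}.s3.amazonaws.com/{key}?versionId={version_id}"

repo = AnnexRepo("/mnt/backup/dandi/dandizarrs/020f7130-3a59-4140-b01d-ac2180917b05")
# record the URL as a source for the annexed file without downloading it
repo.add_url_to_file("0/0/0", url, options=["--relaxed"])
```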
@yarikoptic
for the commit which would create the subdataset for zarr (if committing separately) would be worthwhile using creation time.
Do you mean the initial commit(s) created when `Dataset.create()` is called?
If you would not be committing in dandiset's git repo at the moment of creation,
I would not.
then let's use the datetime of the commit to be committed in that .zarr subdataset.
I don't know what you mean by this.
And commit in the .zarr subdataset should have datetime corresponding to the latest datetime of a file in that .zarr. I didn't check if we have that information from API or are we doomed to talk to S3 API?
Zarr entry timestamps can only be retrieved via S3.
Do you mean the initial commit(s) created when `Dataset.create()` is called?
yes, since that one would do some commits (e.g. to commit `.datalad/config`)
then let's use the datetime of the commit to be committed in that .zarr subdataset.
I don't know what you mean by this.
in the dandiset's datalad dataset which is to commit the changes to the `.zarr/` subdataset (if already committed separately), commit using the datetime of the last commit in the `.zarr/` subdataset, which would represent the datetime of its change. Per below, maybe it could be just a single call `the_dandiset.save(zarr_path, recursive=True)` after overloading the datetime to correspond to the zarr modification time, and that should produce those two commits (in `zarr_path` and `the_dandiset`) with the same datetime.
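A rough sketch of one way such datetime overloading could be done (not necessarily how backups2datalad handles it): git honors the `GIT_AUTHOR_DATE`/`GIT_COMMITTER_DATE` environment variables, so a recursive `save` run with them set would stamp both commits with the zarr modification time. Names and paths below are illustrative:

```python
import os
from contextlib import contextmanager
from datalad.api import Dataset

@contextmanager
def commit_datetime(dt_iso: str):
    """Temporarily force git commit timestamps to the given ISO datetime."""
    old = {k: os.environ.get(k) for k in ("GIT_AUTHOR_DATE", "GIT_COMMITTER_DATE")}
    os.environ["GIT_AUTHOR_DATE"] = dt_iso
    os.environ["GIT_COMMITTER_DATE"] = dt_iso
    try:
        yield
    finally:
        for k, v in old.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v

# placeholder dandiset path, zarr asset path, and modification time
the_dandiset = Dataset("/mnt/backup/dandi/dandisets/000108")
with commit_datetime("2022-03-01T12:34:56+00:00"):
    # one call producing commits in both the zarr subdataset and the dandiset
    the_dandiset.save(path="sub-xyz/sample.zarr", recursive=True)
```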
And commit in the .zarr subdataset should have datetime corresponding to the latest datetime of a file in that .zarr. I didn't check if we have that information from API or are we doomed to talk to S3 API?
Zarr entry timestamps can only be retrieved via S3.
oh, that sucks... then I guess we should take the modified time for that entire zarr (not the asset it belongs to) -- do we get that timestamp somewhere?
@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarr repositories would be created under `/mnt/backup/dandi/dandizarrs` or `/mnt/backup/dandi/dandisets/zarrs` and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.
oh, that sucks... then I guess we should take modified time for that entire zarr (not asset it belongs to) -- do we get that timestamp somewhere
We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.
@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarr repositories would be created under `/mnt/backup/dandi/dandizarrs` or `/mnt/backup/dandi/dandisets/zarrs` and then submodules pointing to either those repositories or their GitHub mirrors would be created under the Dandiset datasets. Keep in mind that, although it cannot be done through dandi-cli, a user of the Dandi Archive API could create a Dandiset in which the same Zarr is present at two different asset paths.
right, I forgot that aspect of the design -- we do have all of them under `/mnt/backup/dandi/dandizarrs` ("mirrored" under https://github.com/dandizarrs) and only installed/updated/uninstalled in any particular dandiset, so we do updates in their `dandizarrs/{zarr_id}` location, then have to just `datalad update --how ff-only` their installation(s) in the corresponding dandiset (where/when we encountered the zarr), and commit the possibly changed state of that zarr subdataset in the corresponding dandiset.
We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry.
oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as a part of the larger https://github.com/dandi/dandi-archive/issues/937)? then maybe we should also ask to have mtime included too while at it?
@yarikoptic Could you write out a pseudocode sketch or something of how you envision adding a Zarr repo to a Dandiset dataset working? Right now, pre-Zarr, it roughly works like this:
In particular, once the syncing of the Zarr to its repository under `/mnt/backup/dandi/dandizarrs` has completed, should the submodule in the Dandiset dataset be added/updated as part of the commit that the other assets in the same version belong to, or as part of a separate commit?
oh, because of https://github.com/dandi/dandi-archive/issues/925 (to be addressed as a part of the larger https://github.com/dandi/dandi-archive/issues/937)? then may be we should also ask to have mtime to be included too while at it?
We would still need to query S3 to get the versioned AWS URL to register for the file in git-annex.
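For reference, a sketch of the kind of per-entry S3 query being discussed, using boto3 (bucket and key are placeholders); a single `head_object` call returns the size, the last-modified time, and the version ID needed to build the versioned URL:

```python
import boto3

s3 = boto3.client("s3")
bucket = "dandiarchive"
key = "zarr/020f7130-3a59-4140-b01d-ac2180917b05/0/0/0"  # placeholder entry

# One request per entry: size, mtime, and current version in a single response
# (VersionId is present because the bucket is versioned)
resp = s3.head_object(Bucket=bucket, Key=key)
size = resp["ContentLength"]
mtime = resp["LastModified"]
versioned_url = f"https://{bucket}.s3.amazonaws.com/{key}?versionId={resp['VersionId']}"
```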
- for each `.zarr` asset, check whether `/mnt/backup/dandi/dandizarrs/{zarr_id}` exists; if not -- create `/mnt/backup/dandi/dandizarrs/{zarr_id}` and update that zarr datalad dataset to correspond to the state in dandi-archive (according to checksum) and on S3, with commit datetime to correspond to the latest mtime among keys on S3
- for the `.zarr` asset in the dandiset that entails: `clone -d . https://github.com/dandizarrs/{zarr_id} {target_path}` or `update -d {target_path} --how ff-only -s github`, and then `save -d {dandiset} {target_path}`
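A rough Python sketch of that dance with the DataLad Python API; the paths, the `github` sibling name, and the overall structure are assumptions for illustration, not the actual backups2datalad implementation:

```python
from pathlib import Path
from datalad.api import Dataset, clone

ZARRS_ROOT = Path("/mnt/backup/dandi/dandizarrs")

def backup_zarr(zarr_id: str, dandiset: Dataset, asset_path: str) -> None:
    # 1) ensure /mnt/backup/dandi/dandizarrs/{zarr_id} exists and is up to date
    zarr_ds_path = ZARRS_ROOT / zarr_id
    zarr_ds = Dataset(str(zarr_ds_path))
    if not zarr_ds_path.exists():
        zarr_ds.create()  # assumes the MD5E backend is configured for zarr datasets
    # ... sync files from S3 / dandi-archive into zarr_ds and save it with a
    # commit datetime equal to the latest mtime among the S3 keys ...

    # 2) install or update the zarr as a subdataset of the dandiset
    target = Path(dandiset.path) / asset_path
    if not target.exists():
        # clone without dataset= so no commit is made in the dandiset yet
        clone(source=f"https://github.com/dandizarrs/{zarr_id}", path=str(target))
    else:
        Dataset(str(target)).update(sibling="github", how="ff-only")

    # 3) record the (possibly changed) subdataset state in the dandiset;
    #    in practice this save would be batched with other asset changes
    dandiset.save(path=asset_path)
```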
@yarikoptic
- The backup script currently does `datalad.cfg.set("datalad.repo.backend", "SHA256E", where="override")` in order to set the git-annex backend for the Dandiset datasets; how would the backend be set to MD5E without affecting any non-Zarr datasets?
- The backup script currently does `datalad.cfg.set("datalad.repo.backend", "SHA256E", where="override")` in order to set the git-annex backend for the Dandiset datasets; how would the backend be set to MD5E without affecting any non-Zarr datasets?
I am a bit lost, but
- If all the files in a Zarr are deleted, what should the commit timestamp for the Zarr dataset be?
;-) tricky you! You dug it up -- even for non-empty ones we must consider DeleteMarkers' (deleted files/keys) datetimes so we have the datetime of modification of a `.zarr` which only got some files removed.
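For reference, a sketch of how those key and DeleteMarker timestamps could be collected with boto3 (bucket and prefix are placeholders); `list_object_versions` returns both object versions and delete markers, each with a `LastModified`:

```python
import boto3

s3 = boto3.client("s3")
bucket = "dandiarchive"
prefix = "zarr/020f7130-3a59-4140-b01d-ac2180917b05/"  # placeholder zarr prefix

timestamps = []
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    # live and old file versions
    timestamps.extend(v["LastModified"] for v in page.get("Versions", []))
    # deletions also count as modifications of the zarr
    timestamps.extend(d["LastModified"] for d in page.get("DeleteMarkers", []))

last_modified = max(timestamps) if timestamps else None
```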
- Should the default branch for Zarr datasets be "draft" like for Dandiset datasets or something else?
Let's stick with `draft` indeed, for consistency and because they are updated while added to draft datasets
@yarikoptic
- It's my understanding that the `datalad.cfg.set(...)` line sets the backend used for all Dataset creations throughout the process. Is there a way to set it for just the dataset currently being initialized, or do I have to call `datalad.cfg.set(...)` with "SHA256E" or "MD5E", as appropriate, before every call to `Dataset.create()`?
- Will the contents of a Zarr always be under the prefix `https://{bucket}.s3.amazonaws.com/zarr/{zarr_id}/`?
@yarikoptic Also:
- How exactly do I configure a Zarr dataset to store the `.zarr-checksum` file in git instead of git-annex?
Good questions!
- It's my understanding that the `datalad.cfg.set(...)` line sets the backend used for all Dataset creations throughout the process. Is there a way to set it for just the dataset currently being initialized, or do I have to call `datalad.cfg.set(...)` with "SHA256E" or "MD5E", as appropriate, before every call to `Dataset.create()`?
I don't think this is possible (any longer). See https://github.com/datalad/datalad/issues/5155 for a TODO context manager; an existing one used in the tests is `datalad.tests.utils.patch_config`. NB: Might ask you to PR a proper context manager within ConfigManager.
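For illustration, a sketch of scoping the backend to a single dataset creation with that test helper (the path is a placeholder, and whether to rely on a test utility in production code is an open question here):

```python
from datalad.api import Dataset
from datalad.tests.utils import patch_config

# Temporarily override the default backend only around this one creation,
# so other (non-Zarr) datasets keep using SHA256E.
with patch_config({"datalad.repo.backend": "MD5E"}):
    zarr_ds = Dataset("/mnt/backup/dandi/dandizarrs/some-zarr-id").create()
```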
- So you now want the timestamps of DeleteMarkers in Zarr storage to always be taken into account when syncing a Zarr?
Yes, I think we are doomed to do that.
- Will the contents of a Zarr always be under the prefix `https://{bucket}.s3.amazonaws.com/zarr/{zarr_id}/`?
Not sure about always, but it is ATM. A similar one is in staging bucket (we have tests using staging, right?)
- What if someone interrupts a Zarr upload during the first batch — or simply abuses the API — to produce an empty Zarr with no DeleteMarkers?
indeed, neither files nor DeleteMarkers have to exist... so then we could take that zarr creation datetime as a commit datetime
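Putting the last few answers together, the zarr commit timestamp could be chosen roughly like this (a sketch; `zarr_created` would presumably come from the archive API, the mtime lists from S3):

```python
from datetime import datetime
from typing import Sequence

def zarr_commit_timestamp(
    file_mtimes: Sequence[datetime],
    delete_marker_mtimes: Sequence[datetime],
    zarr_created: datetime,
) -> datetime:
    """Latest mtime among keys and DeleteMarkers; creation time if the zarr is empty."""
    all_mtimes = [*file_mtimes, *delete_marker_mtimes]
    return max(all_mtimes) if all_mtimes else zarr_created
```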
- Will the backup remote for the Zarr datasets have the same name as the Dandisets backup remote?
Let's call it `dandi-dandizarrs-dropbox` (instead of `dandi-dandisets-dropbox`). I will stay optimistic and upload all the keys to the same huge store across all zarrs as we did across all dandisets
- How exactly do I configure a Zarr dataset to store the `.zarr-checksum` file in git instead of git-annex?
Something like
```python
ds.repo.set_gitattributes([('.zarr-checksum', {'annex.largefiles': 'nothing'})])
```
should do it (unless some rule added later overrides it)
@yarikoptic For the `ds.repo.set_gitattributes()` call, that doesn't commit automatically, does it? Do I need to call `ds.save()` manually for that, or can it be rolled into the dataset initialization commits somehow?
@yarikoptic Reminder about question in previous comment.
yes, as `ds.repo.` methods do not "auto save", it is typical for `ds.` level interfaces to commit. I don't think it is possible to roll it into dataset initialization, and it doesn't really need to be in a single commit.
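So, as a minimal sketch, setting the attribute and committing it would be two steps (path and commit message are illustrative):

```python
from datalad.api import Dataset

zarr_ds = Dataset("/mnt/backup/dandi/dandizarrs/some-zarr-id")
# write the rule into .gitattributes (does not commit by itself)
zarr_ds.repo.set_gitattributes([(".zarr-checksum", {"annex.largefiles": "nothing"})])
# commit the .gitattributes change explicitly
zarr_ds.save(path=".gitattributes", message="Keep .zarr-checksum in git, not git-annex")
```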
@yarikoptic Do `datalad clone` and `datalad update` make commits? If so, what do I do about that?
@yarikoptic Do `datalad clone` and `datalad update` make commits? If so, what do I do about that?
- `clone` - ideally we should take the datetime of that asset creation, I guess, since it would be the best thing to correspond (even if the datetime of the commit it points to in the zarr subdataset is later)
- `update` - since `ff-only`, in the zarr subdataset there would be no new commit. In the dandiset it becomes trickier. I would have just taken the asset modification date, but I am afraid that might lead to non-linear history :-/ (some other asset is created/modified after the zarr asset mutates internally). So maybe let's take "latest(zarr_asset_modification, last_commit_in_dandiset+1ms)"? WDYT @jwodder? My main worry is to not interfere with other logic which would rely on dates etc.
@yarikoptic So they do create commits in the outer dataset, and that can't be avoided? Should all non-Zarr assets be committed before cloning/updating Zarr subdatasets?
for clone - ideally we should take the datetime of that asset creation
Take it for what, exactly?
@yarikoptic So they do create commits in the outer dataset, and that can't be avoided?
I don't think we want to avoid commits - they are what would give us an idea about the state of a dandiset at that point in time
Should all non-Zarrs assets be committed before cloning/updating Zarr subdatasets?
Not necessarily, as we don't commit per each non-zarr asset change. Ideally there should be nothing special about a Zarr asset/subdataset in that sense
for clone - ideally we should take the datetime of that asset creation
Take it for what, exactly?
For the commit in the dandiset. Since there could be other assets to be saved, I guess we shouldn't do `commit -d .`, but leave it to the eventual call to `save` to save it?
@yarikoptic I'm confused about exactly what should be committed at what point.
- When a Zarr dataset is `clone`d into a Dandiset dataset, does this cause a commit to be created in the Dandiset dataset? If so, how should this commit be sequenced in relation to the committing of blob assets?
- Same question as above, but for `update`.
So may be lets take "latest(zarr_asset_modification, last_commit_in_dandiset+1ms)"?
Git commit timestamps have one-second resolution, so adding a millisecond is not an option.
- When a Zarr dataset is `clone`d into a Dandiset dataset, does this cause a commit to be created in the Dandiset dataset? If so, how should this commit be sequenced in relation to the committing of blob assets?
Echoing my thinking above -- let's not commit right at that point. `clone`-ing should be identical to just adding a new asset but not necessarily committing right away. Saving/committing should be sequenced as it would be for any other asset. So something like "add assetX; clone zarr as assetX+1; add assetX+2; ... and so on; `datalad save -m 'added/updated X assets'`"
Same question as above, but for update
same answer -- treat the zarr subdataset as any other asset, collapsing multiple updates across assets where our logic says to do that already (we moved away from 1-commit-per-asset a while back)
Git commit timestamps have one-second resolution, so adding a millisecond is not an option.
then 1 second? Or just remove any increment? Your choice -- anything which wouldn't throw off the logic for minting release commits
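For illustration, that rule with a one-second bump could look like this (names are illustrative, not from the actual code):

```python
from datetime import datetime, timedelta

def dandiset_commit_timestamp(
    zarr_asset_modified: datetime, last_commit_in_dandiset: datetime
) -> datetime:
    """Keep history linear: never go earlier than the previous commit."""
    # git timestamps have one-second resolution, hence the 1-second bump
    return max(zarr_asset_modified, last_commit_in_dandiset + timedelta(seconds=1))
```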
@yarikoptic
Let's not commit right at that point
How? When using the Datalad Python API, how exactly do I invoke `clone()` and `update()` so that they don't automatically create a commit in the superdataset?
- `datalad.api.clone` without providing the `dataset=` kwarg
- `zarr_subdataset.update(...)` would trigger the update in that zarr_subdataset, but would not commit in the (unknown to such a call) dandiset dataset which it is a submodule of
@yarikoptic I'm trying to write a test in which a Zarr is uploaded & backed up, then a file in the Zarr becomes a directory, a directory becomes a file, and the Zarr is uploaded & backed up again. However, when calling `Dataset.save()` after updating the modified Zarr dataset, the invocation of `git -c diff.ignoreSubmodules=none commit -m '[backups2datalad] 2 files added, 2 files deleted, checksum updated' -- .zarr-checksum changed01 changed02/file.txt changed01/file.txt changed02` (where `changed02` was previously a file and is now a directory) fails with `'changed02' does not have a commit checked out`. I am unable to create an MVCE that reproduces this problem.
Cool! Just push that test and I will try it out as well to troubleshoot datalad from there?
@yarikoptic Pushed. You can run just that test with `nox -e test -- -k test_backup_zarr_entry_conflicts`.
filed https://github.com/datalad/datalad/issues/6558 . Could you for now disable such a tricky unit test? ;)
@yarikoptic Test disabled.
@yarikoptic The `populate` command needs to be updated to support Zarr datasets. Do you have any preference for how its behavior should be adjusted?
- Should `populate` operate on the datasets in the Zarr folder or on subdatasets of Dandiset datasets?
- Should there be separate `populate` commands, one for Dandiset datasets and one for Zarr datasets?
- If there is a single `populate` command, should it be called twice (once for Dandisets, once for Zarrs), or should it handle everything in one call?
@yarikoptic Ping.
sorry for the delay... Although I do not like breeding commands, I think for now or forever we would be ok if there is a dedicated `populate-zarr` or alike command which would go through present zarr datasets and do a similar dance. It should also be added to the -cron script.
Rationale:
@yarikoptic
DECIDE: on drogon we will store them either under a dedicated `/mnt/backup/dandi/dandizarrs` (folder? super dataset?) or can `/mnt/backup/dandi/dandisets/zarrs` subdataset
What have you decided about this?
Let's just do `/mnt/backup/dandi/dandizarrs` folder for now:
- `save`-ing modified state in a datalad dataset could be expensive since those zarrs would be monsters in number of files
Having established a workflow for out-of-band backing up of regular assets (based on #103) we will approach backup of zarrs. Some notes:
- `zarr/` refers to a .zarr or .ngff folder asset which are already entering the staging server and soon will emerge in the main one. We should be ready (#126 is a stop-gap measure so we do not "pollute" our datalad dandisets)
- `zarr/` should be a DataLad subdataset
- "original" zarr datasets will correspond to their `zarr_id`s as the ones used on s3 as prefixes under the `zarr/` prefix (look at `s3://dandi-api-staging-dandisets/zarr/`)
- DECIDE: on drogon we will store them either under a dedicated `/mnt/backup/dandi/dandizarrs` (folder? super dataset?) or can `/mnt/backup/dandi/dandisets/zarrs` subdataset
- no need to organize `dandizarrs` into a repo/superdataset - since it would not reflect the entire state on the bucket anyways
- zarr subdatasets should be `remove`d if the zarr is removed in a given dandiset/path
- `.zarr-checksum` or whatever file is to contain the overall checksum should reside under `git`, not `git-annex`
where I changed the prefix and incremented the last number in the uuid (is it still legit? ;))