dandi / dandisets

783 Dandisets, 819.0 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

migrate key urls over to new dandi-api from the girder blob store #34

Closed: yarikoptic closed this issue 3 years ago

yarikoptic commented 3 years ago

girder will eventually be deprecated. We will need to do the following

whenever girder is "frozen" and all data has been uploaded to dandi-api

Which URL?

WDYT @satra -- which of the two URLs should we use, or maybe I have missed another option?

satra commented 3 years ago

i was thinking about this yesterday as i was trying to figure out when and what contentURL would go into the asset metadata.

i think you should use /dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/. it's ok if assets are mutable (for now). you have no guarantees that someone else did not delete things from the dandiset, so you will need to comb through the entire asset list. and if you can iterate over the asset list, you will get path and uuid, and can therefore generate the download link.
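For concreteness, a minimal sketch of that comb-through, assuming the paginated REST layout of api.dandiarchive.org (the `results`/`next` pagination fields and the `uuid`/`path` keys are assumptions based on the endpoint quoted above):

```python
import requests

API = "https://api.dandiarchive.org/api"
dandiset, version = "000003", "draft"

# Page through the asset list and derive a download URL for each asset.
url = f"{API}/dandisets/{dandiset}/versions/{version}/assets/"
while url:
    page = requests.get(url).json()
    for asset in page["results"]:  # "results"/"next" pagination is assumed
        download = (f"{API}/dandisets/{dandiset}/versions/{version}"
                    f"/assets/{asset['uuid']}/download/")
        print(asset["path"], download)
    url = page.get("next")  # None once the last page is reached
```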

yarikoptic commented 3 years ago

i think you should use /dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/ it's ok if assets are mutable (for now).

nah -- it would just be a waste of bandwidth for git-annex, which would download a file only to discard it when the checksum does not match (in comparison to failing right away if the blob is no longer available), and it would be impossible to reliably verify that the original content is still available before dropping it.
If we would like to RF the server to make assets immutable (i.e. go back to the original design) -- we should put it on the short-term roadmap.

As for asset metadata in exported manifests on S3 -- those most likely should be URLs pointing directly to S3, to make the manifests usable without interacting with the dandi archive.

So maybe contentURL should be a list which would have all 3 ;-)?

satra commented 3 years ago

why would it be a waste of bandwidth? you can download the metadata, which has all the pieces. you can check locally whether the remote checksum is the same as the local checksum, and update or discard as necessary.

can't you do all the operations based on the metadata without downloading any file? (as we have discussed before about creating git-annex datasets?)
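A minimal sketch of such a metadata-only check: compare the checksum embedded in the local git-annex key against the digest served in the asset metadata, without fetching any file content. The metadata URL shape and the `digest` field are assumptions; `git annex lookupkey` and the `SHA256E-s<size>--<digest>.<ext>` key layout are standard git-annex.

```python
import subprocess
import requests

def local_key_digest(path):
    # git-annex keys look like "SHA256E-s<size>--<hexdigest>.<ext>";
    # pull the hex digest out of the key backing this file.
    key = subprocess.check_output(
        ["git", "annex", "lookupkey", path], text=True).strip()
    return key.split("--", 1)[1].split(".", 1)[0]

# The metadata URL shape and the digest field name are assumptions.
meta = requests.get(
    "https://api.dandiarchive.org/api/dandisets/000003/versions/draft"
    "/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/").json()
remote_digest = meta["digest"]  # assumed field

path = "sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb"
print("match" if remote_digest == local_key_digest(path) else "differs")
```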

yarikoptic commented 3 years ago

Are you suggesting developing a dandi special remote for git-annex to access files from the archive via two calls (first the metadata, then the file), tightly binding the annex key backend to the choice of checksumming in the archive?

And all that just to overcome a current shortcoming in the archive design and to gain (only potentially) better accountability? IMHO that is overkill for little to no gain.

yarikoptic commented 3 years ago

#37 is already adding new blob URLs for new assets. Now we just need to do a one-time migration of the existing girder store URLs to the blob store:
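A rough sketch of what that one-time migration could look like via DataLad's Python API. The `urls_for` and `new_urls` helpers and the girder-URL test are hypothetical placeholders; `rm_url` and `add_url_to_file` are the AnnexRepo methods discussed below (whether `batch=True` actually avoids per-file commits is debated further down):

```python
from datalad.api import Dataset

ds = Dataset("/tmp/000003")
repo = ds.repo  # an AnnexRepo

for path in repo.get_annexed_files():
    for url in urls_for(path):          # hypothetical: URLs known for this file
        if "girder-assetstore" in url:  # assumed marker of an old girder URL
            repo.rm_url(path, url)
    for url in new_urls(path):          # hypothetical: api + versioned-S3 URLs
        repo.add_url_to_file(path, url, batch=True)
```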

jwodder commented 3 years ago

@yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?

instantiate AnnexRepo with always_commit=False

How do I do that when retrieving the AnnexRepo via dataset.repo?

well -- both rmurl and registerurl have --batch mode but it seems we have not exposed it in DataLad's interface

add_url_to_file claims to support batch=True. Should I be using that or registerurl?

yarikoptic commented 3 years ago

@yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?

Let's add both. Unfortunately I found no way to prioritize the direct one over the api one (why torture the api server without need), so I filed a todo against git-annex: https://git-annex.branchable.com/todo/assign_costs_per_URL_or_better_repo-wide___40__regexes__41__/ . When/if implemented, we could then prioritize the direct one. Note: please add the direct S3 urls with ?versionId=. Relevant discussion/reasoning also at https://github.com/dandi/dandi-api/issues/231

an alternative could be to add only the direct S3 urls now and then, once prioritization is made possible, add the api urls. But I am not sure we should delay that.

instantiate AnnexRepo with always_commit=False

How do I do that when retrieving the AnnexRepo via dataset.repo?

well -- both rmurl and registerurl have --batch mode but it seems we have not exposed it in DataLad's interface

add_url_to_file claims to support batch=True. Should I be using that or registerurl?

I have tested that we can consecutively call `annex addurl` on an existing file without causing a download (I vaguely remember that support for this was added to git-annex a while back):

```shell
$> datalad install ///dandi/dandisets/000003
[INFO   ] Scanning for unlocked files (this may take some time)
[INFO   ] access to 1 dataset sibling dandi-dandisets-dropbox not auto-enabled, enable with:
|         datalad siblings -d "/tmp/000003" enable -s dandi-dandisets-dropbox
install(ok): /tmp/000003 (dataset)

(dev3) (datalad-test-annex) 1 77409 [2].....................................:Fri 30 Apr 2021 10:05:13 AM EDT:.
lena:/tmp
$> cd 000003
dandiset.yaml    sub-YutaMouse23/ sub-YutaMouse37/ sub-YutaMouse39/ sub-YutaMouse41/ sub-YutaMouse44/ sub-YutaMouse51/ sub-YutaMouse55/ sub-YutaMouse57/
sub-YutaMouse20/ sub-YutaMouse33/ sub-YutaMouse38/ sub-YutaMouse40/ sub-YutaMouse42/ sub-YutaMouse45/ sub-YutaMouse54/ sub-YutaMouse56/

(dev3) (datalad-test-annex) 1 77410 [2].....................................:Fri 30 Apr 2021 10:05:17 AM EDT:.
(git-annex)lena:/tmp/000003[master]
$> cd sub-YutaMouse20
sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140325_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140328_behavior+ecephys.nwb@
sub-YutaMouse20_ses-YutaMouse20-140324_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140327_behavior+ecephys.nwb@

(dev3) (datalad-test-annex) 1 77411 [2].....................................:Fri 30 Apr 2021 10:05:26 AM EDT:.
(git-annex)lena:/tmp/000003[master]sub-YutaMouse20
$> git annex addurl --file sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/
addurl https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/ ok
(recording state in git...)

(dev3) (datalad-test-annex) 1 77412 [2].....................................:Fri 30 Apr 2021 10:06:10 AM EDT:.
(git-annex)lena:/tmp/000003[master]sub-YutaMouse20
$> git annex whereis sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
whereis sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb (2 copies)
    00000000-0000-0000-0000-000000000001 -- web
    b7fcf214-e492-4f2c-8789-708af9fd4656 -- dandi@drogon:/mnt/backup/dandi/dandisets/000003

  The following untrusted locations may also have copies:
    727f466f-60c3-4778-90b2-b2332856c2f8 -- dandi-dandisets-dropbox

  web: https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/
  web: https://dandiarchive.s3.amazonaws.com/girder-assetstore/6d/45/6d459d7889b04bf2a80d3211aa54ae39?versionId=qvIEoVh34LqShmSTloIQVBGmgE_TJBQl
ok
```

so you should be able to use registerurl(..., batch=True) and forget about always_commit.
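Since the exchange above suggests DataLad does not expose registerurl's --batch mode, one fallback sketch is to drive a single long-running `git annex registerurl --batch` process directly, feeding it "key url" lines on stdin; the key and URLs below are made-up placeholders:

```python
import subprocess

# Feed "key url" lines to one long-running registerurl process,
# letting git-annex batch up its git-annex-branch commits itself.
pairs = [
    # (annex key, URL) -- both values here are made-up placeholders
    ("SHA256E-s1234--deadbeef.nwb",
     "https://api.dandiarchive.org/api/dandisets/000003/versions/draft"
     "/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/"),
    ("SHA256E-s1234--deadbeef.nwb",
     "https://dandiarchive.s3.amazonaws.com/blobs/6d/45/6d459d78"
     "?versionId=qvIEoVh34LqShmSTloIQVBGmgE_TJBQl"),
]
with subprocess.Popen(["git", "annex", "registerurl", "--batch"],
                      stdin=subprocess.PIPE, text=True) as proc:
    for key, url in pairs:
        proc.stdin.write(f"{key} {url}\n")
    proc.stdin.close()
    proc.wait()
```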

As a double check, after running this on some sample dandiset, just look at how many new commits you get in the git-annex branch -- there should not be as many as a "commit per file".
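A small sketch of that double check, counting commits on the git-annex branch before and after the migration (the repo path is a placeholder):

```python
import subprocess

def annex_branch_commits(repo_path):
    # Count the commits currently on the git-annex branch.
    return int(subprocess.check_output(
        ["git", "-C", repo_path, "rev-list", "--count", "git-annex"],
        text=True))

before = annex_branch_commits("/tmp/000003")
# ... run the URL migration here ...
after = annex_branch_commits("/tmp/000003")
print(f"{after - before} new commits on the git-annex branch")
```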

jwodder commented 3 years ago

@yarikoptic I used ds.repo.add_url_to_file() (ds.repo doesn't have a registerurl(), at least according to the documentation) with batch=True, and it ended up creating a separate commit in the git-annex branch for each file. What do you recommend instead, and how do I reset a dataset to its upstream state?

yarikoptic commented 3 years ago

replied in a PR