dandi / dandisets

783 Dandisets, 819.0 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

migrate key urls over to new dandi-api from the girder blob store #34

Closed: yarikoptic closed this issue 3 years ago

yarikoptic commented 3 years ago

girder will eventually be deprecated. We will need to do the following

whenever girder is "frozen" and all data has been uploaded to dandi-api

Which URL?

WDYT @satra -- which of the two URLs should we use, or maybe I have missed another option?

satra commented 3 years ago

i was thinking about this yesterday as i was trying to figure out when and what contentURL would go into the asset metadata.

i think you should use /dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/. it's ok if assets are mutable (for now). you have no guarantees that someone else did not delete things from the dandiset, so you will need to comb through the entire asset list. and if you can iterate over the asset list, you will get path and uuid, and can therefore generate the download link.
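For concreteness, a minimal sketch of that comb-through, assuming the paginated REST layout of api.dandiarchive.org (the `results`/`next` pagination fields and the `uuid`/`path` keys are assumptions based on the endpoint quoted above):

```python
import requests

API = "https://api.dandiarchive.org/api"
dandiset, version = "000003", "draft"

# Page through the asset list and derive a download URL for each asset.
url = f"{API}/dandisets/{dandiset}/versions/{version}/assets/"
while url:
    page = requests.get(url).json()
    for asset in page["results"]:  # "results"/"next" pagination is assumed
        download = (f"{API}/dandisets/{dandiset}/versions/{version}"
                    f"/assets/{asset['uuid']}/download/")
        print(asset["path"], download)
    url = page.get("next")  # None once the last page is reached
```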

yarikoptic commented 3 years ago

i think you should use /dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/ it's ok if assets are mutable (for now).

nah -- it would just be a waste of bandwidth for git-annex, which would download a file only to discard it when the checksum does not match (in comparison to failing right away if the blob is no longer available), and it would be impossible to reliably verify that the original content is still available before dropping it.
If we would like to RF the server to make assets immutable (i.e. go back to the original design) -- we should put it on the short-term roadmap.

As for asset metadata in exported manifests on S3 -- those most likely should be URLs pointing directly to S3, to make the manifests usable without interacting with the dandi archive.

So maybe contentURL should be a list which would have all 3 ;-)?

satra commented 3 years ago

why would it be a waste of bandwidth? you can download the metadata, which has all the pieces. you can check locally whether the remote checksum is the same as the local checksum, and update or discard as necessary.

can't you do all the operations based on the metadata without downloading any file? (as we have discussed before about creating git-annex datasets?)
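A minimal sketch of such a metadata-only check: compare the checksum embedded in the local git-annex key against the digest served in the asset metadata, without fetching any file content. The metadata URL shape and the `digest` field are assumptions; `git annex lookupkey` and the `SHA256E-s<size>--<digest>.<ext>` key layout are standard git-annex.

```python
import subprocess
import requests

def local_key_digest(path):
    # git-annex keys look like "SHA256E-s<size>--<hexdigest>.<ext>";
    # pull the hex digest out of the key backing this file.
    key = subprocess.check_output(
        ["git", "annex", "lookupkey", path], text=True).strip()
    return key.split("--", 1)[1].split(".", 1)[0]

# The metadata URL shape and the digest field name are assumptions.
meta = requests.get(
    "https://api.dandiarchive.org/api/dandisets/000003/versions/draft"
    "/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/").json()
remote_digest = meta["digest"]  # assumed field

path = "sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb"
print("match" if remote_digest == local_key_digest(path) else "differs")
```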

yarikoptic commented 3 years ago

Are you suggesting developing a dandi special remote for git-annex to access files from the archive via two calls (first the metadata, then the file), tightly binding the annex key backend to the choice of checksumming in the archive?

And all that just to overcome a current shortcoming in the archive design and to gain (only potentially) better accountability? IMHO that is overkill for little to no gain.

yarikoptic commented 3 years ago

#37 is already adding new blob URLs for new assets. Now we just need to do a one-time migration of the existing girder store URLs to the blob store:
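A rough sketch of what that one-time migration could look like via DataLad's Python API. The `urls_for` and `new_urls` helpers and the girder-URL test are hypothetical placeholders; `rm_url` and `add_url_to_file` are the AnnexRepo methods discussed below (whether `batch=True` actually avoids per-file commits is debated further down):

```python
from datalad.api import Dataset

ds = Dataset("/tmp/000003")
repo = ds.repo  # an AnnexRepo

for path in repo.get_annexed_files():
    for url in urls_for(path):          # hypothetical: URLs known for this file
        if "girder-assetstore" in url:  # assumed marker of an old girder URL
            repo.rm_url(path, url)
    for url in new_urls(path):          # hypothetical: api + versioned-S3 URLs
        repo.add_url_to_file(path, url, batch=True)
```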

jwodder commented 3 years ago

@yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?

instantiate AnnexRepo with always_commit=False

How do I do that when retrieving the AnnexRepo via dataset.repo?

well -- both rmurl and registerurl have --batch mode but it seems we have not exposed it in DataLad's interface

add_url_to_file claims to support batch=True. Should I be using that or registerurl?

yarikoptic commented 3 years ago

@yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?

Let's add both. Unfortunately I found no way to prioritize the direct one over the api one (why torture the api server without need), so I filed a todo against git-annex: https://git-annex.branchable.com/todo/assign_costs_per_URL_or_better_repo-wide___40__regexes__41__/ . When/if implemented, we could then prioritize the direct one. Note: please add the direct S3 urls with ?versionId=. Relevant discussion/reasoning also at https://github.com/dandi/dandi-api/issues/231

an alternative could be to add only the direct S3 urls now and then, once prioritization is made possible, add the api urls. But I am not sure we should delay that.

instantiate AnnexRepo with always_commit=False

How do I do that when retrieving the AnnexRepo via dataset.repo?

well -- both rmurl and registerurl have --batch mode but it seems we have not exposed it in DataLad's interface

add_url_to_file claims to support batch=True. Should I be using that or registerurl?

I have tested that we can consecutively call `annex addurl` on an existing file without causing a download (I vaguely remember that support for this was added to git-annex a while back):

```shell
$> datalad install ///dandi/dandisets/000003
[INFO   ] Scanning for unlocked files (this may take some time)
[INFO   ] access to 1 dataset sibling dandi-dandisets-dropbox not auto-enabled, enable with:
|         datalad siblings -d "/tmp/000003" enable -s dandi-dandisets-dropbox
install(ok): /tmp/000003 (dataset)

(dev3) (datalad-test-annex) 1 77409 [2].....................................:Fri 30 Apr 2021 10:05:13 AM EDT:.
lena:/tmp
$> cd 000003
dandiset.yaml    sub-YutaMouse23/ sub-YutaMouse37/ sub-YutaMouse39/ sub-YutaMouse41/ sub-YutaMouse44/ sub-YutaMouse51/ sub-YutaMouse55/ sub-YutaMouse57/
sub-YutaMouse20/ sub-YutaMouse33/ sub-YutaMouse38/ sub-YutaMouse40/ sub-YutaMouse42/ sub-YutaMouse45/ sub-YutaMouse54/ sub-YutaMouse56/

(dev3) (datalad-test-annex) 1 77410 [2].....................................:Fri 30 Apr 2021 10:05:17 AM EDT:.
(git-annex)lena:/tmp/000003[master]
$> cd sub-YutaMouse20
sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140325_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140328_behavior+ecephys.nwb@
sub-YutaMouse20_ses-YutaMouse20-140324_behavior+ecephys.nwb@ sub-YutaMouse20_ses-YutaMouse20-140327_behavior+ecephys.nwb@

(dev3) (datalad-test-annex) 1 77411 [2].....................................:Fri 30 Apr 2021 10:05:26 AM EDT:.
(git-annex)lena:/tmp/000003[master]sub-YutaMouse20
$> git annex addurl --file sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/
addurl https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/ ok
(recording state in git...)

(dev3) (datalad-test-annex) 1 77412 [2].....................................:Fri 30 Apr 2021 10:06:10 AM EDT:.
(git-annex)lena:/tmp/000003[master]sub-YutaMouse20
$> git annex whereis sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
whereis sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb (2 copies)
    00000000-0000-0000-0000-000000000001 -- web
    b7fcf214-e492-4f2c-8789-708af9fd4656 -- dandi@drogon:/mnt/backup/dandi/dandisets/000003

  The following untrusted locations may also have copies:
    727f466f-60c3-4778-90b2-b2332856c2f8 -- dandi-dandisets-dropbox

  web: https://api.dandiarchive.org/api/dandisets/000003/versions/draft/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/
  web: https://dandiarchive.s3.amazonaws.com/girder-assetstore/6d/45/6d459d7889b04bf2a80d3211aa54ae39?versionId=qvIEoVh34LqShmSTloIQVBGmgE_TJBQl
ok
```

so you should be able to use registerurl(..., batch=True) and forget about always_commit.
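Since the exchange above suggests DataLad does not expose registerurl's --batch mode, one fallback sketch is to drive a single long-running `git annex registerurl --batch` process directly, feeding it "key url" lines on stdin; the key and URLs below are made-up placeholders:

```python
import subprocess

# Feed "key url" lines to one long-running registerurl process,
# letting git-annex batch up its git-annex-branch commits itself.
pairs = [
    # (annex key, URL) -- both values here are made-up placeholders
    ("SHA256E-s1234--deadbeef.nwb",
     "https://api.dandiarchive.org/api/dandisets/000003/versions/draft"
     "/assets/05a80228-04a7-4c3b-88d3-44a0c6b831b1/download/"),
    ("SHA256E-s1234--deadbeef.nwb",
     "https://dandiarchive.s3.amazonaws.com/blobs/6d/45/6d459d78"
     "?versionId=qvIEoVh34LqShmSTloIQVBGmgE_TJBQl"),
]
with subprocess.Popen(["git", "annex", "registerurl", "--batch"],
                      stdin=subprocess.PIPE, text=True) as proc:
    for key, url in pairs:
        proc.stdin.write(f"{key} {url}\n")
    proc.stdin.close()
    proc.wait()
```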

As a double check, after running this on some sample dandiset, just look at how many new commits you get in the git-annex branch -- there should not be as many as a "commit per file".
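A small sketch of that double check, counting commits on the git-annex branch before and after the migration (the repo path is a placeholder):

```python
import subprocess

def annex_branch_commits(repo_path):
    # Count the commits currently on the git-annex branch.
    return int(subprocess.check_output(
        ["git", "-C", repo_path, "rev-list", "--count", "git-annex"],
        text=True))

before = annex_branch_commits("/tmp/000003")
# ... run the URL migration here ...
after = annex_branch_commits("/tmp/000003")
print(f"{after - before} new commits on the git-annex branch")
```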

jwodder commented 3 years ago

@yarikoptic I used ds.repo.add_url_to_file() (ds.repo doesn't have a registerurl(), at least according to the documentation) with batch=True, and it ended up creating a separate commit in the git-annex branch for each file. What do you recommend instead, and how do I reset a dataset to its upstream state?

yarikoptic commented 3 years ago

replied in a PR