I was thinking about this yesterday as I was trying to figure out when and what contentURL would go into the asset metadata.

I think you should use `/dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/`.

It's ok if assets are mutable (for now). You have no guarantees that someone else did not delete things from the dandiset, so you will need to comb through the entire asset list. And if you can iterate over the asset list, you will get path and uuid, and therefore can generate the download link.
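For illustration, a minimal sketch of that idea (base URL, endpoint paths, field names, and pagination layout are assumptions inferred from this thread, not a verified client):

```python
import requests

API = "https://api.dandiarchive.org/api"     # assumed base URL
dandiset_id, version = "000027", "draft"     # hypothetical identifiers

# Walk the (paginated) asset list of a version and derive a download URL per asset.
url = f"{API}/dandisets/{dandiset_id}/versions/{version}/assets/"
while url:
    page = requests.get(url).json()
    for asset in page["results"]:
        download_url = (
            f"{API}/dandisets/{dandiset_id}/versions/{version}"
            f"/assets/{asset['uuid']}/download/"
        )
        print(asset["path"], download_url)
    url = page.get("next")                   # DRF-style pagination assumed
```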
> I think you should use `/dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/`.
>
> It's ok if assets are mutable (for now).
Nah -- it would just be a waste of bandwidth for git-annex, which would download a file only to discard it because the checksum would not match (as opposed to failing right away if the blob is no longer available), and it would be impossible to verify reliably that the original content is still available before dropping it.

If we would like to RF the server to make assets immutable (i.e. go back to the original design), we should put it on the short-term roadmap.

As for asset metadata in exported manifests on S3 -- those most likely should be URLs pointing directly to S3, to make those manifests usable without interacting with the DANDI archive.

So maybe contentURL should be a list which would have all 3 ;-)?
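Purely for illustration, such a metadata record could look roughly like this (all values below are made up; the per-blob endpoint is only the one proposed later in this thread):

```python
# Illustrative only: an asset's contentUrl as a list carrying all three kinds of URLs
# discussed here (per-asset API endpoint, per-blob API endpoint, direct S3 URL).
asset_metadata = {
    "path": "sub-01/sub-01_ses-01.nwb",
    "contentUrl": [
        "https://api.dandiarchive.org/api/dandisets/000027/versions/draft"
        "/assets/0d2ff47e-example/download/",                        # per-asset API endpoint
        "https://api.dandiarchive.org/api/blob/<sha256>/download",   # proposed per-blob endpoint
        "https://dandiarchive.s3.amazonaws.com/blobs/<key>?versionId=<versionId>",  # direct S3
    ],
}
```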
Why would it be a waste of bandwidth? You can download the metadata, which has all the pieces. You can check locally whether the remote checksum is the same as the local checksum and update or discard as necessary.

Can't you do all the operations based on the metadata without downloading any file? (as we have discussed before about creating git-annex datasets?)
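A rough sketch of that metadata-only check, using an `asset_metadata` record like the one above (the digest field name is an assumption):

```python
import hashlib
from pathlib import Path

def local_sha256(path: Path) -> str:
    """Compute the SHA-256 of a local file in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# `asset_metadata` as fetched from the API; the digest key below is an assumption.
remote_sha256 = asset_metadata["digest"]["dandi:sha2-256"]
if remote_sha256 != local_sha256(Path("sub-01/sub-01_ses-01.nwb")):
    # content changed (or disappeared) remotely: update or discard the local copy as needed
    ...
```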
Are you suggesting to develop a dandi special remote for git-annex to access files from the archive via two calls (first metadata, then the file), tightly binding the annex key backend to the choice of checksumming in the archive?

And that is just to overcome a current shortcoming in the archive design and to gain (only potentially) better accountability? IMHO overkill for little to no gain.
`blobs` urls for new assets. Now we just need to do a one-time migration of existing girder store urls to the blobs store:

- `git annex rmurl file url` (remove the girder url) from the file after adding a replacement blobs url
- instantiate `AnnexRepo` with `always_commit=False`, or just `call_annex` with `'-c', 'annex.alwayscommit=false'` among the cmdline parameters, so we do not breed thousands of commits in the git-annex branch (well -- both `rmurl` and `registerurl` have `--batch` mode, but it seems we have not exposed it in DataLad's interface, although it should be relatively easy to interface).
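A rough sketch of that migration loop, assembled only from the pieces named above (the URL mapping is hypothetical and the exact DataLad signatures are not verified here, so treat this as an outline rather than a tested recipe):

```python
from datalad.support.annexrepo import AnnexRepo

# avoid one commit to the git-annex branch per operation
repo = AnnexRepo("/path/to/dandiset", always_commit=False)

# `url_map` is a hypothetical {path: (girder_url, blob_url)} mapping prepared elsewhere
for path, (girder_url, blob_url) in url_map.items():
    repo.add_url_to_file(path, blob_url, batch=True)   # add the replacement blobs-store URL first
    repo.call_annex(["rmurl", path, girder_url])       # then drop the girder URL

# alternative to always_commit=False, as suggested above: include
# '-c', 'annex.alwayscommit=false' among the call_annex() cmdline parameters
```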
@yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?
> instantiate `AnnexRepo` with `always_commit=False`

How do I do that when retrieving the `AnnexRepo` via `dataset.repo`?
> well -- both `rmurl` and `registerurl` have `--batch` mode but it seems we have not exposed it in DataLad's interface

`add_url_to_file` claims to support `batch=True`. Should I be using that or `registerurl`?
> @yarikoptic Do you just want the new S3 URLs to be added to the assets, or should the download URLs be included as well?
Let's add both. Unfortunately I found no way to prioritize the direct one over the API one (why torture the API server without need), so I filed a todo against git-annex: https://git-annex.branchable.com/todo/assign_costs_per_URL_or_better_repo-wide___40__regexes__41__/ . When/if implemented, we could then prioritize the direct one. Note: please add direct S3 urls with `?versionId=`. Relevant discussion/reasoning also at https://github.com/dandi/dandi-api/issues/231

An alternative could be: add only direct S3 urls now, and then, after prioritization is made possible, add the API urls. But I am not sure we should delay that.
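A minimal sketch of registering both URLs for one annexed file, reusing the `repo` handle from the migration sketch above (all identifiers in the URLs are made up; the direct S3 URL carries `?versionId=` to pin the exact object version):

```python
# Made-up identifiers, only to show the two URLs to be added per file.
direct_s3_url = ("https://dandiarchive.s3.amazonaws.com/blobs/1a2/b3c/1a2b3c4d-example"
                 "?versionId=ExampleVersionId123")
api_url = ("https://api.dandiarchive.org/api/dandisets/000027/versions/draft"
           "/assets/0d2ff47e-example/download/")

for url in (direct_s3_url, api_url):
    repo.add_url_to_file("sub-01/sub-01_ses-01.nwb", url, batch=True)
```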
> > instantiate `AnnexRepo` with `always_commit=False`
>
> How do I do that when retrieving the `AnnexRepo` via `dataset.repo`?
>
> > well -- both `rmurl` and `registerurl` have `--batch` mode but it seems we have not exposed it in DataLad's interface
>
> `add_url_to_file` claims to support `batch=True`. Should I be using that or `registerurl`?
So you should be able to use `registerurl(..., batch=True)` and forget about `always_commit`. As a double check, after doing it on some sample dandiset, just see how many new commits you get in the `git-annex` branch -- there should not be as many as a "commit per file".
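One possible way to do that double check (a sketch; the path is a placeholder): count the commits on the `git-annex` branch before and after running the migration on the sample dandiset.

```python
import subprocess

def annex_branch_commits(repo_path: str) -> int:
    """Count the commits currently on the git-annex branch of a local clone."""
    out = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--count", "git-annex"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

before = annex_branch_commits("/path/to/dandiset")   # placeholder path
# ... run the URL migration on the sample dandiset ...
after = annex_branch_commits("/path/to/dandiset")
print(f"new commits on git-annex branch: {after - before}")  # should be far fewer than the number of files
```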
@yarikoptic I used `ds.repo.add_url_to_file()` (`ds.repo` doesn't have a `registerurl()`, at least according to the documentation) with `batch=True`, and it ended up creating a separate commit in the `git-annex` branch for each file. What do you recommend instead, and how do I reset a dataset to its upstream state?
Replied in a PR.
Girder will eventually be deprecated. We will need to do the following:

- && `git annex` remove url girder blob store urls (note that it might actually need git-annex development: https://git-annex.branchable.com/todo/unregisterurl_KEY_URL/?updated)
- dandi-api store will not have history beyond released versions and mutable state in "draft"
- whenever girder is "frozen" and all data must have been uploaded to dandi-api

Which URL?

- `/dandisets/{version__dandiset__pk}/versions/{version__version}/assets/{uuid}/download/` endpoint for assets of "draft" dandisets, because in the current implementation assets are mutable.
- `GET /blob/{sha256}/download` (to mirror how it is for an asset) API call, which I think we should have anyways: filed https://github.com/dandi/dandi-api/issues/135. This way we could at least have accounting at the level of "blobs" (not dandisets/assets) for access to data from datalad dandisets.

WDYT @satra, which of the two URLs should we use, or maybe I have missed another opportunity?
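To make the two candidates concrete, a small illustration (identifiers are made up; the per-blob endpoint is only the one proposed in dandi-api#135, not an existing call):

```python
# Made-up example: the shape of the two candidate download URLs discussed above.
API = "https://api.dandiarchive.org/api"

# option 1: per-asset endpoint of the (mutable) "draft" version
asset_download = f"{API}/dandisets/000027/versions/draft/assets/0d2ff47e-example/download/"

# option 2: proposed per-blob endpoint, addressed by content checksum
sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"  # example digest
blob_download = f"{API}/blob/{sha256}/download"
```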