dandi / dandisets

735 Dandisets, 812.2 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

in dandisets and dandizarrs establish remote for web URLs with higher cost #320

Closed (yarikoptic closed 1 year ago)

yarikoptic commented 1 year ago

See http://git-annex.branchable.com/todo/Allow_for_URLs_prioritization_WITHIN___40__web__41___remote/#comment-915b0a31a1329226a6d431260326bd3d for more information. git-annex 10.20221212-103-gcfaae7e93 implements adding "sameas" remotes, which would let us state a higher cost for API URLs.

We would need to run initremote in all new dandisets and dandizarrs, and have a helper to update all already existing dandisets/dandizarrs.

jwodder commented 1 year ago

@yarikoptic So what exactly is the new initremote command to run?

yarikoptic commented 1 year ago

I haven't tried it, so I can't say exactly what it should be. Please see what Joey added (following the URL I shared), check whether it works on a sample dandiset using a recently built git-annex, and then code in the command you used.

jwodder commented 1 year ago

@yarikoptic I'm not entirely clear on the semantics around this new option. If I just run git-annex initremote --sameas=web dandiapi type=web urlinclude='*//api.dandiarchive.org/*' cost=300, is that supposed to cause all api.dandiarchive.org URLs to be given a lower priority (i.e. a higher cost) than other web URLs? Is anything else needed?

yarikoptic commented 1 year ago

I don't know exactly -- that is why it needs checking. I asked @joeyh. We might need to somehow exclude that URL from the web remote... or maybe we should just make the general web remote's cost high but provide a low-cost remote which would point to the S3 bucket URL right away.

yarikoptic commented 1 year ago

According to Joey, there should be no need for an additional urlexclude. I tested by cloning https://github.com/dandizarrs/0cb7a33b-827a-4bbd-a499-7c5f416a46cd , upgrading git-annex to a fresh 10.20230126-1~ndall+1, and then, without any other changes, getting the info file, which led to

get info [2023-01-30 17:29:08.728276346] (Utility.Process) process [181358] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
(from web...) 
[2023-01-30 17:29:08.760300256] (Utility.Url) Request {
  host                 = "api.dandiarchive.org"
  port                 = 443
  secure               = True
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/10.20230126-1~ndall+1")]
  path                 = "/api/zarr/0cb7a33b-827a-4bbd-a499-7c5f416a46cd.zarr/info"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

[2023-01-30 17:29:09.120919925] (Utility.Url) Request {
  host                 = "dandiarchive.s3.amazonaws.com"
  port                 = 443
  secure               = True
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/10.20230126-1~ndall+1")]
  path                 = "/zarr/0cb7a33b-827a-4bbd-a499-7c5f416a46cd/info"
  queryString          = "?versionId=WGMVZL8OxtIyjZjFFr.6rSnMz1P4lcD4"
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1

So: first the API, and then a redirect to S3. After dropping that file and running

git-annex initremote --sameas=web dandiapi type=web urlinclude='*//api.dandiarchive.org/*' cost=300

redoing the get led to the desired

get info [2023-01-30 17:29:54.55095647] (Utility.Process) process [181693] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
(from web...) 
[2023-01-30 17:29:54.57378069] (Utility.Url) Request {
  host                 = "dandiarchive.s3.amazonaws.com"
  port                 = 443
  secure               = True
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/10.20230126-1~ndall+1")]
  path                 = "/zarr/0cb7a33b-827a-4bbd-a499-7c5f416a46cd/info"
  queryString          = "?versionId=WGMVZL8OxtIyjZjFFr.6rSnMz1P4lcD4"
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
}

[2023-01-30 17:29:55.067732695] (Annex.Perms) freezing content .git/annex/objects/w8/gv/MD5E-s2662--b2fb896517ac5aa01b292451b7cd0276/MD5E-s2662--b2fb896517ac5aa01b292451b7cd0276
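
To compare the two debug traces above without reading every Request block, the contacted hosts can be pulled out of the output. This is only a sketch: the function name request_hosts is made up here, and it assumes the debug output keeps the `host = "..."` layout shown in the logs above.

```shell
# Print the host of every HTTP request found in git-annex debug output
# read from stdin, in the order the requests were made.
# Assumes the 'host = "..."' layout seen in the traces above.
request_hosts() {
    sed -n 's/^[[:space:]]*host[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p'
}

# Hypothetical usage inside a clone (debug output goes to stderr):
#   git annex get --debug info 2>&1 | request_hosts
```

Before the initremote call this would be expected to list api.dandiarchive.org followed by dandiarchive.s3.amazonaws.com, and afterwards only the S3 host.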

So running that initremote line should be sufficient. We should run it in every new dandiset/dandizarr and also in all already existing ones. (I found no easy way to check whether it is already set up besides looking into remote.log on the git-annex branch.)
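
A helper for the already existing repositories could combine the remote.log check with the initremote call, roughly as sketched below. The function names are hypothetical, and the check assumes that remote.log stores one remote per line as space-separated key=value fields including a name= field; that format should be verified against a real repository before relying on it.

```shell
# Succeed if remote.log text on stdin already describes a remote named "$1".
# Assumption: one remote per line, space-separated key=value fields,
# one of which is name=<remote name>.
remote_log_has() {
    grep -Eq "(^| )name=$1( |\$)"
}

# In the repository in the current directory, add the higher-cost
# "dandiapi" remote unless remote.log already lists it.
ensure_dandiapi_remote() {
    if git show git-annex:remote.log 2>/dev/null | remote_log_has dandiapi; then
        echo "dandiapi remote already set up"
    else
        git-annex initremote --sameas=web dandiapi type=web \
            urlinclude='*//api.dandiarchive.org/*' cost=300
    fi
}

# Hypothetical usage over a directory of existing clones:
#   for d in */; do (cd "$d" && ensure_dandiapi_remote); done
```

Because the check runs first, the helper can be rerun safely over the same set of clones.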

jwodder commented 1 year ago

@yarikoptic Unfortunately, the version of git-annex in conda-forge is still at 10.20220927, so I can't run the command in the extant datasets.
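
Until a new enough git-annex is available everywhere, the setup step could be guarded by a version check along these lines. This is a sketch: annex_version_at_least is a made-up name, it compares only the leading "major.date" part of the version string, and using 10.20230126 as the threshold is an assumption based on the release tested earlier in this thread.

```shell
# Succeed if version string $1 (e.g. "10.20230126-1~ndall+1") is at least
# $2 (e.g. "10.20230126"). Only the leading "<major>.<date>" part is
# compared; any suffix starting with '-', '~' or '+' is ignored.
annex_version_at_least() {
    have=${1%%[-~+]*}
    want=${2%%[-~+]*}
    have_major=${have%%.*}; have_date=${have#*.}
    want_major=${want%%.*}; want_date=${want#*.}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_date" -ge "$want_date" ]; }
}

# Hypothetical usage (threshold is an assumption, see above):
#   if annex_version_at_least "$(git annex version --raw)" 10.20230126; then
#       echo "git-annex is new enough for --sameas=web with urlinclude"
#   fi
```

With this comparison the conda-forge version 10.20220927 is rejected, and so is a development snapshot such as 10.20221212-103-gcfaae7e93, since its suffix is ignored.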