OpenNeuroLab / metasearch

OpenNeuroLab MetaSearch App
https://openneurolab.github.io/metasearch
Apache License 2.0

replace URLs with versioned urls where possible since some are 'disappearing' already #15

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

What would you like to do:

While preparing a datalad dataset we ran into a bunch of URLs 404ing since they were deleted from the bucket. But it seems versioning was enabled on the bucket after they were added and before they were removed, so those versions (or some other versions) may still be available if a null version ID is provided, e.g.

$> wget -S 'http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz?versionId=null' 
--2018-03-23 09:06:04--  http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz?versionId=null
Resolving fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)... 52.216.133.139
Connecting to fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)|52.216.133.139|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  x-amz-id-2: jv1iiXrsK4IGUiRUAESIivfdxWabFalvSyDeW5SeHN0fpqfqY21l50xXf81cqvEsso8sBd8UOVA=
  x-amz-request-id: FA464ADD438B76F3
  Date: Fri, 23 Mar 2018 13:06:05 GMT
  Last-Modified: Mon, 17 Oct 2016 19:49:07 GMT
  ETag: "f71962c9688a8cc17e4e6ddff40c1946"
  x-amz-version-id: null
  Accept-Ranges: bytes
  Content-Type: application/octet-stream
  Content-Length: 3777778
  Server: AmazonS3
Length: 3777778 (3,6M) [application/octet-stream]
Saving to: ‘T1.mgz?versionId=null’

T1.mgz?versionId=null                                    100%[================================================================================================================================>]   3,60M  1,21MB/s    in 3,0s    

2018-03-23 09:06:07 (1,21 MB/s) - ‘T1.mgz?versionId=null’ saved [3777778/3777778]

$> wget -S 'http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz'               
--2018-03-23 09:13:40--  http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz
Resolving fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)... 54.231.33.131
Connecting to fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)|54.231.33.131|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 404 Not Found
  x-amz-request-id: D524E765315BF904
  x-amz-id-2: A49WIVJJZJB+N92BpqNIiSt75osl29SojPLHKzvgX1XPZRumO+43YGBjwwfPSEYWrTBCBwmxqX4=
  x-amz-delete-marker: true
  x-amz-version-id: ZT77s.ror9NN7Yt7bjGtH5h36leBw8Yp
  Content-Type: application/xml
  Transfer-Encoding: chunked
  Date: Fri, 23 Mar 2018 13:13:39 GMT
  Server: AmazonS3
2018-03-23 09:13:40 ERROR 404: Not Found.
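
Note the difference between the two responses above: the plain URL returns 404 together with `x-amz-delete-marker: true`, i.e. the latest version is a delete marker, not a key that never existed. A script could check for that before giving up on a URL. A minimal sketch, assuming the status code and headers have already been obtained somehow (the function name `deleted_in_versioned_bucket` is made up for illustration):

```python
def deleted_in_versioned_bucket(status, headers):
    """Return True if a response looks like an S3 delete marker.

    `headers` is a plain dict of response headers; how it was obtained
    (wget, requests, urllib, ...) is left to the caller.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return status == 404 and lowered.get("x-amz-delete-marker", "").lower() == "true"


# Header values copied from the two wget runs above.
print(deleted_in_versioned_bucket(
    404,
    {"x-amz-delete-marker": "true",
     "x-amz-version-id": "ZT77s.ror9NN7Yt7bjGtH5h36leBw8Yp"},
))
print(deleted_in_versioned_bucket(200, {"x-amz-version-id": "null"}))
```

A True result only says that older versions may exist; whether any of them is still retrievable has to be checked separately.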

Since many URLs come from the versioned fcp-indi bucket, I wondered whether it would be worth removing the ambiguity and making access more robust (unless the bucket gets removed/recreated, which would invalidate versionIds) by replacing URLs with versioned URLs, like http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz?versionId=ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9 instead of http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz . datalad ls could be of help here:

$> datalad ls -aL s3://fcp-indi/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz                                                                          
Connecting to bucket: fcp-indi
[INFO   ] S3 session: Connecting to the bucket fcp-indi 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz 2016-12-04T13:20:43.000Z 4853715 ver:ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9  acl:AccessDenied  http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz?versionId=ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9 [OK]
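
Once a version id is known (e.g. from the `ver:` field in the `datalad ls` output above), pinning a URL only requires adding a `versionId` query parameter. A small sketch using only the standard library; `pin_version` is a name made up here, not a datalad function:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def pin_version(url, version_id):
    """Return `url` with its versionId query parameter set to `version_id`."""
    parts = urlsplit(url)
    # Drop any existing versionId so the result carries exactly one.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "versionId"]
    query.append(("versionId", version_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


url = ("http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/"
       "sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz")
print(pin_version(url, "ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9"))
```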
yarikoptic commented 6 years ago

a few more failing URLs besides CORR and BGSP:

addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/abide_initiative/sub-50993/ses-1/T1_rep-0.mgz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://s3.amazonaws.com/fcp-indi/data/Projects/ABIDE_Initiative/Outputs/freesurfer/5.1/NYU_0050993/mri/T1.mgz'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]
addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/gsp/sub-Sub0001_Ses1/ses-1/sub-0001_ses-01_T1w_rep-0.nii.gz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://s3.amazonaws.com/fcp-indi/data/Projects/BrainGenomicsSuperstructProject/orig_bids/sub-0001/ses-01/anat/sub-0001_ses-01_T1w.nii.gz'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]
addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/ixi/sub-71/ses-1/IXI071-Guys-0770-T1_rep-0.nii.gz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://files.osf.io/v1/resources/5h7sv/providers/osfstorage/5839bc346c613b0210294263'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]
satra commented 6 years ago

@yarikoptic - i think that would be a good idea (to use versioned URLs).

it would also be great if we knew when objects matched to each other across git-annex repos. so if i already have abide from datalad, it would be nice that openneurolab/metasearch would not duplicate files locally.

how do we make crawlers common? datalad has crawlers, metasearch has crawlers, and it seems we should be able to use datalad crawlers to generate metasearch csv.

yarikoptic commented 6 years ago

versioned urls: I guess we could help with datalad.support.s3.get_versioned_url
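
Roughly, the rewrite could be a single pass over the CSV with a pluggable resolver. This is only a sketch under assumptions: the column name `url` and the helper `rewrite_urls` are hypothetical, the example.org URL is a placeholder, and the resolver shown is a stand-in. Something like `datalad.support.s3.get_versioned_url` could be passed in its place, assuming it is called with a URL and returns a URL.

```python
import csv
import io


def rewrite_urls(src, dst, resolve, column="url"):
    """Copy CSV rows from `src` to `dst`, passing one column through `resolve`.

    Rows whose URL cannot be resolved are written unchanged, so a failed
    lookup does not lose the record. Returns the number of rewritten rows.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    changed = 0
    for row in reader:
        try:
            new_url = resolve(row[column])
        except Exception:
            new_url = row[column]  # keep the original URL on any lookup failure
        if new_url != row[column]:
            row[column] = new_url
            changed += 1
        writer.writerow(row)
    return changed


def fake_resolve(url):
    # Stand-in for demonstration only; a real resolver would query S3.
    return url + "?versionId=null"


src = io.StringIO("subject,url\nsub-1435,http://example.org/T1w.nii.gz\n")
dst = io.StringIO()
print(rewrite_urls(src, dst, fake_resolve), "row(s) rewritten")
print(dst.getvalue())
```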

matched objects: what would you expect to be done then, e.g. a symlink created into another local (e.g. abide) dataset? or the key file copied across with cp --reflink=auto? hardlinked?

common crawlers: I guess it would indeed be nice if there was some "standard" or at least "common" collection of crawlers providing data about availability/versions/etc, so different tools (metasearch, datalad, ...) could use them. Someone should look into bioCADDIE and the others, I guess.

satra commented 6 years ago

@yarikoptic - i don't know how it would work, so here are some thoughts:

let's say there is a global filesystem on my computer (could be at annex level or datalad level).

datalad config --global-store /path/to/store or git-annex config --global-store /path/to/store

each git repo has its own local store (git annex), as normal. but, git annex would point to special local remote (global store). any file that's in global store will not be copied

when i do a get, and this remote is local, it will:

  1. try to fetch from global store
  2. fetch from other location and push to global store
  3. create a link locally

if i modify the file, then:

  1. the modification is staged locally (if hard links are allowed, this is simple).
  2. it is moved to the global store on commit.
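
to make the "get" steps above concrete, here is a rough sketch in plain python. it is not how git-annex or datalad work internally; `get`, the flat key names and the `fetch` callback are all hypothetical, and the modify/commit part is left out.

```python
import os
import shutil
import tempfile


def get(key, repo_dir, global_store, fetch):
    """Make `key` available in `repo_dir`, going through a shared store.

    `fetch(key, path)` is a caller-supplied download function (hypothetical).
    """
    stored = os.path.join(global_store, key)
    if not os.path.exists(stored):
        # Not in the global store yet: fetch from elsewhere straight into it.
        os.makedirs(global_store, exist_ok=True)
        fetch(key, stored)
    # Expose the content locally without a second full copy where possible.
    os.makedirs(repo_dir, exist_ok=True)
    local = os.path.join(repo_dir, key)
    if not os.path.exists(local):
        try:
            os.link(stored, local)  # hard link: shares the data on disk
        except OSError:
            shutil.copy2(stored, local)  # e.g. store is on another filesystem
    return local


def demo_fetch(key, path):
    # Pretend download, for demonstration only.
    with open(path, "w") as handle:
        handle.write("content of " + key)


root = tempfile.mkdtemp()
store = os.path.join(root, "store")
print(get("key1", os.path.join(root, "abide"), store, demo_fetch))
print(get("key1", os.path.join(root, "metasearch"), store, demo_fetch))
```

the second call finds the key in the store and only creates the link, so the content is downloaded once even though two repos use it.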
yarikoptic commented 6 years ago

@joeyh what do you think about the above? It seems to go along with our discussion while in Montreal. Such a generic global-store could be a "web-like" special remote providing access to keys, and otherwise not being trusted etc. It could be provided by some normal local git-annex remote, which could also be registered as any other git remote, so content could be "copied to" it to populate it.

joeyh commented 6 years ago

I don't understand what you're proposing, specifically, in the context of git-annex.

-- see shy jo

yarikoptic commented 4 years ago

FTR: regarding "global-store" -- understanding was achieved and implemented at git-annex level, see https://git-annex.branchable.com/tips/local_caching_of_annexed_files/