nellh closed this issue 5 years ago
Looks good to me!
Since git-annex supports tracking the remote S3 files now, this can be accomplished with a more general solution that doesn't require the input object. Instead the validator can find these remote files if the directory passed in is a git-annex repo.
Sounds good. We should make sure that validation minimizes getting/pulling remote files. We probably should not check NIfTI headers in such a scenario. I don't think downloading JSON files can be avoided, though.
For OpenNeuro, the JSON files are never annexed; they'll always be local. Non-OpenNeuro git-annex repos could have that issue, but I think some data transfer is expected in that case.
It's the .nii and other binary files I'm worried about. They can get very large, and some tests (checking header consistency) depend on them. A good starting implementation would skip those checks to avoid data transfer. Later we can look into downloading only the first X bytes of the file to grab just the header.
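As a sketch of the header-only idea: the NIfTI-1 header is a fixed 348 bytes at the start of the file, so a ranged request (plus gunzip for .nii.gz) could fetch just enough to read the dimensions. A minimal parser for an already-decompressed, little-endian buffer (this is an illustration, not the validator's actual header code):

```javascript
// Parse the fixed-size NIfTI-1 header (348 bytes) from a Node Buffer.
// Assumes little-endian data and that the buffer is already gunzipped.
function parseNiftiHeader(buf) {
  const sizeofHdr = buf.readInt32LE(0) // always 348 for NIfTI-1
  if (sizeofHdr !== 348) throw new Error('Not a NIfTI-1 header')
  const dim = [] // dim[8]: shorts starting at byte offset 40
  for (let i = 0; i < 8; i++) dim.push(buf.readInt16LE(40 + 2 * i))
  const pixdim = [] // pixdim[8]: floats starting at byte offset 76
  for (let i = 0; i < 8; i++) pixdim.push(buf.readFloatLE(76 + 4 * i))
  return { sizeofHdr, dim, pixdim }
}
```

With this, the consistency checks that only need dimensions and voxel sizes never have to pull the full image.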
git-annex whereis and git-annex list are the two commands we could wrap to get this information.
git-annex list output:
here
|s3-PRIVATE (untrusted)
||s3-PUBLIC (untrusted)
|||web
||||bittorrent
|||||
Xxx__ sub-01/anat/sub-01_T1w.nii.gz
Xxx__ sub-01/func/sub-01_task-onebacktask_run-01_bold.nii.gz
Xxx__ sub-01/func/sub-01_task-onebacktask_run-02_bold.nii.gz
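A small wrapper could parse this output by mapping each leading column of a file row to the remote named at the same depth in the header. This sketch treats any marker other than '_' as "a copy exists there", which is an assumption about the column semantics that should be checked against the git-annex documentation:

```javascript
// Parse `git-annex list` output: a header naming one remote per line
// (each prefixed by one more '|' than the last), then one row per file
// whose leading characters line up with those remotes.
function parseAnnexList(output) {
  const lines = output.split('\n').filter(l => l.length > 0)
  const remotes = []
  let i = 0
  for (; i < lines.length; i++) {
    const name = lines[i].slice(i).trim() // strip the i leading pipes
    if (name === '') { i++; break }       // pipe-only line ends the header
    remotes.push(name.replace(/ \(untrusted\)$/, ''))
  }
  const files = {}
  for (; i < lines.length; i++) {
    const flags = lines[i].slice(0, remotes.length)
    const path = lines[i].slice(remotes.length + 1)
    // Assumption: '_' means no copy at that remote, anything else means a copy.
    files[path] = remotes.filter((_, j) => flags[j] !== '_')
  }
  return files
}
```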
git-annex whereis output:
whereis sub-01/anat/sub-01_T1w.nii.gz (1 copy)
18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]
The following untrusted locations may also have copies:
0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok
whereis sub-01/func/sub-01_task-onebacktask_run-01_bold.nii.gz (1 copy)
18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]
The following untrusted locations may also have copies:
0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok
whereis sub-01/func/sub-01_task-onebacktask_run-02_bold.nii.gz (1 copy)
18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]
The following untrusted locations may also have copies:
0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok
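The whereis output could be wrapped in a similar way: a parser that collects, for each file, the remotes shown in square brackets. A sketch against the format shown above (not production code; real output from other repos may differ):

```javascript
// Parse `git-annex whereis` output into { path: [remote, ...] }.
// Remote names appear in square brackets at the end of location lines;
// prose lines like "The following untrusted locations..." are skipped.
function parseWhereis(output) {
  const files = {}
  let current = null
  for (const line of output.split('\n')) {
    const header = line.match(/^whereis (.+) \(\d+ cop(?:y|ies)\)$/)
    if (header) {
      current = header[1]
      files[current] = []
      continue
    }
    const remote = line.match(/\[([^\]]+)\]\s*$/)
    if (remote && current) files[current].push(remote[1])
  }
  return files
}
```

A trusted-remotes filter for this issue would then keep only the entries whose remote name matches the configured S3 remotes.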
The whereis output above is from a repo without a trusted versioned S3 remote, so the output will differ for repos that have one. Only trusted S3 remotes are in scope for this issue.
I am curious, what is the status of this? I have been looking for a way to validate BIDS S3 buckets and objects.
Additionally, is there a simple way to run BIDS Apps on BIDS S3 inputs and write the output to an S3 bucket?
To be able to validate a dataset stored in S3, we need to extend the validator to accept a tree of files that includes some files only accessible by URL. The bulk of dataset content for OpenNeuro will soon be stored in an S3 bucket and not a local filesystem.
See https://github.com/OpenNeuroOrg/openneuro/issues/331
The URL list should be an ordered array of local paths and URLs to try. A file being read should check the local version first; if the local version does not exist, it should try each URL in order. Something like this:
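As a sketch of that shape (the entry fields and the fetchUrl helper are hypothetical, not the validator's actual API):

```javascript
// Hypothetical file-tree entry: local path first, then fallback URLs
// to try in order when the local copy is absent.
const files = [
  {
    path: 'sub-01/anat/sub-01_T1w.nii.gz',
    urls: [
      'file:///datalad/ds001008/sub-01/anat/sub-01_T1w.nii.gz',
      'https://example-bucket.s3.amazonaws.com/ds001008/sub-01/anat/sub-01_T1w.nii.gz',
    ],
  },
]

// Try each source in order and return the first that succeeds.
// `fetchUrl` stands in for whatever transport the validator uses.
async function readFile(entry, fetchUrl) {
  let lastError
  for (const url of entry.urls) {
    try {
      return await fetchUrl(url)
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}
```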
This could be passed to BIDS.validate as an optional argument.
@chrisfilo Thoughts on this?