bids-standard / bids-validator

Validator for the Brain Imaging Data Structure
https://bids-standard.github.io/bids-validator/
MIT License

Support validating git-annex repos with remote-only files #458

Closed. nellh closed this issue 5 years ago

nellh commented 6 years ago

To be able to validate a dataset stored in S3, we need to extend the validator to accept a tree of files that includes some files only accessible by URL. The bulk of dataset content for OpenNeuro will soon be stored in an S3 bucket and not a local filesystem.

See https://github.com/OpenNeuroOrg/openneuro/issues/331

The URL list should be an ordered array of local paths and URLs to try. When a file is read, the validator should check the local version first; if the local version does not exist, it should try each URL in order. Something like this:

{
  "localPath": "dataset_description.json",
  "remotePaths": [
    "https://openneuro.s3.amazonaws.com/ds000001/dataset_description.json?versionId=ox7Poh4aweeg8YieXohsh9", 
    "https://openneuro.org/datalad/dataset/ds000001/files/dataset_description.json"
  ]
}

This object could be passed to BIDS.validate as an optional argument.
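A minimal sketch of what that fallback logic could look like, assuming a Node.js environment with node-fetch (the resolveFile name and its error handling are illustrative, not part of the validator API):

const fs = require('fs')
const fetch = require('node-fetch')

// Prefer the local copy; if it is missing, try each remote URL in order.
async function resolveFile({ localPath, remotePaths }) {
  if (fs.existsSync(localPath)) {
    return fs.promises.readFile(localPath)
  }
  for (const url of remotePaths) {
    const response = await fetch(url)
    if (response.ok) {
      return Buffer.from(await response.arrayBuffer())
    }
  }
  throw new Error(`No readable copy found for ${localPath}`)
}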

@chrisfilo Thoughts on this?

chrisgorgo commented 6 years ago

Looks good to me!

nellh commented 6 years ago

Since git-annex now supports tracking the remote S3 files, this can be accomplished with a more general solution that doesn't require the input object. Instead, the validator can discover these remote files itself when the directory passed in is a git-annex repo.
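One simple way to detect that case (a heuristic sketch, not necessarily how the validator should do it): a git-annex repo keeps its state under .git/annex, or under annex/ in a bare repo.

const fs = require('fs')
const path = require('path')

// Heuristic: treat a directory as a git-annex repo if the annex
// metadata directory exists.
function isGitAnnexRepo(dir) {
  return (
    fs.existsSync(path.join(dir, '.git', 'annex')) ||
    fs.existsSync(path.join(dir, 'annex'))
  )
}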

chrisgorgo commented 6 years ago

Sounds good. We should make sure that validation minimizes getting/pulling remote files. We probably should not check NIfTI headers in that scenario, though I don't think downloading the JSON files can be avoided.

nellh commented 6 years ago

For OpenNeuro, the JSON files are never annexed, so they will always be local. Non-OpenNeuro git-annex repos could have that issue, but I think some data transfer is expected in that case.

chrisgorgo commented 6 years ago

It's the .nii and other binary files I'm worried about. They can get very large, and some tests (checking header consistency) depend on them. A good starting implementation would skip those checks to avoid data transfer. Later we can look into downloading only the first X bytes of the file to grab just the header.
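A rough sketch of that partial-download idea, assuming the remote honors HTTP Range requests (S3 does); note that an uncompressed NIfTI-1 header is 348 bytes, but .nii.gz files are gzip streams, so we would over-fetch a little and decompress the prefix:

const fetch = require('node-fetch')

// Hypothetical helper: fetch only the first bytes of a remote file.
async function fetchHeaderBytes(url, byteCount = 1024) {
  const response = await fetch(url, {
    headers: { Range: `bytes=0-${byteCount - 1}` },
  })
  // 206 Partial Content if the server honored the range, 200 otherwise.
  if (response.status !== 206 && response.status !== 200) {
    throw new Error(`Range request failed for ${url}: ${response.status}`)
  }
  return Buffer.from(await response.arrayBuffer())
}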

nellh commented 5 years ago

git-annex whereis and git-annex list are the two commands we could wrap to get this information (see the sketch after the example output below).

git-annex list output:

here
|s3-PRIVATE (untrusted)
||s3-PUBLIC (untrusted)
|||web
||||bittorrent
|||||
Xxx__ sub-01/anat/sub-01_T1w.nii.gz
Xxx__ sub-01/func/sub-01_task-onebacktask_run-01_bold.nii.gz
Xxx__ sub-01/func/sub-01_task-onebacktask_run-02_bold.nii.gz

git-annex whereis output:

whereis sub-01/anat/sub-01_T1w.nii.gz (1 copy) 
        18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]

  The following untrusted locations may also have copies:
        0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
        b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok
whereis sub-01/func/sub-01_task-onebacktask_run-01_bold.nii.gz (1 copy) 
        18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]

  The following untrusted locations may also have copies:
        0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
        b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok
whereis sub-01/func/sub-01_task-onebacktask_run-02_bold.nii.gz (1 copy) 
        18208862-9e4b-4868-bce0-6ccd8188f629 -- root@10e6c1213b8d:/datalad/ds001008 [here]

  The following untrusted locations may also have copies:
        0c257049-e243-43e8-aa16-23b73157520d -- [s3-PUBLIC]
        b5089f7d-8406-4398-a88a-750cd03a8e56 -- [s3-PRIVATE]
ok

The whereis output above is from a repo without a trusted versioned S3 remote, so the output will differ for those remotes. Only trusted S3 remotes are in scope for this issue.
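To avoid parsing the human-readable output above, a wrapper could use the --json flag that git-annex commands accept. A sketch, with the caveat that the exact JSON field names may vary between git-annex versions:

const { execFile } = require('child_process')

// Run `git annex whereis --json <path>` and return one parsed record
// per line of output.
function annexWhereis(repoDir, relativePath) {
  return new Promise((resolve, reject) => {
    execFile(
      'git',
      ['annex', 'whereis', '--json', relativePath],
      { cwd: repoDir },
      (err, stdout) => {
        if (err) return reject(err)
        resolve(
          stdout
            .trim()
            .split('\n')
            .map(line => JSON.parse(line))
        )
      }
    )
  })
}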

tjhendrickson commented 5 years ago

I am curious, what is the status of this? I have been looking for a way to validate BIDS S3 buckets and objects.

Additionally, is there a simple way to run BIDS-Apps on BIDS S3 inputs and write the output to an S3 bucket?