dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

validate bids datasets #74

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

Prompted by @satra in Slack.

We need to refactor our interface/approach to support that, since at the moment dandischema validation is purely against the schema, that is, against the pydantic models. Properties of the current approach:
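
As a rough illustration of the schema-only nature of the current approach (a minimal sketch, assuming the dandischema pydantic models such as `dandischema.models.Dandiset`; the metadata fields below are illustrative and deliberately incomplete):

```python
# A minimal sketch: metadata is checked against the pydantic models only,
# with no access to the actual files in the dandiset.
from pydantic import ValidationError

from dandischema.models import Dandiset

meta = {"name": "My dandiset", "description": "..."}  # illustrative, incomplete record
try:
    Dandiset(**meta)  # validates structure/types of the metadata record only
except ValidationError as e:
    print(e)
```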

If we were to stay in pure "Python land" and just use the bids-validator Python module, its usefulness would be quite limited: I think it can only check whether filenames conform to BIDS (which is good for P1, since we cannot even access the data). And I don't think it even uses the WiP stock BIDS schema yet: https://github.com/bids-standard/bids-specification/tree/master/src/schema . AFAIK we also do not yet have a stable Python library providing an API that would load/use that stock schema for validation (relevant original discussion: https://github.com/bids-standard/bids-specification/issues/543 ).
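
A minimal sketch of that filename-only check: `BIDSValidator.is_bids()` takes a single path relative to the dataset root and matches it against BIDS naming rules without reading any file content.

```python
# Filename-only validation with the bids-validator Python package;
# paths are given relative to the dataset root, with a leading "/".
from bids_validator import BIDSValidator

validator = BIDSValidator()
print(validator.is_bids("/sub-01/anat/sub-01_T1w.nii.gz"))   # True  (BIDS-compliant name)
print(validator.is_bids("/sub-01/anat/sub-01_bogus.nii.gz"))  # False (invalid suffix)
```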

pybids uses that bids_validator module solely for that purpose, though I guess it does more internal checks while constructing the layout (a usage sketch follows the dependency list below). But even if pybids provided some extra validation power, it is quite a heavy dependency:

list of stuff `pip install pybids` would pull on top of dandischema:

```shell
$> pip install pybids
Collecting pybids
  Using cached pybids-0.13.1-py3-none-any.whl (3.2 MB)
Collecting pandas>=0.23
  Using cached pandas-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting sqlalchemy<1.4.0.dev0
  Using cached SQLAlchemy-1.3.24-cp39-cp39-manylinux2010_x86_64.whl (1.3 MB)
Collecting numpy
  Using cached numpy-1.21.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Collecting scipy
  Using cached scipy-1.7.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.5 MB)
Collecting click
  Using cached click-8.0.1-py3-none-any.whl (97 kB)
Collecting bids-validator
  Using cached bids_validator-1.8.0-py2.py3-none-any.whl (19 kB)
Collecting nibabel>=2.1
  Using cached nibabel-3.2.1-py3-none-any.whl (3.3 MB)
Collecting num2words
  Using cached num2words-0.5.10-py3-none-any.whl (101 kB)
Collecting patsy
  Using cached patsy-0.5.1-py2.py3-none-any.whl (231 kB)
Collecting packaging>=14.3
  Using cached packaging-21.0-py3-none-any.whl (40 kB)
Collecting pyparsing>=2.0.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: six>=1.5 in ./venvs/dev3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.23->pybids) (1.16.0)
Collecting docopt>=0.6.2
  Using cached docopt-0.6.2-py2.py3-none-any.whl
Installing collected packages: pyparsing, pytz, python-dateutil, packaging, numpy, docopt, sqlalchemy, scipy, patsy, pandas, num2words, nibabel, click, bids-validator, pybids
```
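
For reference, a minimal sketch of how pybids would be used if we accepted that dependency; `BIDSLayout` indexes a dataset on disk and, with `validate=True`, applies the bids_validator filename checks while building the layout (the path below is a placeholder).

```python
# Sketch only: pybids needs the dataset on a local filesystem to build its layout.
from bids import BIDSLayout

layout = BIDSLayout("/data/bids-dandiset", validate=True)  # placeholder path
print(layout.get_subjects())
```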

Even if we decided to just use the "official" JS bids-validator, I don't think we would have much luck, since AFAIK it needs the content of at least the .json sidecar files (well, I guess those could be fetched, but the sheer number of them might make that expensive for some datasets). So I am not sure that option is easy to realize for remote execution (on the dandi-api server) either, short of developing FUSE mount support for dandisets based on our /assets listing and feeding that to bids-validator.

Altogether, I do not see an easy way to support "ultimate" BIDS validation, but I feel we can get quite far if we use the official WiP schema, even if for starters just to ensure file naming/presence compliant with the dataset's BIDS version. We might want to instigate/contribute toward https://github.com/bids-standard/bids-specification/issues/543 (I will follow up there).

But I also think we might need to add a more explicit indicator (not just an auto-summarized mention) in the dandiset-level metadata (a dedicated metadata entry) that the dataset is a BIDS dataset.

To address P2, I think we would need to "couple" the notion of a Dandiset and its asset(s) for validation, and most likely validate a BIDS dandiset as a whole for the purpose of BIDS validation, not per asset/file.

yarikoptic commented 1 year ago

Note: paths of BIDS datasets can now be validated using bidsschematools, as done in dandi-cli; no need for heavy pybids. Overall path validation: #157
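
A minimal sketch of that schema-based path validation, assuming the `validate_bids()` entry point in `bidsschematools.validator` used by dandi-cli; the exact signature and result keys may differ across bidsschematools versions, and the path below is a placeholder.

```python
# Validate dataset paths against the schema-derived filename rules,
# without needing file contents.
from bidsschematools.validator import validate_bids

result = validate_bids(["/data/bids-dandiset"])  # placeholder path
# Paths that did not match any schema-derived pattern are typically collected
# under "path_tracking" and can be surfaced as validation errors.
print(result.get("path_tracking", []))
```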