dandi / dandi-schema

Schemata for DANDI archive project
Apache License 2.0

validate bids datasets #74

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

Prompted by @satra in Slack.

We need to refactor our interface/approach to support that, since at the moment dandischema validation is purely against the schema, that is, against the pydantic models. Properties of the current approach:
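
As a rough illustration of the schema-only nature of the current approach (a minimal sketch, assuming the dandischema pydantic models such as `dandischema.models.Dandiset`; the metadata fields below are illustrative and deliberately incomplete):

```python
# A minimal sketch: metadata is checked against the pydantic models only,
# with no access to the actual files in the dandiset.
from pydantic import ValidationError

from dandischema.models import Dandiset

meta = {"name": "My dandiset", "description": "..."}  # illustrative, incomplete record
try:
    Dandiset(**meta)  # validates structure/types of the metadata record only
except ValidationError as e:
    print(e)
```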

If we were to stay in pure "Python land" and just use the bids-validator Python module, its usefulness would be quite limited: I think it can only check whether filenames conform to BIDS (which is good for P1, since we cannot even access the data). And I don't think it even uses the WiP stock BIDS schema yet: https://github.com/bids-standard/bids-specification/tree/master/src/schema . AFAIK we also do not yet have a stable Python library providing an API that would load/use that stock schema for validation (relevant original discussion: https://github.com/bids-standard/bids-specification/issues/543 ).
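
A minimal sketch of that filename-only check: `BIDSValidator.is_bids()` takes a single path relative to the dataset root and matches it against BIDS naming rules without reading any file content.

```python
# Filename-only validation with the bids-validator Python package;
# paths are given relative to the dataset root, with a leading "/".
from bids_validator import BIDSValidator

validator = BIDSValidator()
print(validator.is_bids("/sub-01/anat/sub-01_T1w.nii.gz"))   # True  (BIDS-compliant name)
print(validator.is_bids("/sub-01/anat/sub-01_bogus.nii.gz"))  # False (invalid suffix)
```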

pybids uses that bids_validator module solely for that purpose, though I guess it does more internal checks while constructing the layout (a usage sketch follows the dependency list below). But even if pybids provided some extra validation power, it is quite a heavy dependency:

list of stuff `pip install pybids` would pull on top of dandischema:

```shell
$> pip install pybids
Collecting pybids
  Using cached pybids-0.13.1-py3-none-any.whl (3.2 MB)
Collecting pandas>=0.23
  Using cached pandas-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting sqlalchemy<1.4.0.dev0
  Using cached SQLAlchemy-1.3.24-cp39-cp39-manylinux2010_x86_64.whl (1.3 MB)
Collecting numpy
  Using cached numpy-1.21.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.8 MB)
Collecting scipy
  Using cached scipy-1.7.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.5 MB)
Collecting click
  Using cached click-8.0.1-py3-none-any.whl (97 kB)
Collecting bids-validator
  Using cached bids_validator-1.8.0-py2.py3-none-any.whl (19 kB)
Collecting nibabel>=2.1
  Using cached nibabel-3.2.1-py3-none-any.whl (3.3 MB)
Collecting num2words
  Using cached num2words-0.5.10-py3-none-any.whl (101 kB)
Collecting patsy
  Using cached patsy-0.5.1-py2.py3-none-any.whl (231 kB)
Collecting packaging>=14.3
  Using cached packaging-21.0-py3-none-any.whl (40 kB)
Collecting pyparsing>=2.0.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: six>=1.5 in ./venvs/dev3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas>=0.23->pybids) (1.16.0)
Collecting docopt>=0.6.2
  Using cached docopt-0.6.2-py2.py3-none-any.whl
Installing collected packages: pyparsing, pytz, python-dateutil, packaging, numpy, docopt, sqlalchemy, scipy, patsy, pandas, num2words, nibabel, click, bids-validator, pybids
```
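
For reference, a minimal sketch of how pybids would be used if we accepted that dependency; `BIDSLayout` indexes a dataset on disk and, with `validate=True`, applies the bids_validator filename checks while building the layout (the path below is a placeholder).

```python
# Sketch only: pybids needs the dataset on a local filesystem to build its layout.
from bids import BIDSLayout

layout = BIDSLayout("/data/bids-dandiset", validate=True)  # placeholder path
print(layout.get_subjects())
```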

Even if we decided to just use the "official" JS bids-validator, I don't think we would have much luck, since AFAIK it needs the content of at least the .json sidecar files (well, I guess those could be fetched, but the sheer number of them might make that expensive for some datasets). So I am not sure that option is easy to realize for remote execution (on the dandi-api server) either, short of developing FUSE mount support for dandisets based on our /assets listing and feeding that to bids-validator.

Altogether, I do not see an easy way to support "ultimate" BIDS validation, but I feel we can get quite far if we use the official WiP schema, even if for starters just to ensure file naming/presence compliant with the dataset's BIDS version. We might want to instigate/contribute toward https://github.com/bids-standard/bids-specification/issues/543 (I will follow up there).

But I also think we might need to add a more explicit indicator (not just an auto-summarized mention) in the dandiset-level metadata (a dedicated metadata entry) that the dataset is a BIDS dataset.

To address P2, I think we would need to "couple" the notion of a Dandiset and its asset(s) for validation, and most likely validate a BIDS dandiset as a whole for the purpose of BIDS validation, not per asset/file.

yarikoptic commented 1 year ago

Note: paths of BIDS datasets can now be validated using bidsschematools, as done in dandi-cli; no need for heavy pybids. Overall path validation: #157
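
A minimal sketch of that schema-based path validation, assuming the `validate_bids()` entry point in `bidsschematools.validator` used by dandi-cli; the exact signature and result keys may differ across bidsschematools versions, and the path below is a placeholder.

```python
# Validate dataset paths against the schema-derived filename rules,
# without needing file contents.
from bidsschematools.validator import validate_bids

result = validate_bids(["/data/bids-dandiset"])  # placeholder path
# Paths that did not match any schema-derived pattern are typically collected
# under "path_tracking" and can be surfaced as validation errors.
print(result.get("path_tracking", []))
```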