`rawdata/` is described in text but not part of the schema

yarikoptic commented 3 months ago

Inspired by

https://github.com/bids-standard/bids-specification/pull/1734

with some relevant prior discussion found in

https://github.com/bids-standard/bids-specification/issues/1009

apparently only the text talks about rawdata/ while src/schema does not have references beyond mentioning in the tests (@TheChymera could you clarify on either it is specific to rawdata/ there or could be another path):

❯ git grep rawdata
src/CHANGES.md:-   \[FIX] typo in "rawdata" example [#1045](https://github.com/bids-standard/bids-specification/pull/1045) ([sappelhoff](https://github.com/sappelhoff))
src/common-principles.md:            "rawdata": {
src/common-principles.md:`rawdata`, **only the `rawdata` subdirectory** needs to be a BIDS-compliant
src/common-principles.md:`derivatives`, or `rawdata` directory names.
src/common-principles.md:  "SourceDatasets": [ {"URL": "../../rawdata/"} ]
tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log:   * `/home/chymera/.data2/datalad/000026/rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz
tools/schemacode/bidsschematools/tests/test_validator.py:        "rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz"
tools/schemacode/bidsschematools/tests/test_validator.py:        "rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz"
tools/schemacode/bidsschematools/validator.py:        bids_paths = '~/.data2/datalad/000026/rawdata'

IMHO we should fix src/schema to "specify/allow" for rawdata/ in BIDS derivative datasets (not in raw)

effigies commented 3 months ago

I see: You want sourcedata/ to be actual pre-BIDS and rawdata/ to be a BIDS dataset inside a BIDS-derivative dataset.

This feels like an unnecessary complication. sourcedata/ and derivatives/ are always relative to the current dataset and mark a boundary. That feels like enough, and specifying some directories that serve the same purpose for different dataset types does not sound fun to try to express in schema and implement in a validator.

Also, in practice, many derivatives may have multiple "sources", such as the raw dataset and some preprocessing, e.g.,

ds000001-fmriprep/  # BIDS Derivative
    sourcedata/
        ds000001/  # BIDS
        freesurfer/  # Neither BIDS nor BIDS derivative, but still an input to my derivative

If we change sourcedata/ds000001 to rawdata/ what should we do with sourcedata/freesurfer?

yarikoptic commented 3 months ago

I see: You want sourcedata/ to be actual pre-BIDS and rawdata/ to be a BIDS dataset inside a BIDS-derivative dataset

not really -- I am just trying to ensure that schema matches the text we have. We should fix one or another. If not in BIDS 1.0, we should likely remove rawdata/ in BIDS 2.0.

I am with you generally that sourcedata/ should be enough and preferable for the reasons you stated. FWIW

rawdata/ in spec text was added as long ago as of ebe69efd30a83f17970a3638fc23387d28402e5a .
among openneuro datasets there is one with some rawdata/ ;) https://github.com/OpenNeuroDatasets/ds003563

effigies commented 3 months ago

So you want to remove text saying that you can place a raw BIDS dataset next to its source and derivatives, instead of nested? I don't understand why this example is a problem.

yarikoptic commented 3 months ago

I want standard to not use in examples arbitrary folder names on the top level which are not described by the standard. So I think we either

dissolve rawdata/ (in bids 2.0? 1.0?)
standardize it (add to allowed top level folders, possibly restricting for a derivative or "study" level dataset, #1739) -- this issue.

Otherwise that example is a "bad example" to be in a standard. We should not give examples with arbitrary folders being populated with anything, unless we allow for that (e.g. as subfolders of code/, sourcedata/ etc) .

TheChymera commented 3 months ago

I haven't looked much into subdirectories, since I think it's really bad to have them nested, and always try to keep them separate. Why not just have independent datasets which reference their parents in the dataset_description.json file? It seems so much simpler, and would also solve the many-to-many correspondence between raw and derivatives. That would also be coherent with not constraining names for top-level directories, which I also think is a good idea.

yarikoptic commented 3 months ago

@TheChymera thanks, but could you clarify on either it is specific to rawdata/ there or could be another path in the test?

TheChymera commented 3 months ago

@yarikoptic not sure I understand the question, but are you asking whether that's a random string for which “rawdata” just happens to be the value? If so, yes:

running test + diff with random other name used

```console [deco]~/src/bids-specification/tools/schemacode ❱ git rev-parse HEAD 6d13c807c7c8d594a35d302d9aa6cff7c802f054 [deco]~/src/bids-specification/tools/schemacode ❱ pytest bidsschematools/tests/test_validator.py =========================================================== test session starts ============================================================ platform linux -- Python 3.11.8, pytest-7.4.4, pluggy-1.4.0 rootdir: /home/chymera/src/bids-specification/tools/schemacode configfile: pyproject.toml plugins: mock-3.12.0, pyfakefs-5.3.5, rerunfailures-14.0, anyio-4.2.0, pkgcore-0.12.24 collected 20 items bidsschematools/tests/test_validator.py .................... [100%] ============================================================ 20 passed in 4.34s ============================================================ [deco]~/src/bids-specification/tools/schemacode ❱ ag rawdata -l0 | xargs -0 sed -i -e "s/rawdata/tralaladata/g" [deco]~/src/bids-specification/tools/schemacode ❱ pytest bidsschematools/tests/test_validator.py =========================================================== test session starts ============================================================ platform linux -- Python 3.11.8, pytest-7.4.4, pluggy-1.4.0 rootdir: /home/chymera/src/bids-specification/tools/schemacode configfile: pyproject.toml plugins: mock-3.12.0, pyfakefs-5.3.5, rerunfailures-14.0, anyio-4.2.0, pkgcore-0.12.24 collected 20 items bidsschematools/tests/test_validator.py .................... [100%] ============================================================ 20 passed in 5.20s ============================================================ [deco]~/src/bids-specification/tools/schemacode ❱ git --no-pager diff diff --git a/tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log b/tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log index 4d410c3c..a8533d90 100644 --- a/tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log +++ b/tools/schemacode/bidsschematools/tests/data/expected_bids_validator_xs_write.log @@ -3,5 +3,5 @@ SUMMARY: 0 out of 1 files were successfully validated, using the following regular expressions: - `.*?/sub-(?P[0-9a-zA-Z]+)/(|ses-(?P[0-9a-zA-Z]+)/)anat/sub-(?P=subject)(|_ses-(?P=session))(|_acq-(?P[0-9a-zA-Z]+))(|_ce-(?P[0-9a-zA-Z]+))(|_rec-(?P[0-9a-zA-Z]+))(|_run-(?P[0-9a-zA-Z]+))(|_part-(?P(mag|phase|real|imag)))_(T1w|T2w|PDw|T2starw|FLAIR|inplaneT1|inplaneT2|PDT2|angio|T2star)\.(nii.gz|nii|json)$` The following files were not matched by any regex schema entry: - * `/home/chymera/.data2/datalad/000026/rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz + * `/home/chymera/.data2/datalad/000026/tralaladata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz The following mandatory regex schema entries did not match any files: diff --git a/tools/schemacode/bidsschematools/tests/test_validator.py b/tools/schemacode/bidsschematools/tests/test_validator.py index fd808fb7..37ae0423 100644 --- a/tools/schemacode/bidsschematools/tests/test_validator.py +++ b/tools/schemacode/bidsschematools/tests/test_validator.py @@ -64,11 +64,11 @@ def test_write_report(tmp_path): ] validation_result["path_tracking"] = [ "/home/chymera/.data2/datalad/000026/" - "rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz" + "tralaladata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz" ] validation_result["path_listing"] = [ "/home/chymera/.data2/datalad/000026/" - "rawdata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz" + "tralaladata/sub-EXC022/anat/sub-EXC022_ses-MRI_flip-1_VFA.nii.gz" ] report_path = tmp_path / "output_bids_validator_xs_write.log" diff --git a/tools/schemacode/bidsschematools/validator.py b/tools/schemacode/bidsschematools/validator.py index 34621a8c..5a73bca3 100644 --- a/tools/schemacode/bidsschematools/validator.py +++ b/tools/schemacode/bidsschematools/validator.py @@ -594,7 +594,7 @@ def validate_bids( :: from bidsschematools import validator - bids_paths = '~/.data2/datalad/000026/rawdata' + bids_paths = '~/.data2/datalad/000026/tralaladata' validator.validate_bids(bids_paths) Notes ```

edit: @yarikoptic took liberty to wrap long example/diff into <details></details>

yarikoptic commented 3 months ago

If so, yes:

"yes" as "it was a random choice and could be anything else", correct?

Then we should also fix the test if we decide that rawdata is part of the specification/legit.

bids-standard / bids-specification

`rawdata/` is described in text but not part of the schema #1736