[FIX] Move `rawdata/` into `sourcedata/raw` in alternative structure example, clarify on naming of datasets themselves

yarikoptic commented 3 months ago

Individual commits have more rationale. I can split into multiple PRs but let's see - may be this could provide us a closure. Reflecting my thoughts on

https://github.com/bids-standard/bids-specification/pull/1734/files#r1534475631

Closes #1736
Supersedes #1734
Closes #1519

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 87.79%. Comparing base (4c642bd) to head (1c029c6). Report is 53 commits behind head on master.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #1741 +/- ## ========================================== - Coverage 87.93% 87.79% -0.15% ========================================== Files 16 16 Lines 1351 1360 +9 ========================================== + Hits 1188 1194 +6 - Misses 163 166 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Remi-Gau commented 3 months ago

What the example looks like

yarikoptic commented 3 months ago

What the example looks like

anal me notices now that dataset_description.json in that example is "inconsistent" it its ordering with directories...

apparently we do not have full consistency but mostly list it AFTER the folders... but then logically it might need ... before if there are sub- folders, and after to signal that more files are possible...

```shell src/appendices/qmri.md- { src/appendices/qmri.md- "ds-example": { src/appendices/qmri.md- "derivatives": { src/appendices/qmri.md- "qMRLab": { src/appendices/qmri.md: "dataset_description.json": "", src/appendices/qmri.md- "sub-01": { src/appendices/qmri.md- "anat": { src/appendices/qmri.md- "sub-01_T1map.nii.gz": "", src/appendices/qmri.md- "sub-01_T1map.json": "", -- src/common-principles.md- "my_dataset-1": { src/common-principles.md- "sourcedata": { src/common-principles.md- "dicoms": {}, src/common-principles.md- "raw": { src/common-principles.md: "dataset_description.json": "", src/common-principles.md- "sub-01": {}, src/common-principles.md- "sub-02": {}, src/common-principles.md- "...": "", src/common-principles.md- }, -- src/common-principles.md- "pipeline_1": {}, src/common-principles.md- "pipeline_2": {}, src/common-principles.md- "...": "", src/common-principles.md- }, src/common-principles.md: "dataset_description.json": "", src/common-principles.md- "...": "", src/common-principles.md- } src/common-principles.md- } src/common-principles.md-) }} -- src/common-principles.md- }, src/common-principles.md- "sub-01": {}, src/common-principles.md- "sub-02": {}, src/common-principles.md- "...": "", src/common-principles.md: "dataset_description.json": "", src/common-principles.md- } src/common-principles.md- } src/common-principles.md- ) }} src/common-principles.md- -- src/common-principles.md- }, src/common-principles.md- "derivatives": {}, src/common-principles.md- "README": "", src/common-principles.md- "participants.tsv": "", src/common-principles.md: "dataset_description.json": "", src/common-principles.md- "CHANGES": "", src/common-principles.md- } src/common-principles.md-) }} src/common-principles.md- -- src/derivatives/common-data-types.md- "raw/": { src/derivatives/common-data-types.md- "CHANGES": "", src/derivatives/common-data-types.md- "README": "", src/derivatives/common-data-types.md- "channels.tsv": "", src/derivatives/common-data-types.md: "dataset_description.json": "", src/derivatives/common-data-types.md- "participants.tsv": "", src/derivatives/common-data-types.md- "sub-001": { src/derivatives/common-data-types.md- "eeg": { src/derivatives/common-data-types.md- "sub-001_task-listening_events.tsv": "", -- src/longitudinal-and-multi-site-studies.md- }, src/longitudinal-and-multi-site-studies.md- "sub-control01_sessions.tsv": "", src/longitudinal-and-multi-site-studies.md- }, src/longitudinal-and-multi-site-studies.md- "participants.tsv": "", src/longitudinal-and-multi-site-studies.md: "dataset_description.json": "", src/longitudinal-and-multi-site-studies.md- "README": "", src/longitudinal-and-multi-site-studies.md- "CHANGES": "", src/longitudinal-and-multi-site-studies.md- } src/longitudinal-and-multi-site-studies.md-) }} ```

I will at least unify for that now in a separate commit (easy to drop if we decide not worth it)

yarikoptic commented 3 months ago

note: failed to upload coverage, I restarted that pipeline to get it hopefully green.

yarikoptic commented 2 months ago

yes, it is all green. Waiting for more blessings. @effigies @tsalo -- WDYT?

effigies commented 2 months ago

This feels more confusing to me.

└─ my_dataset-1/
   ├─ sourcedata/
   │  ├─ dicoms/
   │  ├─ raw/
   │  │  ├─ sub-01/
   │  │  ├─ sub-02/
   │  │  ├─ ... 
   │  │  └─ dataset_description.json 
   │  └─ ... 
   ├─ derivatives/
   │  ├─ pipeline_1/
   │  ├─ pipeline_2/
   │  └─ ... 
   ├─ dataset_description.json 
   └─ ...

Why is there now a dataset_description.json? What are the contents of this outer dataset apart from sourcedata/ and derivatives/?

I still don't understand why what is present is confusing, but maybe it would be better just to remove it. I would like to move to a Diataxis approach, where we clearly separate out the helpful hints and how-tos from the reference material. Maybe it could be revived in a how-to guide in a way that causes fewer problems, if there's any value to it.

yarikoptic commented 2 months ago

└─ my_dataset-1/ ... Why is there now a dataset_description.json? What are the contents of this outer dataset apart from sourcedata/ and derivatives/?

IMHO in BIDS specification we SHOULD only talk about BIDS datasets. We SHOULD NOT talk about arbitrary conventions which could exist outside of BIDS specification. Hence IMHO this example SHOULD be a BIDS dataset as well -- AFAIK nothing in BIDS specification would state that it is an illegitimate BIDS dataset, or am I wrong?

I still don't understand why what is present is confusing, but maybe it would be better just to remove it. I would like to move to a Diataxis approach, where we clearly separate out the helpful hints and how-tos from the reference material. Maybe it could be revived in a how-to guide in a way that causes fewer problems, if there's any value to it.

IMHO it is useful to give users guidance. My problem with current formulation is that it is misleading. This PR IMHO clarifies it, unless it is indeed "not BIDS compliant".

effigies commented 2 months ago

Hence IMHO this example SHOULD be a BIDS dataset as well -- AFAIK nothing in BIDS specification would state that it is an illegitimate BIDS dataset, or am I wrong?

I believe the validator would currently complain that there are no validatable files. You can argue that's wrong, but from an OpenNeuro perspective, I would want to reject such a dataset as it's effectively a blanket .bidsignore. But we would also reject what you're replacing, since it wasn't a BIDS dataset, but a layout demonstrating one way to include BIDS datasets alongside their sources and derivatives.

IMHO it is useful to give users guidance. My problem with current formulation is that it is misleading. This PR IMHO clarifies it, unless it is indeed "not BIDS compliant".

I find it unhelpful at best, as it drops everything but a dataset_description.json and two opaque directories that could potentially hold nested datasets. This is why I think that a how-to guide would be better, where it discusses these options in detail, and not normatively.

yarikoptic commented 2 months ago

there are no validatable files

wrong. There is dataset_description.json. Might be a bug in bids-validator then ;-)

You can argue that's wrong, but from an OpenNeuro perspective, I would want to reject such a dataset as it's effectively a blanket .bidsignore.

as we discussed -- this sounds like an archive specific desire/behavior.

But we would also reject what you're replacing, since it wasn't a BIDS dataset, but a layout demonstrating one way to include BIDS datasets alongside their sources and derivatives.

ok, let's meet half-way: what if I remove dataset_specification.json for now , and mostly revert that added statement that this is a valid bids dataset? and then propose separate follow up PR on that particular aspect since I think we would want also some explicit DatasetType = "study" or alike (e.g. "project" to possibly avoid conflict with BEP035) for such cases.

Then we stay with "convention"

effigies commented 2 months ago

Sure, if we drop dataset_description.json and ... and do not say or imply that the outer dataset is a BIDS dataset, then this is basically the same as before. I suppose that's fine.

yarikoptic commented 2 months ago

To expedite and ease re-reviewing -- I committed that final form of suggestions .

sappelhoff commented 2 months ago

This is what the example now looks like:

https://bids-specification--1741.org.readthedocs.build/en/1741/common-principles.html#source-vs-raw-vs-derived-data

grafik

I personally find it confusing that the BIDS raw data is now listed under sourcedata, because the way "people" usually think about the directories is:

sourcedata (what ever was recorded)
BIDS formatted raw data (converted from sourcedata)
derivatives (obtained from BIDS formatted raw data)

nesting "2" under "1" is technically possible, but I find it unhelpful/weird.

Having that said, @Remi-Gau and @effigies seem to have discussed this at length / thought about it, so I will defer to them.

yarikoptic commented 2 months ago

there is no 'ideal' setup per se, but if datalad used, then 2. (sourcedata/raw) could also have (possibly subset of only relevant ones) of sourcedata/ (like sourcedata/dicoms) be installed into its own sourcedata/raw/sourcedata . The idea here is just to make it all "flat" and separated only to sourcedata/ (bids or not) and derivatives/ (what computed from those).

yarikoptic commented 2 months ago

@effigies @sappelhoff so are we go on this one or more tune ups?

bids-standard / bids-specification

[FIX] Move `rawdata/` into `sourcedata/raw` in alternative structure example, clarify on naming of datasets themselves #1741

Codecov Report