bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

schema: Define extensions exclusivity/composition? #1047

Open yarikoptic opened 2 years ago

yarikoptic commented 2 years ago

While looking at https://github.com/bids-standard/bids-specification/pull/1033/files#diff-e1391ae7ff69f13355ee975c7fafc3414020f8ed7f3d26bfa92a4381429f51a0L4 and commenting at https://github.com/bids-standard/bids-specification/pull/1033/files#r838617368, where, as in many other places, we have "alternative" extensions for a file (so only one extension should be used), I also saw

participants:
  required: false
  extensions:
  - .tsv
  - .json

where multiple extensions are allowed at once, or even "worse": having .json makes sense only if there is a .tsv (unless, for inheritance, we have a .json at the top level for some .tsv files further down the hierarchy). I also looked up curious cases like

src/schema/rules/datatypes/anat.yaml:  extensions:
src/schema/rules/datatypes/anat.yaml-  - .nii.gz
src/schema/rules/datatypes/anat.yaml-  - .nii
src/schema/rules/datatypes/anat.yaml-  - .json

where, I think, it is not allowed to have both .nii.gz and .nii (while, similarly to the above, it is OK to accompany either one with a .json).

I wondered how we could encode all that in the schema. My initial thought is to separate sidecars into a dedicated sidecar_extensions key, so it would look like

src/schema/rules/datatypes/anat.yaml:  extensions:
src/schema/rules/datatypes/anat.yaml-  - .nii.gz
src/schema/rules/datatypes/anat.yaml-  - .nii
src/schema/rules/datatypes/anat.yaml-  sidecar_extensions:
src/schema/rules/datatypes/anat.yaml-  - .json

This would imply that extensions is "one of", while sidecar_extensions lists those that could accompany a file, iff a file with one of the extensions: is available. But would it then also be "one of" within sidecar_extensions? Do we have any counter use case that would render the above suggestion invalid?
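To make the proposed semantics concrete, here is a minimal sketch of how a validator might read such a rule. It is only an illustration under assumptions: `check_exclusivity` and `split_extension` are hypothetical helpers, the file names are made up, and inheritance (a .json higher up the hierarchy) is deliberately ignored.

```python
from collections import defaultdict

# Hypothetical rule in the proposed layout; `sidecar_extensions` is the key
# suggested in this issue, not something the schema currently defines.
ANAT_RULE = {
    "extensions": [".nii.gz", ".nii"],
    "sidecar_extensions": [".json"],
}


def split_extension(filename, known_extensions):
    """Split a filename into (stem, extension), preferring the longest match."""
    for ext in sorted(known_extensions, key=len, reverse=True):
        if filename.endswith(ext):
            return filename[: -len(ext)], ext
    return filename, ""


def check_exclusivity(filenames, rule):
    """Return a list of problems under the "one of" reading of `extensions`."""
    data_exts = rule["extensions"]
    sidecar_exts = rule.get("sidecar_extensions", [])

    # Group the extensions that were found by the shared file stem.
    found = defaultdict(set)
    for name in filenames:
        stem, ext = split_extension(name, data_exts + sidecar_exts)
        found[stem].add(ext)

    problems = []
    for stem, exts in sorted(found.items()):
        data = sorted(e for e in exts if e in data_exts)
        sidecars = sorted(e for e in exts if e in sidecar_exts)
        if len(data) > 1:
            problems.append(f"{stem}: mutually exclusive extensions {data}")
        if sidecars and not data:
            # Ignores inheritance: a top-level .json may legitimately apply
            # to data files further down the hierarchy.
            problems.append(f"{stem}: sidecar {sidecars} without a data file")
    return problems
```

Whether sidecar_extensions should itself be "one of" is left open here, as in the question above; the sketch allows any number of sidecars per data file.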

WDYT (attn @bids-standard/schema -- the Team I just initiated, feel welcome to invite/add more people if I forgot anyone)?

tsalo commented 2 years ago

I agree that we should figure out how to distinguish extensions that form sets from ones that do not (and are thus mutually exclusive). I think JSON is still a tough case because we have to figure out inheritance in the schema.

Are there any cases where a non-JSON data file can't have a sidecar JSON file? If there aren't, then I think we should probably remove .json from the list of extensions for non-JSON-based data files, and then just add .json as a special case in the rendering/validation code.

effigies commented 2 years ago

It could be worth abstracting the idea of sidecar that could apply to any file. BEP-027 is proposing that any file could have a .prov.jsonld sidecar.

yarikoptic commented 2 years ago

> It could be worth abstracting the idea of sidecar that could apply to any file. BEP-027 is proposing that any file could have a .prov.jsonld sidecar.

good idea IMHO!

> Are there any cases where a non-JSON data file can't have a sidecar JSON file?

We should check programmatically. Quickly looking at the grep output, here are some hits we might indeed want to "fix" by making the sidecar generally applicable:

```shell
src/schema/rules/datatypes/eeg.yaml-- suffixes:
src/schema/rules/datatypes/eeg.yaml-  - photo
src/schema/rules/datatypes/eeg.yaml:  extensions:
src/schema/rules/datatypes/eeg.yaml-  - .jpg
src/schema/rules/datatypes/ieeg.yaml-- suffixes:
src/schema/rules/datatypes/ieeg.yaml-  - photo
src/schema/rules/datatypes/ieeg.yaml:  extensions:
src/schema/rules/datatypes/ieeg.yaml-  - .jpg
src/schema/rules/datatypes/micr.yaml-- suffixes:
src/schema/rules/datatypes/micr.yaml-  - photo
src/schema/rules/datatypes/micr.yaml:  extensions:
src/schema/rules/datatypes/micr.yaml-  - .jpg
src/schema/rules/datatypes/micr.yaml-  - .png
src/schema/rules/datatypes/micr.yaml-  - .tif
src/schema/rules/datatypes/meg.yaml-- suffixes:
src/schema/rules/datatypes/meg.yaml-  - markers
src/schema/rules/datatypes/meg.yaml:  extensions:
src/schema/rules/datatypes/meg.yaml-  - .sqd
src/schema/rules/datatypes/meg.yaml-  - .mrk
src/schema/rules/datatypes/perf.yaml-- suffixes:
src/schema/rules/datatypes/perf.yaml-  - asllabeling
src/schema/rules/datatypes/perf.yaml:  extensions:
src/schema/rules/datatypes/perf.yaml-  - .jpg
.... may be more ...
```

NB -- note inconsistency for _photo across datatypes
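A rough sketch of what the programmatic check could look like, instead of eyeballing grep output. Assumptions: the rules are already parsed into plain dicts (in practice they would be loaded from src/schema/rules/datatypes/*.yaml with a YAML parser), the data below is hand-copied and abridged from the hits in this thread, the `T1w` suffix for the anat rule is picked only for illustration, and both function names are hypothetical.

```python
from collections import defaultdict

# Abridged and hand-copied from the grep hits in this thread. The anat entry
# is included for contrast; its suffix is an illustrative assumption.
RULES = {
    "anat": [{"suffixes": ["T1w"], "extensions": [".nii.gz", ".nii", ".json"]}],
    "eeg": [{"suffixes": ["photo"], "extensions": [".jpg"]}],
    "ieeg": [{"suffixes": ["photo"], "extensions": [".jpg"]}],
    "micr": [{"suffixes": ["photo"], "extensions": [".jpg", ".png", ".tif"]}],
    "meg": [{"suffixes": ["markers"], "extensions": [".sqd", ".mrk"]}],
    "perf": [{"suffixes": ["asllabeling"], "extensions": [".jpg"]}],
}


def rules_without_json(rules):
    """Yield (datatype, suffixes, extensions) for rules that omit .json."""
    for datatype, groups in sorted(rules.items()):
        for group in groups:
            if ".json" not in group["extensions"]:
                yield datatype, group["suffixes"], group["extensions"]


def inconsistent_suffixes(rules):
    """Map each suffix to its per-datatype extensions when they differ."""
    seen = defaultdict(dict)
    for datatype, groups in rules.items():
        for group in groups:
            for suffix in group["suffixes"]:
                seen[suffix][datatype] = tuple(group["extensions"])
    return {
        suffix: by_type
        for suffix, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }


if __name__ == "__main__":
    for datatype, suffixes, extensions in rules_without_json(RULES):
        print(f"{datatype}: {suffixes} lacks .json ({extensions})")
    print("inconsistent:", sorted(inconsistent_suffixes(RULES)))
```

The second helper is there to surface cases like the _photo one, where the same suffix lists different extensions depending on the datatype.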

One nuance in "abstracting" it: unlike with .prov.jsonld, a generic sidecar has no prescribed keys to include, so having such a file would generally be "bogus"/useless from a standardization point of view. But since BIDS allows arbitrary keys, it would not be invalid.

BUT there are also datatypes with only .json:

```shell
src/schema/rules/datatypes/ieeg.yaml-- suffixes:
src/schema/rules/datatypes/ieeg.yaml-  - coordsystem
src/schema/rules/datatypes/ieeg.yaml:  extensions:
src/schema/rules/datatypes/ieeg.yaml-  - .json
... may be more ...
```

for https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/04-intracranial-electroencephalography.html#coordinate-system-json-_coordsystemjson -- which is pretty much a data/sidecar hybrid file... I think it is OK even to make the general rule "any non-.json file can have a .json sidecar file".
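As a sketch of that general rule: if .json were dropped from the per-rule extension lists and special-cased in code, the special case might look roughly like this. Both helpers and the file names are hypothetical, and this is not how the validator currently works.

```python
def effective_extensions(declared):
    """Add an implicit .json sidecar to any rule that has a non-.json extension."""
    exts = list(declared)
    has_data = any(ext != ".json" for ext in exts)
    if has_data and ".json" not in exts:
        exts.append(".json")
    return exts


def sidecar_name(filename, declared):
    """Return the implied sidecar name, or None for files that are .json only."""
    # Longest extension first, so that .nii.gz wins over a shorter suffix match.
    for ext in sorted(declared, key=len, reverse=True):
        if ext != ".json" and filename.endswith(ext):
            return filename[: -len(ext)] + ".json"
    return None
```

Under this reading a .json-only file such as _coordsystem.json gets no sidecar of its own, which matches treating it as a data/sidecar hybrid.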

effigies commented 2 years ago

What about a metafiles.yml that specifically addresses sidecar files that don't have an independent existence? For universal ones like JSON sidecars, no specific associations need to be defined. For others we can define things like:

1) Selectors needed to match
2) Entities that cannot be dropped while matching
3) Whether this metafile MAY/MUST have a sidecar of its own

sidecar:
  extension: .json

provenance:
  extension: .prov.jsonld

events:
  suffix: events
  extension: .tsv
  match-entities:
    task: REQUIRED  # Indicates that this entity can't be dropped
  sidecar: OPTIONAL  # There MAY be sidecar JSON for these metafiles
  associations:
    suffix: [bold, eeg, meg, ieeg, beh, pet]

continuous:
  suffix: [physio, stim]
  extension: .tsv.gz
  match-entities:
    task: REQUIRED
  sidecar: REQUIRED  # There MUST be sidecar JSON for these metafiles
  associations:
    suffix: [bold, eeg, meg, ieeg, beh, pet]  

Sorry, I don't think this specifically addresses the above discussion, but I wanted to write it somewhere vaguely relevant while it was in my head.
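For illustration only, one possible reading of the `events` entry above as matching logic. Assumptions: "can't be dropped" is taken to mean the metafile name must itself carry that entity, every entity on the metafile must agree with the data file, `suffix` is normalized to a list, and `parse_name` / `applies_to` and the file names are hypothetical.

```python
# The `events` entry from the metafiles.yml idea, with `suffix` as a list.
METAFILES = {
    "events": {
        "suffix": ["events"],
        "extension": ".tsv",
        "match-entities": {"task": "REQUIRED"},
        "sidecar": "OPTIONAL",
        "associations": {"suffix": ["bold", "eeg", "meg", "ieeg", "beh", "pet"]},
    },
}


def parse_name(filename):
    """Split 'sub-01_task-rest_bold.nii.gz' into (entities, suffix, extension)."""
    stem, dot, ext = filename.partition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix, dot + ext


def applies_to(metafile, datafile, rule):
    """Decide whether `metafile` is associated with `datafile` under `rule`."""
    meta_entities, meta_suffix, meta_ext = parse_name(metafile)
    data_entities, data_suffix, _ = parse_name(datafile)

    # 1) Selectors: the metafile must look like the rule describes, and the
    #    data file must have one of the associated suffixes.
    if meta_suffix not in rule["suffix"] or meta_ext != rule["extension"]:
        return False
    if data_suffix not in rule["associations"]["suffix"]:
        return False

    # 2) Entities that cannot be dropped must be present on the metafile.
    for entity, level in rule.get("match-entities", {}).items():
        if level == "REQUIRED" and entity not in meta_entities:
            return False

    # Every entity the metafile does carry must agree with the data file.
    return all(data_entities.get(k) == v for k, v in meta_entities.items())
```

The `sidecar: OPTIONAL`/`REQUIRED` key is not exercised here; it would only matter when checking the metafile's own JSON sidecar.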