bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

[BUG] lack of recommendations for datafiles that differ only by extension #1487

Open Remi-Gau opened 1 year ago

Remi-Gau commented 1 year ago

Describe your problem in detail.

Note that this issue may apply to more datatypes in BIDS, but I have not checked systematically.

As far as I can tell, the specification does not mention that files cannot differ only by their extension.

For example, modifying the micr_SEM BIDS example so that it contains the same data twice, in files that differ only by extension:

/home/remi/github/bids/examples/micr_SEM
├── dataset_description.json
├── participants.json
├── participants.tsv
├── README
├── samples.json
├── samples.tsv
└── sub-01
    ├── ses-01
    │   └── micr
    │       ├── sub-01_ses-01_sample-A_photo.jpg  <-- data: file 1
    │       ├── sub-01_ses-01_sample-A_photo.json
    │       ├── sub-01_ses-01_sample-A_photo.tif  <-- data: file 2
    │       ├── sub-01_ses-01_sample-A_SEM.json
    │       └── sub-01_ses-01_sample-A_SEM.png
    ├── ses-02
    ├── sub-01_sessions.json
    └── sub-01_sessions.tsv

From my current reading of the spec, this could be valid.

The BIDS validator does not complain about this either, apart from saying that not all subjects have the same number of files.

I have mostly checked with picture files *_photo.* (eeg, meg, micr) but it also seems to be the case for eeg files:

bids/examples/eeg_ds000117/sub-01/eeg
├── sub-01_coordsystem.json
├── sub-01_electrodes.tsv
├── sub-01_task-facerecognition_run-1_eeg.eeg <--- duplicate data file with different extension
├── sub-01_task-facerecognition_run-1_eeg.fdt
├── sub-01_task-facerecognition_run-1_eeg.set
├── sub-01_task-facerecognition_run-1_events.tsv
...

Am I missing something, or should this type of potential data duplication be disallowed?

Describe what you expected.

I would expect an error similar to the one the validator throws when both .nii and .nii.gz versions of a file exist:

[ERR] NIfTI file exist with both '.nii' and '.nii.gz' extensions. (code: 74 - DUPLICATE_NIFTI_FILES)
                ./sub-Sub103/perf/sub-Sub103_asl.nii
                ./sub-Sub103/perf/sub-Sub103_asl.nii.gz
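
A generalized form of this check could be sketched as follows. This is only an illustration, not the validator's actual implementation: the function names and the `SIDECAR_EXTENSIONS` set are hypothetical, and the sketch assumes that BIDS file names contain no dots before the extension.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical: extensions treated as sidecars/metadata that may
# legitimately share a stem with a data file.
SIDECAR_EXTENSIONS = {".json", ".tsv"}


def split_extension(filename):
    """Split at the first dot so that '.nii.gz' counts as one extension."""
    stem, dot, ext = filename.partition(".")
    return stem, dot + ext


def find_extension_duplicates(paths):
    """Map (directory, stem) to the data extensions found, if more than one."""
    groups = defaultdict(set)
    for path in paths:
        p = PurePosixPath(path)
        stem, ext = split_extension(p.name)
        if ext in SIDECAR_EXTENSIONS:
            continue  # sidecars are not data duplicates
        groups[(str(p.parent), stem)].add(ext)
    return {key: sorted(exts) for key, exts in groups.items() if len(exts) > 1}


files = [
    "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.jpg",
    "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.json",
    "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.tif",
    "sub-01/ses-01/micr/sub-01_ses-01_sample-A_SEM.json",
    "sub-01/ses-01/micr/sub-01_ses-01_sample-A_SEM.png",
]
print(find_extension_duplicates(files))
```

Such a naive check would also flag multi-file formats (for example .set with .fdt), so those would need to be exempted or handled separately.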

BIDS specification section

No response

Remi-Gau commented 1 year ago

If this type of data duplication is to be disallowed, it may be a good thing to:

For example, the following rendering may suggest that all 3 files can co-exist in the same dataset

https://bids-specification.readthedocs.io/en/latest/modality-specific-files/electroencephalography.html#landmark-photos-_photoextension

Screenshot from 2023-05-04 21-48-08

maybe better to have something like:

sub-<label>[_ses-<label>][_acq-<label>]_photo.[tif|png|jpg]

effigies commented 1 year ago

Okay, here's a proposal:

photo:
  suffixes:
    - photo
  extensions:
    - [.jpg, .png, .tif]
  datatypes:
    - eeg
    - ieeg
    - meg
    - nirs
  entities:
    subject: required
    session: optional
    acquisition: optional

photo__micr:
  $ref: rules.files.raw.photo.photo
  extensions:
    - [.jpg, .png, .tif]
    - .json
  datatypes:
    - micr
  entities:
    $ref: rules.files.raw.photo.photo.entities
    sample: required

Here, extensions that are in a list together are "the same kind": they are mutually exclusive, and the grouping distinguishes them from supplementary entries such as .json.

For NIfTI, we would do - [.nii, .nii.gz].
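
A minimal sketch of how a validator might interpret such a nested list, assuming the proposed semantics (a nested list is a mutually exclusive group, a bare string is a supplementary extension). The function name `exclusive_group_conflicts` is hypothetical and not part of any existing BIDS tool.

```python
def exclusive_group_conflicts(extensions_rule, present):
    """Return, for each nested group, the extensions present if more than one.

    extensions_rule: list whose entries are either a string (supplementary
        extension) or a list of strings (mutually exclusive group).
    present: set of extensions found for one file stem.
    """
    conflicts = []
    for entry in extensions_rule:
        if isinstance(entry, list):
            found = [ext for ext in entry if ext in present]
            if len(found) > 1:
                conflicts.append(found)
    return conflicts


# The proposed rule for photo files in the micr datatype.
photo_micr_rule = [[".jpg", ".png", ".tif"], ".json"]

# A .jpg and a .tif with the same stem conflict; the .json sidecar does not.
print(exclusive_group_conflicts(photo_micr_rule, {".jpg", ".tif", ".json"}))
# A single image plus its sidecar is fine.
print(exclusive_group_conflicts(photo_micr_rule, {".png", ".json"}))
```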

Remi-Gau commented 1 year ago

BUT...

For EEG:

eeg:
  suffixes:
    - eeg
  extensions:
    - .json
    - .edf
    - .vhdr
    - .vmrk
    - .eeg
    - .set
    - .fdt
    - .bdf
  datatypes:
    - eeg
  entities:
    subject: required
    session: optional
    task: required
    acquisition: optional
    run: optional

effigies commented 1 year ago

I think we could do something like:

extensions:
  - .json
  - [ .edf, .eeg, .set, .bdf ]
  - .vhdr
  - .vmrk
  - .fdt

And then just use a couple of checks to say that if any of .eeg, .vhdr or .vmrk exists, then all three must exist. And if .fdt exists, then .set must exist.
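
Those two checks could look roughly like this. It is a sketch under the assumptions stated in this comment, not existing validator code; `REQUIRED_TOGETHER`, `REQUIRES` and `check_companions` are hypothetical names.

```python
# Hypothetical encoding of the two rules described above.
REQUIRED_TOGETHER = [{".eeg", ".vhdr", ".vmrk"}]  # all or none
REQUIRES = {".fdt": ".set"}  # key may only exist if value exists


def check_companions(present):
    """Return (present, missing) pairs for each violated rule.

    present: set of extensions found for one file stem.
    """
    problems = []
    for group in REQUIRED_TOGETHER:
        found = group & present
        if found and found != group:
            problems.append((sorted(found), sorted(group - present)))
    for ext, needed in REQUIRES.items():
        if ext in present and needed not in present:
            problems.append(([ext], [needed]))
    return problems


# The modified eeg_ds000117 example: a lone .eeg next to .set/.fdt.
print(check_companions({".eeg", ".fdt", ".set"}))
# A complete BrainVision triplet passes.
print(check_companions({".vhdr", ".vmrk", ".eeg", ".json"}))
```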

sappelhoff commented 1 year ago

👍

and for:

For file formats that are based on several files of different extensions, or a directory of files with different extensions (multi-file file formats), only that file SHOULD be listed that would also be passed to analysis software for reading the data. For example for BrainVision data (.vhdr, .vmrk, .eeg), only the .vhdr SHOULD be listed; for EEGLAB data (.set, .fdt), only the .set file SHOULD be listed; and for CTF data (.ds), the whole .ds directory SHOULD be listed, and not the individual files in that directory.

(see: https://bids-specification.readthedocs.io/en/latest/modality-agnostic-files.html#scans-file)
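
As an illustration of that scans-file recommendation, the selection of the "primary" file could be sketched like this. The mapping `COMPANION_OF` and the function `scans_entries` are hypothetical; the sketch covers only the three formats named in the quoted text and assumes normalized, POSIX-style relative paths.

```python
from pathlib import PurePosixPath

# Hypothetical map from a companion extension to the extension of the
# file that would be passed to analysis software.
COMPANION_OF = {".vmrk": ".vhdr", ".eeg": ".vhdr", ".fdt": ".set"}


def scans_entries(paths):
    """Keep only the paths that should be listed for multi-file formats."""
    names = set(paths)
    entries = []
    for path in sorted(names):
        p = PurePosixPath(path)
        # Files inside a CTF .ds directory are not listed individually.
        if any(part.endswith(".ds") for part in p.parts[:-1]):
            continue
        # Drop a companion file when its primary file is present.
        primary_ext = COMPANION_OF.get(p.suffix)
        if primary_ext and str(p.with_suffix(primary_ext)) in names:
            continue
        entries.append(path)
    return entries


example = [
    "eeg/sub-01_task-rest_eeg.vhdr",
    "eeg/sub-01_task-rest_eeg.vmrk",
    "eeg/sub-01_task-rest_eeg.eeg",
    "meg/sub-01_task-rest_meg.ds",
    "meg/sub-01_task-rest_meg.ds/sub-01_task-rest_meg.meg4",
]
print(scans_entries(example))
```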