[BUG] lack of recommendations for datafiles that differ only by extension

Remi-Gau commented 1 year ago

Describe your problem in detail.

Note that this issue may apply to more datatypes in BIDS, but I have not checked systematically.

As far as I can tell, the specification does not mention that files cannot differ only by their extension.

For example, modifying the micr_SEM BIDS example so that it contains the same data twice, differing only by extension:

/home/remi/github/bids/examples/micr_SEM
├── dataset_description.json
├── participants.json
├── participants.tsv
├── README
├── samples.json
├── samples.tsv
└── sub-01
    ├── ses-01
    │   └── micr
    │       ├── sub-01_ses-01_sample-A_photo.jpg  < -- data: file 1
    │       ├── sub-01_ses-01_sample-A_photo.json
    │       ├── sub-01_ses-01_sample-A_photo.tif  < -- data: file 2
    │       ├── sub-01_ses-01_sample-A_SEM.json
    │       └── sub-01_ses-01_sample-A_SEM.png
    ├── ses-02
    ├── sub-01_sessions.json
    └── sub-01_sessions.tsv

From my current reading of the spec, this could be valid.

The BIDS validator does not complain about this either, apart from saying that not all subjects have the same number of files.

I have mostly checked with picture files *_photo.* (eeg, meg, micr) but it also seems to be the case for eeg files:

bids/examples/eeg_ds000117/sub-01/eeg
├── sub-01_coordsystem.json
├── sub-01_electrodes.tsv
├── sub-01_task-facerecognition_run-1_eeg.eeg <--- duplicate data file with different extension
├── sub-01_task-facerecognition_run-1_eeg.fdt
├── sub-01_task-facerecognition_run-1_eeg.set
├── sub-01_task-facerecognition_run-1_events.tsv
...

Am I missing something? If not, maybe this type of potential data duplication should be disallowed.

Describe what you expected.

I would expect an error similar to the one for .nii and .nii.gz, where the validator throws this:

[ERR] NIfTI file exist with both '.nii' and '.nii.gz' extensions. (code: 74 - DUPLICATE_NIFTI_FILES)
                ./sub-Sub103/perf/sub-Sub103_asl.nii
                ./sub-Sub103/perf/sub-Sub103_asl.nii.gz
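A generic version of that check could look roughly like the sketch below. This is not the validator's implementation; the function name and the choice to ignore only `.json` sidecars are assumptions made for illustration. It groups files by everything before the first dot in the filename and reports stems that have more than one data file. Note that a check this naive would also flag legitimate multi-file formats such as the BrainVision triplet discussed further down the thread.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: only .json sidecars may legitimately share a stem with a data file.
SIDECAR_EXTENSIONS = {".json"}


def find_extension_only_duplicates(paths):
    """Return {stem: [files]} for data files that differ only by extension."""
    by_stem = defaultdict(list)
    for path in map(PurePosixPath, paths):
        # Treat everything after the first dot as the extension,
        # so ".nii.gz" is handled as a single extension.
        stem, _, ext = path.name.partition(".")
        if not ext or f".{ext}" in SIDECAR_EXTENSIONS:
            continue
        by_stem[str(path.parent / stem)].append(str(path))
    return {stem: sorted(files) for stem, files in by_stem.items() if len(files) > 1}


if __name__ == "__main__":
    example = [
        "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.jpg",
        "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.json",
        "sub-01/ses-01/micr/sub-01_ses-01_sample-A_photo.tif",
    ]
    for stem, files in find_extension_only_duplicates(example).items():
        print(stem, "->", files)
```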

BIDS specification section

No response

Remi-Gau commented 1 year ago

If this type of data duplication is to be disallowed, it may be a good thing to:

mention this in the spec: in the part where extensions are defined? Somewhere else?
improve the way files that allow several extensions are rendered by the filename pattern macros:

For example, the following rendering may suggest that all 3 files can co-exist in the same dataset

https://bids-specification.readthedocs.io/en/latest/modality-specific-files/electroencephalography.html#landmark-photos-_photoextension

Screenshot from 2023-05-04 21-48-08

Maybe it would be better to have something like:

sub-<label>[_ses-<label>][_acq-<label>]_photo.[tif|png|jpg]

effigies commented 1 year ago

Okay, here's a proposal:

photo:
  suffixes:
    - photo
  extensions:
    - [.jpg, .png, .tif]
  datatypes:
    - eeg
    - ieeg
    - meg
    - nirs
  entities:
    subject: required
    session: optional
    acquisition: optional

photo__micr:
  $ref: rules.files.raw.photo.photo
  extensions:
    - [.jpg, .png, .tif]
    - .json
  datatypes:
    - micr
  entities:
    $ref: rules.files.raw.photo.photo.entities
    sample: required

Here, the extensions that are in a list together are "the same kind" and so mutually exclusive and distinguishable from supplementary entries, such as .json.

For NIfTI, we would do - [.nii, .nii.gz].
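To make the proposed semantics concrete, here is a minimal sketch of how a checker might read such a rule, assuming the YAML has already been loaded into Python lists. The function name is hypothetical and not part of the BIDS schema tooling: a nested list is treated as a mutually exclusive group, while a plain string such as `.json` is left unconstrained.

```python
def exclusive_group_conflicts(extension_rules, present):
    """Return the groups of mutually exclusive extensions that co-occur.

    extension_rules: list whose items are either a string (supplementary
    extension) or a list of strings (mutually exclusive alternatives).
    present: set of extensions found for one filename stem.
    """
    conflicts = []
    for rule in extension_rules:
        if isinstance(rule, list):
            found = [ext for ext in rule if ext in present]
            if len(found) > 1:
                conflicts.append(found)
    return conflicts


# The photo__micr rule from the proposal above, as loaded from YAML.
PHOTO_MICR_EXTENSIONS = [[".jpg", ".png", ".tif"], ".json"]

if __name__ == "__main__":
    print(exclusive_group_conflicts(PHOTO_MICR_EXTENSIONS, {".jpg", ".tif", ".json"}))
```

With this reading, a stem that has both `.jpg` and `.tif` produces one conflict, while `.png` plus `.json` produces none.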

Remi-Gau commented 1 year ago

BUT...

For EEG:

these work as a triplet of files that go together: .vhdr, .vmrk, .eeg
and a .set file with an OPTIONAL .fdt

eeg:
  suffixes:
    - eeg
  extensions:
    - .json
    - .edf
    - .vhdr
    - .vmrk
    - .eeg
    - .set
    - .fdt
    - .bdf
  datatypes:
    - eeg
  entities:
    subject: required
    session: optional
    task: required
    acquisition: optional
    run: optional

effigies commented 1 year ago

I think we could do something like:

extensions:
  - .json
  - [ .edf, .eeg, .set, .bdf ]
  - .vhdr
  - .vmrk
  - .fdt

And then just use a couple of checks to say that if any of .eeg, .vhdr or .vmrk exists, then they all exist. And if .fdt exists, then .set exists.
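Those two checks could be sketched as follows. The function name and message wording are hypothetical, and the input is assumed to be the set of extensions found for a single filename stem.

```python
BRAINVISION = {".vhdr", ".vmrk", ".eeg"}


def companion_file_errors(present):
    """Check the two companion-file rules for one filename stem."""
    present = set(present)
    errors = []
    # Rule 1: if any BrainVision file exists, all three must exist.
    found = BRAINVISION & present
    if found and found != BRAINVISION:
        missing = sorted(BRAINVISION - present)
        errors.append(f"BrainVision files incomplete, missing: {', '.join(missing)}")
    # Rule 2: a .fdt file requires its .set file (the reverse is not required).
    if ".fdt" in present and ".set" not in present:
        errors.append(".fdt found without its .set file")
    return errors


if __name__ == "__main__":
    # The modified eeg_ds000117 example: a stray .eeg next to .set/.fdt.
    print(companion_file_errors({".eeg", ".fdt", ".set"}))
```

For the modified eeg_ds000117 example from the top of the thread, rule 1 fires because `.eeg` is present without `.vhdr` and `.vmrk`.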

sappelhoff commented 1 year ago

👍

and for:

eeg, vhdr, vmrk --> vhdr SHOULD be listed in scans.tsv
set, fdt --> set SHOULD be listed in scans.tsv

For file formats that are based on several files of different extensions, or a directory of files with different extensions (multi-file file formats), only that file SHOULD be listed that would also be passed to analysis software for reading the data. For example for BrainVision data (.vhdr, .vmrk, .eeg), only the .vhdr SHOULD be listed; for EEGLAB data (.set, .fdt), only the .set file SHOULD be listed; and for CTF data (.ds), the whole .ds directory SHOULD be listed, and not the individual files in that directory.

(see: https://bids-specification.readthedocs.io/en/latest/modality-agnostic-files.html#scans-file)
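As a rough illustration of that recommendation, the sketch below filters a list of filenames down to the ones that SHOULD appear in scans.tsv. The helper name is hypothetical, and it only covers the two multi-file formats with separate extensions named above (BrainVision and EEGLAB); CTF `.ds` directories are not handled.

```python
from pathlib import PurePosixPath

# Assumption: extensions of files that accompany an entry-point file
# (.vhdr for BrainVision, .set for EEGLAB) and are not listed themselves.
COMPANION_ONLY = {".vmrk", ".eeg", ".fdt"}


def scans_filenames(filenames):
    """Keep only the files that would be passed to analysis software."""
    return [
        name
        for name in filenames
        if PurePosixPath(name).suffix not in COMPANION_ONLY
    ]


if __name__ == "__main__":
    print(
        scans_filenames(
            [
                "eeg/sub-01_task-rest_eeg.vhdr",
                "eeg/sub-01_task-rest_eeg.vmrk",
                "eeg/sub-01_task-rest_eeg.eeg",
            ]
        )
    )
```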
