bids-standard / bids-validator

Validator for the Brain Imaging Data Structure
https://bids-standard.github.io/bids-validator/
MIT License
1 stars 4 forks source link

warn/error on bidsignore entries matching "legit" files #75

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

I initially thought that it might be more of interest to OpenNeuro itself, but then decided that it better be addressed at the validator level.

Look through .bidsignore files among OpenNeuro datasets: ```shell $> grep sub */.bidsignore ds000117/.bidsignore:**/sub-*_ses-mri_run-*_echo-*_FLASH.nii.gz ds000224/.bidsignore:sub-*/ses-struct*/anat/*veno* ds000234/.bidsignore:sub-01/func/ ds000234/.bidsignore:sub-02/func/ ds000234/.bidsignore:sub-03/func/ ds000234/.bidsignore:sub-04/func/ ds000234/.bidsignore:sub-05/func/ ds000235/.bidsignore:sub-01/func/ ds000235/.bidsignore:sub-02/func/ ds000235/.bidsignore:sub-03/func/ ds000235/.bidsignore:sub-04/func/ ds000236/.bidsignore:sub-01/func/ ds000236/.bidsignore:sub-02/func/ ds000236/.bidsignore:sub-03/func/ ds000236/.bidsignore:sub-04/func/ ds000236/.bidsignore:sub-05/func/ ... ds000240/.bidsignore:sub-61/func/sub-61_task-restEyesOpen_asl.nii.gz ds000240/.bidsignore:sub-62/func/sub-62_task-restEyesOpen_asl.nii.gz ds000240/.bidsignore:sub-63/func/sub-63_task-restEyesOpen_asl.nii.gz ds000248/.bidsignore:**/sub-*_acq-flipangle*-*_MEFLASH.nii.gz ds001219/.bidsignore:/sub-*/EEG/** ds001420/.bidsignore:sub-01/ses-01/pet/ ds001420/.bidsignore:sub-01/ses-02/pet/ ds001420/.bidsignore:sub-02/ses-01/pet/ ds001420/.bidsignore:sub-02/ses-02/pet/ ds001421/.bidsignore:sub-01/ses-01/pet/ ds001421/.bidsignore:sub-01/ses-02/pet/ ds001454/.bidsignore:sub-01/ses-*/beh ds001454/.bidsignore:sub-02/ses-*/beh ds001454/.bidsignore:sub-03/ses-*/beh ds001454/.bidsignore:sub-04/ses-*/beh ds001454/.bidsignore:sub-05/ses-*/beh ds001454/.bidsignore:sub-06/ses-*/beh ds001454/.bidsignore:sub-07/ses-*/beh ds001454/.bidsignore:sub-08/ses-*/beh ds001454/.bidsignore:sub-09/ses-*/beh ds001454/.bidsignore:sub-10/ses-*/beh ds001454/.bidsignore:sub-11/ses-*/beh ds001454/.bidsignore:sub-12/ses-*/beh ds001454/.bidsignore:sub-13/ses-*/beh ds001454/.bidsignore:sub-14/ses-*/beh ds001454/.bidsignore:sub-15/ses-*/beh ds001454/.bidsignore:sub-16/ses-*/beh ds001454/.bidsignore:sub-17/ses-*/beh ds001454/.bidsignore:sub-18/ses-*/beh ds001454/.bidsignore:sub-19/ses-*/beh ds001454/.bidsignore:sub-20/ses-*/beh ds001454/.bidsignore:sub-21/ses-*/beh ds001454/.bidsignore:sub-22/ses-*/beh ds001454/.bidsignore:sub-23/ses-*/beh ds001454/.bidsignore:sub-24/ses-*/beh ds001512/.bidsignore:/sub-*/EEG/** ds001547/.bidsignore:sub*/stat/*.nii ds001750/.bidsignore:**/sub-*/ses-*/meg/sub-*_ses-*_task-*_meg.ds/sub-*_ses-*_task-*_meg.eeg ds001750/.bidsignore:**/sub-*/ses-*/meg/sub-*_ses-*_task-*_meg.ds/bad.segments ds001750/.bidsignore:**/sub-*/ses-*/meg/sub-*_ses-*_task-*_meg.ds/BadChannels ds001750/.bidsignore:**/sub-*/ses-*/meg/sub-*_ses-*_task-*_meg.ds/hz*.ds/BadChannels ds001750/.bidsignore:**/sub-*/ses-*/behavdata/* ds001750/.bidsignore:**/sub-*/ses-*/hpi/* ds001769/.bidsignore:**/sub-*/ses-movie/func/*_desc-denoisedSm5_bold.nii.gz ds001769/.bidsignore:**/sub-*/ses-movie/func/*_desc-denoisedUnsm_bold.nii.gz ds001769/.bidsignore:**/sub-*/ses-movie/func/*_desc-denoisedSm5_bold.json ds001769/.bidsignore:**/sub-*/ses-movie/func/*_desc-denoisedUnsm_bold.json ds001769/.bidsignore:**/sub-*/ses-movie/func/*_components.nii.gz ds001769/.bidsignore:**/sub-*/ses-movie/func/*_mixing.tsv ds001769/.bidsignore:**/sub-*/ses-movie/func/*_decomposition.json ds001769/.bidsignore:**/sub-*/ses-movie/func/*_componentLabels.txt ds001840/.bidsignore:/sub-*/eyetrack ds001980/.bidsignore:/sub-*/fmap/sub-*_acq-multiband_dir-PA_dwi.nii.gz ds002015/.bidsignore:**/anat/sub*.nii.gz ds002015/.bidsignore:**/anat/sub*.json ds002016/.bidsignore:**/anat/sub*.nii.gz ds002016/.bidsignore:**/anat/sub*.json ```

and you can see, that some smart scientists decided to just "workaround" bids-validator by pretty much matching many if not all files, which otherwise are, or should be, BIDS compliant. (e.g. patterns such as **/anat/sub*.nii.gz). So I wondered if bids-validator may be somehow could outsmart them? E.g. whenever bidsignore entry matches a file which is legit BIDS file, it might as well worth raising an error to prevent such "swallow them all" entries to bidsignore

effigies commented 5 years ago

We've thought about how to handle situations like that. I'm happy to have the validator make a warning (which could be raised to an error by flag), but I wonder how well we're going to be able to out-smart a human with regexes. Our earlier thought was to set limits on either the proportion of files or proportion of data by bytes that were ignored.

Related: https://github.com/OpenNeuroOrg/openneuro/issues/1199

effigies commented 5 years ago

Also, this seems related to https://github.com/bids-standard/bids-validator/issues/836, which will also poke its nose behind the previously forbidden curtain of .bidsignore.

yarikoptic commented 5 years ago

Our earlier thought was to set limits on either the proportion of files or proportion of data by bytes that were ignored.

warning on "too many ignored files" could also be of value, although would be hard to define sensible threshold. That is why I thought about using "legit" BIDS files as a somewhat litmus paper. If bidsignore expression causes legit files to be ignored -- must be too wide, could cause troubles down the end (user expects a valid bids file to be used/processed, but it is ignored) etc

effigies commented 5 years ago

So the logic would move from:

if not ignore(file):
    validate(file)

To:

valid = validate(file)
ignored = ignore(file)
if not valid and not ignored:
    error
if valid and ignored:
    warn/error

This will definitely cause slowdowns, as right now we can save time by not trying to validate ignored files. But bids-standard/bids-validator#77 is going to have to do something similar, as well.

yarikoptic commented 5 years ago

yeah, no free lunch anywhere BUT in the datasets without any ignores the speed will stay the same! So you could treat it as a feature -- those who are not compliant must suffer (i.e. wait) longer ;)

effigies commented 4 years ago

Preserving part of bids-standard/legacy-validator#1103 in this issue:

I would be happy to see a (configurable) flag to limit .bidsignored files to some file count, total file size, or proportion of total dataset file count or size. I am not yet familiar enough with the code base to propose a mechanism, and previous discussions have not led to anybody proposing or critiquing a heuristic.

I would also be okay with setting restrictions on the entries to .bidsignore to prevent over-broad ignores. This could let us require users to say more specifically sub-*/EEG/*.mat, and give us the chance to identify likely misuses. I'm not really sure how plausible that is.

effigies commented 2 weeks ago

It should be possible to add a check over ignored filenames and see if they match any filename rules.