bids-standard / bids-validator

Validator for the Brain Imaging Data Structure
https://bids-standard.github.io/bids-validator/
MIT License
1 stars 4 forks source link

add check for case collision intollerance #49

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

original proposal as a PR in BIDS: https://github.com/bids-standard/bids-specification/pull/858 seems to be getting approved and if merged soon will enter a release after v1.6.0-203-g811e17e2 ,

sappelhoff commented 3 years ago

bids-standard/bids-specification#858 has been merged, so the corresponding check in the validator could/should be worked on. Suggestions on where to start would be very welcome @rwblair :-)

effigies commented 3 years ago

I think we need to think through the logic of the check before we worry about where to implement.

Here's a proposal that only cares about collisions within directories:

def check_dir(dirname):
    files = listdir(dirname)
    if len(set(fname.lower() for fname in files)) != len(files):
        # report collision
    for fname in files:
        if isdir(fname):
            check_dir(fname)

It should catch colliding subjects, but not something like sub-s01/anat/sub-S01_..., and it won't care about if one subject uses mixed cases for its acquisition parameters and another uses lowercase. Do we care about these? They might be caught by other checks, in any case.

Here's another edge case: sub-01/anat/sub-01_acq-HiRes_T1w.nii.gz and sub-01/func/sub-01_task-rest_acq-hires_bold.nii.gz. I think this this should not be flagged, as it's not a collision, but if we implemented something like a global collision test for each entity, it would be. And others might think it should be.

yarikoptic commented 3 years ago

Since the issue is relatively rare, I think it might suffice to just operate on a full list of files/directories in the dataset

def check_paths(paths):
    lowered = defaultdict(list)
    for path in paths:
        path = path.rstrip('/')  # just to harnomize among possible directories with/without trailing /?
        lowered[path.lower()].append(path)
    for lower, origs in lowered.items():
        if len(origs) > 1:
            # report/return collision for `origs`  -> `lower`

note: paths better not to be "dereferenced" (so git-annex'ed symlinks remain symlinks)

rwblair commented 3 years ago

Regardless of if the validator is run via browser or node we create a dictionary with all the files as keys in the fileArrayToObject function in utils/files/readDir.js. Filenames could be normalized here and we could check for collisions, but we currently do not create/return issues at this stage of validation.

There is precedent for some validation happening in the ./utils/files/ directory with /utils/files/validateMisc.js. So one option would be adding a new file to do this validation into ./utils/file/ and then calling the new validator function in the ./validators/bids/fullTest.js similar to how validateMisc does.

effigies commented 3 years ago

Would either of these checks catch @mattcieslak's motivating use case?

nellh commented 3 years ago

Regardless of if the validator is run via browser or node we create a dictionary with all the files as keys in the fileArrayToObject function in utils/files/readDir.js. Filenames could be normalized here and we could check for collisions, but we currently do not create/return issues at this stage of validation.

There is precedent for some validation happening in the ./utils/files/ directory with /utils/files/validateMisc.js. So one option would be adding a new file to do this validation into ./utils/file/ and then calling the new validator function in the ./validators/bids/fullTest.js similar to how validateMisc does.

This check should consider ignored files, so it will need to be near the start of fullTest before they are removed.

mattcieslak commented 3 years ago

In our case there were two subjects that had the same ID but with different capitalization (sub-ndar, sub-NDAR) in the root bids directory. All the files within their directories matched the case of their subject directory name

yarikoptic commented 3 years ago

This check should consider ignored files, so it will need to be near the start of fullTest before they are removed.

"not sure" it should anyhow consult ignored files -- the problem from the collisions happens at the filesystem level, so even if e.g. sub-NDAR is ignored while sub-ndar not -- collision while downloading/datalad clone etc that dataset would result in trouble.