bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License
219 stars 122 forks source link

Incorrect metadata may be returned in non-compliant datasets that have two valid sidecars for a single target in the same directory #530

Open bpinsard opened 4 years ago

bpinsard commented 4 years ago

pybids version: 0.9.4

I have corresponding jsons and niftis in a BIDS folder from heudiconv, but nifti.get_metadata often returns a value that is the opposite of what the json contains.

In [54]: json.get_dict()['PhaseEncodingDirection']                    
Out[54]: 'j'                                                               

In [55]: nii.get_metadata()['PhaseEncodingDirection']
Out[55]: 'j-'

I know this was a problem discussed in https://github.com/bids-standard/pybids/issues/40 but there are not much information about what was the problem and how it was fixed. The fact that the whole BIDS folder get scanned at the beginning and is mashed up in SQLAlchemy, which is very obscure for me, makes it very difficult to debug.

Is there a simple explanation of how things are loaded in pybids from jsons/filenames... ? What takes precedence? And how this kind of gross mismatch can occur?

Many BIDS-Apps rely on pybids to load parameters for preprocessing, and this could cause huge problems.

Thanks a lot.

tyarkoni commented 4 years ago

Which one of these is correct? And can you please paste: (a) the code you're using to construct the json and nii objects, and (b) the contents of the JSON files that (you think should) contain the corresponding values? Thanks!

bpinsard commented 4 years ago

The json one is correct.

layout=bids.BIDSLayout('./generic/',validate=False)

json=layout.get(subject='01',session='vid001',suffix="sbref",datatype='func',extension='json',task='bournesupremacy',run=1)[0]
nii=layout.get(subject='01',session='vid001',suffix="sbref",datatype='func',extension='nii.gz',task='bournesupremacy',run=1)[0]

nii.filename
'sub-01_ses-vid001_task-bournesupremacy_run-01_sbref.nii.gz'

json.filename
'sub-01_ses-vid001_task-bournesupremacy_run-01_sbref.json'

nii.get_metadata()['PhaseEncodingDirection']
'j'
json.get_dict()['PhaseEncodingDirection']
'j-'

Here is a clipped version of the json file:

...
 'PartialFourier': 1,                                           
 'PatientPosition': 'HFS',                                  
 'PercentPhaseFOV': 100,         
 'PhaseEncodingDirection': 'j-',      
 'PhaseEncodingSteps': 104,
 'PhaseResolution': 1,          
 'PixelBandwidth': 2290,  
...
tyarkoni commented 4 years ago

Thanks! Is there any JSON file in the hierarchy with a value of 'j'? I'm trying to determine whether it's reading from the wrong file, or clipping characters from the end of the right one...

tyarkoni commented 4 years ago

Can you try nii.get_associations() and paste the result? And if there are other JSON files in the list, take a look to see if any have a value of 'j'. It's possible some recent indexing changes in 0.9 screwed up the order in which metadata associations are registered.

tyarkoni commented 4 years ago

I also notice that the outputs in your two snippets don't align... in the first snippet, the json file has 'j' and the nifti image has 'j-'... in your second snippet, that flips. Is that correct, or was that a typo? if it's correct, that would probably corroborate my hunch that the indexing order is non-deterministic and metadata is getting mapped in the order files are encountered, even if it's incorrect. Let me know if that's right (you can try re-indexing a few times and seeing if the values change).

I'm not able to reproduce this on my end yet, but will try a couple more things...

bpinsard commented 4 years ago
nii.get_associations()
[<BIDSJSONFile filename='/data/neuromod/DATA/neuromod/video5/generic/sub-01/ses-vid001/func/sub-01_ses-vid001_task-bournesupremacy_run-01_sbref.json'>,
 <BIDSImageFile filename='/data/neuromod/DATA/neuromod/video5/generic/sub-01/ses-vid001/func/sub-01_ses-vid001_run-01_sbref.nii.gz'>,
 <BIDSImageFile filename='/data/neuromod/DATA/neuromod/video5/generic/sub-01/ses-vid001/func/sub-01_ses-vid001_task-bournesupremacy_run-01_bold.nii.gz'>]

The only json is the one that matches the filename.

The output flip is because I ran it on another subdataset, sorry about that.

Just to be sure: pybids does not use metadata embedded in dcmstack nifti extension? if there is a json task descriptor at the BIDS folder root, does it have lower precedence on the json that matches the filename?

How can I rerun the indexing? (which is about 10min on my dataset). Unrelated: is there a simple way to flush dangling SQLAlchemy transactions? (StatementError: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back)

tyarkoni commented 4 years ago

pybids does not use metadata embedded in dcmstack nifti extension?

It does not.

if there is a json task descriptor at the BIDS folder root, does it have lower precedence on the json that matches the filename?

It does. Or at least, it should. We have tests for this, which are all currently passing. Plus I can't reproduce your issue manually, after some experimentation (e.g., by putting a conflicting metadata file lower down).

You haven't made changes to the project (e.g., overwrite old metadata files) without resetting the DB, have you? I figure not, but just checking.

If you don't mind re-indexing, please do that... You can just reinitialize the BIDSLayout(). Maybe pass reset_database=True, just in case, though I doubt that will change anything.

Unrelated: is there a simple way to flush dangling SQLAlchemy transactions? (StatementError: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back)

That shouldn't be happening... I wonder if it's related to #523. Please open a separate issue describing the queries that trigger the problem. Thanks for your help!

tyarkoni commented 4 years ago

I notice you have both sub-01_ses-vid001_run-01_sbref.nii.gz and sub-01_ses-vid001_task-bournesupremacy_run-01_sbref.nii.gz files in the same directory... I'm wondering if that's what's causing the problem. I don't know that this is technically against the spec, but it seems counterintuitive, and I wonder if that's where the issue's coming from. What's the rationale for this?

effigies commented 4 years ago

task- is non-optional for sbref: https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/01-magnetic-resonance-imaging-data.html#task-including-resting-state-imaging-data

tyarkoni commented 4 years ago

Hah, I must be blind—I was literally just staring at those templates for like 2 minutes and still couldn't see task. :)

@bpinsard is there a JSON sidecar for sub-01_ses-vid001_run-01_sbref.nii.gz, and if so, does it have the wrong "PhaseEncodingDirection"? If so, I think what's going on here is that you've fallen into a weird edge case... I'm guessing that pybids doesn't expect to see two valid metadata files at the same level of the hierarchy, so it just takes the first one it encounters, which would be the wrong one.

If that's right, it's not technically a bug, inasmuch as the spec is clear that you can only have one relevant metadata file at any given level. It would also explain why we haven't encountered this before. That said, it would still probably be wise to add a check for this condition during indexing, as it will almost certainly come up again in future.

If that's not it, then I'm still stumped...

bpinsard commented 4 years ago

It does have the opposite phase direction, so this could be the cause. What is weird is that only the nii.gz is listed in the associations and not the json. I admit that I am lost on how pybids parses the structure.

By "only have one relevant metadata file at any given level" do you mean a single json in the whole /func folder ? I thought each nifti can have it's own json, at least that's what heudiconv gives.

tyarkoni commented 4 years ago

It does have the opposite phase direction, so this could be the cause.

Ah, good (from my perspective, at least). That's probably it then.

What is weird is that only the nii.gz is listed in the associations and not the json.

If you do get_associations(include_parents=True), it'll probably show up. Association through inheritance is suppressed by default to prevent long lists. This still leaves open the question of why the more distant file would be getting priority.

By "only have one relevant metadata file at any given level" do you mean a single json in the whole /func folder ? I thought each nifti can have it's own json, at least that's what heudiconv gives.

Neither... the rule is that "no more than one applicable file may be defined at a given level". So you can have as many sidecars in a directory as you like, so long as it's never the case that two of them apply to the same target. In this case, the problem (well, aside from the fact that you've omitted the mandatory task keyword) is that both of the JSON files validly apply to your nifti image. See Example 1 in the linked doc, which parallels the problem here.

I'm going to leave this issue open for now, as it's worth adding some extra validation to make sure this doesn't happen again in future. But since I think it can only happen in a case where the dataset is violating the spec, I probably won't get to it right away. I appreciate you catching it though! (And do feel free to open an issue for the other SQL issue, which seems unrelated.)

tyarkoni commented 4 years ago

BTW, this is definitely something that should be caught by the BIDS validator (both the missing task, and the presence of two sidecars). If it isn't, please open an issue on the validator repo. If it is, please make sure to run your dataset against the validator first in future. ;)

bpinsard commented 4 years ago

Ok thanks a lot for the detailed explanation! If I understood correctly by level you mean entity, so if a json doesn't have a task defined it applies to all tasks.

tyarkoni commented 4 years ago

I meant directory. If a given file has 5 defined entities, then the natural place to look for a sidecar first is a JSON file that has exactly the same name other than the extension (i.e., it shares the same 5 entities). But if the exact matching sidecar is missing, and there's another valid JSON file that shares 4 of the same defined entities (and values) as the target file (and has no unique ones itself), then the assumption is that that's the sidecar to use for the target. What the rule precludes is having multiple sidecar files in the same directory whose entities either perfectly match the target file's entities, or form a strict subset.