bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License
219 stars 121 forks source link

derivatives=True breaks BIDSLayout #556

Open christian-oreilly opened 4 years ago

christian-oreilly commented 4 years ago
from pathlib import Path
from bids import BIDSLayout
import bids

for i in range(10):
    layout = BIDSLayout(Path(bids.__file__).parent / "tests" / "data" / "7t_trt", database_path=(Path.home()),
                        derivatives=True, reset_database=True)
    print(layout.entities["subject"].count())

gives something like

0
0
10
10
0
10
10
0
10
10

Removing the derivatives=True from the call to BIDSLayout, we obtain

10
10
10
10
10
10
10
10
10
10
adelavega commented 4 years ago

I can't reproduce your problem using your example:

...: from pathlib import Path 
...: from bids import BIDSLayout 
...: import bids 
...:  
...: for i in range(10): 
...:     layout = BIDSLayout(Path(bids.__file__).parent / "tests" / "data" / "7t_trt", database_path=(Path.home()), 
...:                         derivatives=True, reset_database=True) 
...:     print(layout.entities["subject"].count()) 
...:                                                                                                                                                                                                     
10
10
10
10
10
10
10
10
10
10
adelavega commented 4 years ago

Tried it on ds005 and I'm seeing something similar. I'll investigate.

adelavega commented 4 years ago

I tried this with the database off, and this error persists, so at least we can rule out the loading/saving as being the problem.

 ...: for i in range(10): 
 ...:     layout = BIDSLayout(Path(bids.__file__).parent / "tests" / "data" / "ds005", derivatives=True) 
 ...:     print(layout.entities["subject"].count()) 
 ...:                    
1
16
1
1
1
1
1
1
1
16

The correct value here should be 16.

When it returns one, the following still works:

In [106]: layout.get_subjects()                                                                                                                                                                              
Out[106]: 
['01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '09',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16']

even though:


In [107]: print(layout.entities["subject"].unique())                                                                                                                                                         
['01']
tyarkoni commented 4 years ago

Ah, heh, I see what's going on. When you request layout.entities['subject'] with multiple layouts in scope (i.e., base and derivatives), the results are accumulated in a single dict. So, depending on the order layouts are returned (which will be random), you're actually calling count() on a different Entity object. The discrepancy in numbers above is because the derivative dataset has fewer subjects than the base dataset.

This will likely be a bit of a pain to solve, because the natural semantics here (count unique subjects over all layouts) don't mesh well with the internal implementation.

EDIT: I suppose a cheap "solution" would be have layout.entities["subject"] return a list of Entity objects, and then when you call .count() it would fail, and it's your responsibility to either iterate, or call layout.get_entities(scope='bids') to make sure you only get the base dataset's Entity.

Alternatively, we could make it so that when you access entities through the .entities property, rather than via get_entities(), you always get back only the base layout, and never any derivatives. That also seems pretty reasonable. Thoughts?

adelavega commented 4 years ago

I think just going off the base layout is a reasonable compromise.

Isn't this what we decided on for the __repr__ also?