Change suffix pattern to avoid matching things like "participants" in "participants.tsv"

bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.

https://bids-standard.github.io/pybids/

MIT License

221 stars 122 forks source link

Change suffix pattern to avoid matching things like "participants" in "participants.tsv" #380

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

didn't check what would happen if I add a suffix column to participants.tsv but imho smells like a regression which caused suffix to be added where it shouldn't be... please confirm or clarify:

$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'   
0.7.0
[{'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

now with 0.7.1 -- see "suffix" to appear:

$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.1
[{u'suffix': 'participants', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {u'suffix': 'participants', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

```shell
$> cat participants.tsv 
participant_id  gender  age handedness  hearing_problems_current    language
sub-01  n/a 30-35   r   n   русский
sub-03  f   20-25   r   n   english

yarikoptic commented 5 years ago

btw -- doesn't happen in 0.8!

$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'                                                   
0.7.0+51.g4da3e8d                                                                                    
[{'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

yarikoptic commented 5 years ago

as for the question on "what if I have suffix column" - 0.8 works correctly, in 0.7.1. gets overwritten with suffix="participants". in 0.7.0 also as good as 0.8. FWIW

$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.1
[{'suffix': 'participants', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'suffix': 'participants', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

$> cat participants.tsv 
participant_id  gender  age handedness  hearing_problems_current    language    suffix
sub-01  n/a 30-35   r   n   русский suf1
sub-03  f   20-25   r   n   english suf2
(dev) 1 21448.....................................:Fri 08 Feb 2019 05:33:30 PM EST:.
(git-annex)hopa:~/.tmp/datalad_temp_tree_test_within_ds_file_searchhwDbsv[master]
$> cat 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
cat: 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))': No such file or directory
(dev) 1 21450 ->1.....................................:Fri 08 Feb 2019 05:33:38 PM EST:.
(git-annex)hopa:~/.tmp/datalad_temp_tree_test_within_ds_file_searchhwDbsv[master]
$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.0+51.g4da3e8d
[{'suffix': 'suf1', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'suffix': 'suf2', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

tyarkoni commented 5 years ago

If it works in 0.8, can we close this? The internals are almost completely reworked in 0.8, so there's no point in patching this for 0.7.1—I don't want to actively support multiple parallel versions, pybids isn't Python ;)

yarikoptic commented 5 years ago

well - since it is present in the released version, best to keep it open. I have added a label fixed-in-0.8 to mark such issues which you could later safely close with 0.8 release.

Re 0.8 not having it (yet) -- 0.8 is a branch which started before 0.7.1 which is on master. What is your plan for arranging the branches, I guess if 0.8 to come to master it would be a merge which would disregard all changes in its current shape?

tyarkoni commented 5 years ago

I'm going to merge master into 0.8, but there will probably be some recent PRs that no longer make sense given the refactoring in 0.8.

yarikoptic commented 5 years ago

Then issues present in master could resurface in 0.8 . Would be worthwhile to make tests for them

tyarkoni commented 5 years ago

Oh, sorry, I didn't read the original report carefully. This is somewhere between bug and feature. What was previously happening is that we were restricting the entities tracked inside variables to just run, subject, session, and task. This was problematic because in some cases other information is needed. So the fix was to expand the set of tracked entities to pretty much everything defined in bids.json. The unfortunate consequence of this is that you now get fields like suffix even when they're no use to you.

We probably can't revert to the original behavior, and there are cases where it's clearly useful to know what the source suffix of a row was—e.g., for a run-level BIDSVariableCollection, you can have rows from regressors, events, physio, etc., and it seems worth retaining that information. @yarikoptic, is your objection to having suffix show up at all, or to the fact that it's showing up for a file (participants.tsv) that technically shouldn't have a suffix? The latter seems like a minor bug we can fix by changing the suffix pattern to require a preceding _. Hopefully that will sort the issue out. But if your objection is to the former, then maybe we should think about handling that via an optional argument that allows users to control which "extra" entities get returned. WDYT?

tyarkoni commented 5 years ago

I should also note that making it so that participants.tsv doesn't have a suffix is arguably inconsistent with the way we treat all other files... i.e., it seems reasonable to interpret the suffix as "the thing just before the extension", even if there's nothing else in the filename. One benefit of this convention is it allows you to retrieve participants.tsv via get(). While it's probably not ideal, I don't think there's actually any other way to do that at the moment (though that's probably a good reason to add a full-string search option).

yarikoptic commented 5 years ago

Thinking out loud while looking at

$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.1
[{u'suffix': 'participants', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 
'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'},
 {u'suffix': 'participants', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]

For me, the suffix is generally a "leftover" after all the key-value pairs in the filename right before the suffix. So far it kinda meets your description of the intention. But if we dig deeper, that _suffix is yet another metadata point, e.g. it used to be a "modality" (when talking about func/ or anat/). So it makes sense to place it into the record for that file. Looking back at participants.tsv, that is why suffix is not really aligned the same way with it or other files which are not exactly the files following key-value_suffix.ext convention (CHANGES, README.md and even dataset_description.json) . Here, 'suffix': "participants" isn't really a metadata point associated with a subject record. I kinda see that it might be useful to be available at some level (entire collection?) to possibly describe the origin of that record/collection, but it is not a metadata field to be added into the record itself. As I've mentioned before, it might even collide with a legit according to BIDS "suffix" column happen I had such one in participants.tsv. So IMHO it better to not overgeneralize suffix to be a "file suffix for any file as the part preceding possible _ in the filename", but retain at least some semantic to it (as being present only in the files following key-values_suffix.ext convention where suffix provides additional metadata).

On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables. Looking at that BIDSVariableCollection:

(git-annex)hopa:~/datalad/openneuro/ds000001[master]git-annex
$> head -n 2 participants.tsv
participant_id  sex age
sub-01  F   26

$> python -c 'import  bids as b; print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records")[0])'                                                                                                                         {'age': 26, 'sex': 'F', u'suffix': 'participants', 'subject': '01'}                             

$> python -c 'import  bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0]).entities)'
{u'suffix': 'participants'}                                                                   

$> python -c 'import  bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0]).variables)'
{'age': <bids.variables.variables.SimpleVariable object at 0x7f1887dddd50>, 'sex': <bids.variables.variables.SimpleVariable object at 0x7f1887cdd450>}

so that record combines variables with entities values (what are those exactly?) and I cannot get what subject index belongs to:

$> python -c 'import  bids as b; print(dir(b.BIDSLayout(".").get_collections(level="dataset")[0]))'            
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_index_entities', 'clone', 'entities', 'from_df', 'level', 'match_variables', 'matches_entities', 'merge_variables', 'to_df', 'variables']

$> python -c 'import  bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0])._index_entities)'
<bound method BIDSVariableCollection._index_entities of <bids.variables.kollekshuns.BIDSVariableCollection object at 0x7f56965fbad0>>

# So it is a method!
$> python -c 'import  bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0])._index_entities())'
None

tyarkoni commented 5 years ago

The issue of column names conflicting is a separate one, IMO... In principle users could also put columns named subject, run, or session in their tsv files, and I don't think the spec currently bars that. The way to deal with that is to have a general check for conflicts and raise a warning or error if one is found. But I don't think we need to handle suffix as a special case.

I'm okay changing the suffix pattern to force a _ to exist before the suffix. But it's conceivable this could break someone's existing workflow (though it seems unlikely). Let me think it through and make sure there aren't any other unanticipated consequences.

On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables.

A BIDSVariableCollection maintains .variables and .entities separately (or actually, it might be the Variables themselves that have the .entities; I don't remember). It's only the to_df call that mixes them, which seems to me like the right thing to do—if you're asking for a tabular summary of your entire collection, you should probably get as much as possible in there. Note that you can pass entities=False if you don't want to include the entities as columns. (This doesn't show up in the docstrings because it's passed forward as kwargs, but maybe it's important enough to make explicit.)

yarikoptic commented 5 years ago

ok, let me take another approach, which relates to my "semantic argument".

I think that ideally PyBIDS should be as close to BIDS specification as possible in terms of the metadata fields it extracts/associates with the files. It should not add some additional metadata fields since that might conflict in the future with the fields which BIDS standardizes. ATM "suffix" term is actually not clearly formalized in the specification, and if mentioned anywhere it is mentioned as the indicator of a contrast, or being physio, or stim:

$> git grep suffix
src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md:| NegativeContrast            | OPTIONAL. Boolean (`true` or `false`) value specifying whether increasing voxel intensity (within sample voxels) denotes a decreased value with respect to the contrast suffix. This is commonly the case when Cerebral Blood Volume is estimated via usage of a contrast agent in conjunction with a T2\* weighted acquisition protocol.      |
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:suffix. For example: `sub-control01_task-nback_run-1`. If the same continuous
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:suffix, and signals related to the stimulus should use `_stim` suffix. For
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:`_physio` suffix.
src/CHANGES.md:-   Added recommendation to use \_physio suffix for continuous recordings
src/pregh-changes.md:-   Added recommendation to use \_physio suffix for continuous recordings

It is not mentioned for any other auxiliary file such as participants.tsv, README, etc. So it would be reasonable to have suffix only for those data files where the term suffix is used in BIDS specification.

Re conflicts:

In principle users could also put columns named subject, run, or session in their tsv files, and I don't think the spec currently bars that.

Thanks for mentioning subject. I think that is one of the rush-caused deficiencies in BIDS - use of both "participant" and "subject". I have raised voice in the google docs times but I think it was "too late" to harmonize, so we are stuck with participants.tsv and sub-, and the specification using both terms interchangeably and it is somewhat hardcoded in the BIDS users minds. So now participants.tsv has participant_id already and adding subject column although possible, like adding any other column with identical "semantically" meaning would unlikely to be done. run and session also have clear semantics in BIDS, so unlikely to be placed into participants.tsv although, I must say, session column might even make sense in some cases where dataset was not collected/prepared as multisession, BUT researcher might decide to associate some session with any particular subject (e.g. as an index of a scanning session participant ever had).

So, overall, I maintain that suffix field where there is no suffix semantic in the BIDS spec for the file, or as explicit specification in particular dataset or participants.tsv, shouldn't happen

I'm okay changing the suffix pattern to force a _ to exist before the suffix.

What about a bit more evolved matching to assure that suffix is prefixed with at least a single key-value_ pair?

In [24]: samples = ['participants.tsv', 'dataset_description.json', 'task-blah_bo
    ...: ld.json', 'sub-01/func/sub-01_task-t1-01_bold.nii.gz']

In [25]: r = re.compile('^(\S+/)?(?:\S+-\S+_)+(?P<suffix>[^\s\.]+)\..*'); res = [
    ...: r.match(s) for s in samples]; print(res)
[None, None, <_sre.SRE_Match object at 0x7f904ac37ad0>, <_sre.SRE_Match object at 0x7f904ac378b0>]

yarikoptic commented 5 years ago

On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables.

A BIDSVariableCollection maintains .variables and .entities separately (or actually, it might be the Variables themselves that have the .entities; I don't remember). It's only the to_df call that mixes them, which seems to me like the right thing to do—if you're asking for a tabular summary of your entire collection, you should probably get as much as possible in there.

sorry I was not clear in formulating an actual question -- where does to_df gather subject field which I cannot find among entities or variables?

tyarkoni commented 5 years ago

I think that ideally PyBIDS should be as close to BIDS specification as possible in terms of the metadata fields it extracts/associates with the files. It should not add some additional metadata fields since that might conflict in the future with the fields which BIDS standardizes.

Not sure I share this philosophy. The spec doesn't explicitly mention a lot of things that are either essential or useful to actually support functionality most users will want. E.g., there's no formal definition of "extension" anywhere, AFAICT, but it would be hard to search properly for files without passing that information in. The same is true of suffix. If we were to get rid of it, we would just have to introduce it under another guise—e.g., as the old type field. It's true that the spec doesn't explicitly talk about "suffix" in many places, but it's implicitly doing so almost everywhere, because the thing that identifies the kind of file being described is almost invariably the string just before the extension.

I have raised voice in the google docs times but I think it was "too late" to harmonize, so we are stuck with participants.tsv and sub-

So, overall, I maintain that suffix field where there is no suffix semantic in the BIDS spec for the file, or as explicit specification in particular dataset or participants.tsv, shouldn't happen

I think we're going to have to disagree on this one. I'm on board with not treating participants in participants.tsv as a suffix, but there are a lot of scenarios in which it's very useful to know what type of file a row in a design matrix came from. Since it can be easily suppressed, I don't see a problem with it. I agree that we need a mechanism for dealing with conflicts, but that seems easily doable. A reasonable approach, for example, is to always privilege user-supplied columns over internally generated ones, but issue a warning (and also maybe to have a package-level setting that lets the user change this behavior). If the user supplies a subject or session column, which will happen at some point, I'm not comfortable vetoing that unless the spec explicitly says you can't do that—which it currently doesn't.

What about a bit more evolved matching to assure that suffix is prefixed with at least a single key-value_ pair?

Fine with me, though I don't think it's necessary, as something like study2_participants.tsv would fail the validator anyway (and if the user is explicitly turning validation off, then arguably their intention is to treat participants like a suffix in much the way we currently do).

sorry I was not clear in formulating an actual question -- where does to_df gather subject field which I cannot find among entities or variables?

Entity information is stored in each Variable. When you call to_df on a collection, that implicitly calls to_df on the variable, and at that point it (optionally) extracts the entity info and adds it to the DF. The row-level entity information is stored in the .index property. There's also an .entities property, which stores only those values that are common across all rows. I.e., if you have a run-level variable, .entities will contain things like run, subject, and session (which don't vary within run), whereas .index will additionally contain things like suffix.

yarikoptic commented 5 years ago

E.g., there's no formal definition of "extension" anywhere,

do you mean a "file extension"? it is there and typically mandated to be a specific one (.nii.gz, .tsv. json). And the word extension is mentioned a good number of times.

This discussion gives me an idea that may be we should propose a glossary for BIDS where we define all those terms?

yarikoptic commented 5 years ago

The same is true of suffix. If we were to get rid of it, we would just have to introduce it under another guise—e.g., as the old type field.

hold on -- I am not trying to get rid of it entirely! Just from the files where there is no suffix anyhow associated with, e.g. participants.tsv. Moreover -- what if later someone decides to introduce word prefix? Would participants.tsv have both suffix and prefix == participants, then?

It's true that the spec doesn't explicitly talk about "suffix" in many places, but it's implicitly doing so almost everywhere, because the thing that identifies the kind of file being described is almost invariably the string just before the extension.

It is not, as an argument with prefix above exemplifies, and I do not find any implicit statement that suffix of participants.tsv is "participants". IMHO, as per a grep a few comments back, suffix where mentioned has some clear semantic, indeed close to type, and we have a syntax to separate it out (in those files it is after the final _ where there is no key-value any more) and applicable only to some files, not an every file.

effigies commented 5 years ago

Relevantly, there is a comment in the bids-standard/bids-specification#109 about adding a suffix table.

tyarkoni commented 5 years ago

hold on -- I am not trying to get rid of it entirely! Just from the files where there is no suffix anyhow associated with, e.g. participants.tsv

Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding _ and we can move on. :p

yarikoptic commented 5 years ago

it's very useful to know what type of file a row in a design matrix came from.

yes -- it would be very useful! But I think for that is more of a property of a metadata record/field than metadata itself. Moreover, different fields might come from different files. May be the records could have an auxiliary data structure pointing to the origins of fields values, e.g. .metadata_origins = {'subject': 'participants.tsv', 'age': 'participants.tsv', 'TR': 'task-theonlyone_bold.json', ...}? Here it might even become possible to record the effect of inheritance to depict which file actually provided the value which was overridden through inheritance principle. Having a single "suffix" then would not suffice.

yarikoptic commented 5 years ago

hold on -- I am not trying to get rid of it entirely! Just from the files where there is no suffix anyhow associated with, e.g. participants.tsv

Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding _ and we can move on. :p

ah, ok - sorry, carry on, I will stop ;)

yarikoptic commented 5 years ago

hold on -- I am not trying to get rid of it entirely! Just from the files where there is no suffix anyhow associated with, e.g. participants.tsv

Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding _ and we can move on. :p

you have tricked me!!!

(git-annex)hopa:~/datalad/openfmri/ds000001[master]git
$> python -c 'import  bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records")[0])'
0.8.0
{'age': 26, 'sex': 'F', u'suffix': 'participants', 'subject': '01'}

tyarkoni commented 5 years ago

II didn't say it was in 0.8, just that I'm okay making that change. I tried it out locally and it broke some things I'll need to fix, and I didn't want to stall the release for it. It will happen soon (and sooner still if you submit a PR). ;)