Open yarikoptic opened 5 years ago
btw -- doesn't happen in 0.8!
$> python -c 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.0+51.g4da3e8d
[{'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]
as for the question on "what if I have suffix column" - 0.8 works correctly, in 0.7.1. gets overwritten with suffix="participants". in 0.7.0 also as good as 0.8. FWIW
$> python -c 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.1
[{'suffix': 'participants', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'suffix': 'participants', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]
$> cat participants.tsv
participant_id gender age handedness hearing_problems_current language suffix
sub-01 n/a 30-35 r n русский suf1
sub-03 f 20-25 r n english suf2
(dev) 1 21448.....................................:Fri 08 Feb 2019 05:33:30 PM EST:.
(git-annex)hopa:~/.tmp/datalad_temp_tree_test_within_ds_file_searchhwDbsv[master]
$> cat 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
cat: 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))': No such file or directory
(dev) 1 21450 ->1.....................................:Fri 08 Feb 2019 05:33:38 PM EST:.
(git-annex)hopa:~/.tmp/datalad_temp_tree_test_within_ds_file_searchhwDbsv[master]
$> python -c 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.0+51.g4da3e8d
[{'suffix': 'suf1', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9', 'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'}, {'suffix': 'suf2', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]
If it works in 0.8, can we close this? The internals are almost completely reworked in 0.8, so there's no point in patching this for 0.7.1—I don't want to actively support multiple parallel versions, pybids isn't Python ;)
well - since it is present in the released version, best to keep it open. I have added a label fixed-in-0.8
to mark such issues which you could later safely close with 0.8 release.
Re 0.8 not having it (yet) -- 0.8 is a branch which started before 0.7.1 which is on master. What is your plan for arranging the branches, I guess if 0.8 to come to master it would be a merge which would disregard all changes in its current shape?
I'm going to merge master into 0.8, but there will probably be some recent PRs that no longer make sense given the refactoring in 0.8.
Then issues present in master could resurface in 0.8 . Would be worthwhile to make tests for them
Oh, sorry, I didn't read the original report carefully. This is somewhere between bug and feature. What was previously happening is that we were restricting the entities tracked inside variables to just run, subject, session, and task. This was problematic because in some cases other information is needed. So the fix was to expand the set of tracked entities to pretty much everything defined in bids.json
. The unfortunate consequence of this is that you now get fields like suffix
even when they're no use to you.
We probably can't revert to the original behavior, and there are cases where it's clearly useful to know what the source suffix of a row was—e.g., for a run-level BIDSVariableCollection
, you can have rows from regressors
, events
, physio
, etc., and it seems worth retaining that information. @yarikoptic, is your objection to having suffix
show up at all, or to the fact that it's showing up for a file (participants.tsv
) that technically shouldn't have a suffix? The latter seems like a minor bug we can fix by changing the suffix
pattern to require a preceding _
. Hopefully that will sort the issue out. But if your objection is to the former, then maybe we should think about handling that via an optional argument that allows users to control which "extra" entities get returned. WDYT?
I should also note that making it so that participants.tsv
doesn't have a suffix is arguably inconsistent with the way we treat all other files... i.e., it seems reasonable to interpret the suffix as "the thing just before the extension", even if there's nothing else in the filename. One benefit of this convention is it allows you to retrieve participants.tsv
via get()
. While it's probably not ideal, I don't think there's actually any other way to do that at the moment (though that's probably a good reason to add a full-string search option).
Thinking out loud while looking at
$> python -c 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records"))'
0.7.1
[{u'suffix': 'participants', 'language': '\xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9',
'gender': nan, 'age': '30-35', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '01'},
{u'suffix': 'participants', 'language': 'english', 'gender': 'f', 'age': '20-25', 'handedness': 'r', 'hearing_problems_current': 'n', 'subject': '03'}]
For me, the suffix
is generally a "leftover" after all the key-value
pairs in the filename right before the suffix. So far it kinda meets your description of the intention. But if we dig deeper, that _suffix
is yet another metadata point, e.g. it used to be a "modality" (when talking about func/ or anat/). So it makes sense to place it into the record for that file. Looking back at participants.tsv, that is why suffix
is not really aligned the same way with it or other files which are not exactly the files following key-value_suffix.ext
convention (CHANGES
, README.md
and even dataset_description.json
) . Here, 'suffix': "participants"
isn't really a metadata point associated with a subject record. I kinda see that it might be useful to be available at some level (entire collection?) to possibly describe the origin of that record/collection, but it is not a metadata field to be added into the record itself. As I've mentioned before, it might even collide with a legit according to BIDS "suffix" column happen I had such one in participants.tsv
. So IMHO it better to not overgeneralize suffix
to be a "file suffix for any file as the part preceding possible _ in the filename", but retain at least some semantic to it (as being present only in the files following key-values_suffix.ext
convention where suffix
provides additional metadata).
On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables. Looking at that BIDSVariableCollection:
(git-annex)hopa:~/datalad/openneuro/ds000001[master]git-annex
$> head -n 2 participants.tsv
participant_id sex age
sub-01 F 26
$> python -c 'import bids as b; print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records")[0])' {'age': 26, 'sex': 'F', u'suffix': 'participants', 'subject': '01'}
$> python -c 'import bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0]).entities)'
{u'suffix': 'participants'}
$> python -c 'import bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0]).variables)'
{'age': <bids.variables.variables.SimpleVariable object at 0x7f1887dddd50>, 'sex': <bids.variables.variables.SimpleVariable object at 0x7f1887cdd450>}
so that record combines variables with entities values (what are those exactly?) and I cannot get what subject index belongs to:
$> python -c 'import bids as b; print(dir(b.BIDSLayout(".").get_collections(level="dataset")[0]))'
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_index_entities', 'clone', 'entities', 'from_df', 'level', 'match_variables', 'matches_entities', 'merge_variables', 'to_df', 'variables']
$> python -c 'import bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0])._index_entities)'
<bound method BIDSVariableCollection._index_entities of <bids.variables.kollekshuns.BIDSVariableCollection object at 0x7f56965fbad0>>
# So it is a method!
$> python -c 'import bids as b; print((b.BIDSLayout(".").get_collections(level="dataset")[0])._index_entities())'
None
The issue of column names conflicting is a separate one, IMO... In principle users could also put columns named subject
, run
, or session
in their tsv files, and I don't think the spec currently bars that. The way to deal with that is to have a general check for conflicts and raise a warning or error if one is found. But I don't think we need to handle suffix
as a special case.
I'm okay changing the suffix
pattern to force a _
to exist before the suffix. But it's conceivable this could break someone's existing workflow (though it seems unlikely). Let me think it through and make sure there aren't any other unanticipated consequences.
On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables.
A BIDSVariableCollection
maintains .variables
and .entities
separately (or actually, it might be the Variables
themselves that have the .entities
; I don't remember). It's only the to_df
call that mixes them, which seems to me like the right thing to do—if you're asking for a tabular summary of your entire collection, you should probably get as much as possible in there. Note that you can pass entities=False
if you don't want to include the entities as columns. (This doesn't show up in the docstrings because it's passed forward as kwargs
, but maybe it's important enough to make explicit.)
ok, let me take another approach, which relates to my "semantic argument".
I think that ideally PyBIDS should be as close to BIDS specification as possible in terms of the metadata fields it extracts/associates with the files. It should not add some additional metadata fields since that might conflict in the future with the fields which BIDS standardizes.
ATM "suffix" term is actually not clearly formalized in the specification, and if mentioned anywhere it is mentioned as the indicator of a contrast, or being physio
, or stim
:
$> git grep suffix
src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md:| NegativeContrast | OPTIONAL. Boolean (`true` or `false`) value specifying whether increasing voxel intensity (within sample voxels) denotes a decreased value with respect to the contrast suffix. This is commonly the case when Cerebral Blood Volume is estimated via usage of a contrast agent in conjunction with a T2\* weighted acquisition protocol. |
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:suffix. For example: `sub-control01_task-nback_run-1`. If the same continuous
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:suffix, and signals related to the stimulus should use `_stim` suffix. For
src/04-modality-specific-files/04-physiological-and-other-continous-recordings.md:`_physio` suffix.
src/CHANGES.md:- Added recommendation to use \_physio suffix for continuous recordings
src/pregh-changes.md:- Added recommendation to use \_physio suffix for continuous recordings
It is not mentioned for any other auxiliary file such as participants.tsv, README, etc. So it would be reasonable to have suffix
only for those data files where the term suffix is used in BIDS specification.
Re conflicts:
In principle users could also put columns named
subject
,run
, orsession
in their tsv files, and I don't think the spec currently bars that.
Thanks for mentioning subject
. I think that is one of the rush-caused deficiencies in BIDS - use of both "participant" and "subject". I have raised voice in the google docs times but I think it was "too late" to harmonize, so we are stuck with participants.tsv
and sub-
, and the specification using both terms interchangeably and it is somewhat hardcoded in the BIDS users minds. So now participants.tsv
has participant_id
already and adding subject
column although possible, like adding any other column with identical "semantically" meaning would unlikely to be done. run
and session
also have clear semantics in BIDS, so unlikely to be placed into participants.tsv
although, I must say, session
column might even make sense in some cases where dataset was not collected/prepared as multisession, BUT researcher might decide to associate some session with any particular subject (e.g. as an index of a scanning session participant ever had).
So, overall, I maintain that suffix
field where there is no suffix
semantic in the BIDS spec for the file, or as explicit specification in particular dataset or participants.tsv, shouldn't happen
I'm okay changing the
suffix
pattern to force a_
to exist before the suffix.
What about a bit more evolved matching to assure that suffix
is prefixed with at least a single key-value_
pair?
In [24]: samples = ['participants.tsv', 'dataset_description.json', 'task-blah_bo
...: ld.json', 'sub-01/func/sub-01_task-t1-01_bold.nii.gz']
In [25]: r = re.compile('^(\S+/)?(?:\S+-\S+_)+(?P<suffix>[^\s\.]+)\..*'); res = [
...: r.match(s) for s in samples]; print(res)
[None, None, <_sre.SRE_Match object at 0x7f904ac37ad0>, <_sre.SRE_Match object at 0x7f904ac378b0>]
On a possibly related note, may be just to familiarize myself more with pybids semantic of entities/variables.
A
BIDSVariableCollection
maintains.variables
and.entities
separately (or actually, it might be theVariables
themselves that have the.entities
; I don't remember). It's only theto_df
call that mixes them, which seems to me like the right thing to do—if you're asking for a tabular summary of your entire collection, you should probably get as much as possible in there.
sorry I was not clear in formulating an actual question -- where does to_df
gather subject
field which I cannot find among entities or variables?
I think that ideally PyBIDS should be as close to BIDS specification as possible in terms of the metadata fields it extracts/associates with the files. It should not add some additional metadata fields since that might conflict in the future with the fields which BIDS standardizes.
Not sure I share this philosophy. The spec doesn't explicitly mention a lot of things that are either essential or useful to actually support functionality most users will want. E.g., there's no formal definition of "extension" anywhere, AFAICT, but it would be hard to search properly for files without passing that information in. The same is true of suffix. If we were to get rid of it, we would just have to introduce it under another guise—e.g., as the old type
field. It's true that the spec doesn't explicitly talk about "suffix" in many places, but it's implicitly doing so almost everywhere, because the thing that identifies the kind of file being described is almost invariably the string just before the extension.
I have raised voice in the google docs times but I think it was "too late" to harmonize, so we are stuck with
participants.tsv
andsub-
So, overall, I maintain that
suffix
field where there is nosuffix
semantic in the BIDS spec for the file, or as explicit specification in particular dataset or participants.tsv, shouldn't happen
I think we're going to have to disagree on this one. I'm on board with not treating participants
in participants.tsv
as a suffix, but there are a lot of scenarios in which it's very useful to know what type of file a row in a design matrix came from. Since it can be easily suppressed, I don't see a problem with it. I agree that we need a mechanism for dealing with conflicts, but that seems easily doable. A reasonable approach, for example, is to always privilege user-supplied columns over internally generated ones, but issue a warning (and also maybe to have a package-level setting that lets the user change this behavior). If the user supplies a subject
or session
column, which will happen at some point, I'm not comfortable vetoing that unless the spec explicitly says you can't do that—which it currently doesn't.
What about a bit more evolved matching to assure that
suffix
is prefixed with at least a singlekey-value_
pair?
Fine with me, though I don't think it's necessary, as something like study2_participants.tsv
would fail the validator anyway (and if the user is explicitly turning validation off, then arguably their intention is to treat participants
like a suffix in much the way we currently do).
sorry I was not clear in formulating an actual question -- where does
to_df
gathersubject
field which I cannot find among entities or variables?
Entity information is stored in each Variable
. When you call to_df
on a collection, that implicitly calls to_df
on the variable, and at that point it (optionally) extracts the entity info and adds it to the DF. The row-level entity information is stored in the .index
property. There's also an .entities
property, which stores only those values that are common across all rows. I.e., if you have a run-level variable, .entities
will contain things like run
, subject
, and session
(which don't vary within run), whereas .index
will additionally contain things like suffix
.
E.g., there's no formal definition of "extension" anywhere,
do you mean a "file extension"? it is there and typically mandated to be a specific one (.nii.gz, .tsv. json). And the word extension is mentioned a good number of times.
This discussion gives me an idea that may be we should propose a glossary for BIDS where we define all those terms?
The same is true of suffix. If we were to get rid of it, we would just have to introduce it under another guise—e.g., as the old
type
field.
hold on -- I am not trying to get rid of it entirely! Just from the files where there is no suffix
anyhow associated with, e.g. participants.tsv
. Moreover -- what if later someone decides to introduce word prefix
? Would participants.tsv
have both suffix and prefix == participants
, then?
It's true that the spec doesn't explicitly talk about "suffix" in many places, but it's implicitly doing so almost everywhere, because the thing that identifies the kind of file being described is almost invariably the string just before the extension.
It is not, as an argument with prefix
above exemplifies, and I do not find any implicit statement that suffix of participants.tsv
is "participants". IMHO, as per a grep a few comments back, suffix where mentioned has some clear semantic, indeed close to type
, and we have a syntax to separate it out (in those files it is after the final _
where there is no key-value
any more) and applicable only to some files, not an every file.
Relevantly, there is a comment in the bids-standard/bids-specification#109 about adding a suffix table.
hold on -- I am not trying to get rid of it entirely! Just from the files where there is no
suffix
anyhow associated with, e.g.participants.tsv
Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding _
and we can move on. :p
it's very useful to know what type of file a row in a design matrix came from.
yes -- it would be very useful! But I think for that is more of a property of a metadata record/field than metadata itself.
Moreover, different fields might come from different files. May be the records could have an auxiliary data structure pointing to the origins of fields values, e.g. .metadata_origins = {'subject': 'participants.tsv', 'age': 'participants.tsv', 'TR': 'task-theonlyone_bold.json', ...}
? Here it might even become possible to record the effect of inheritance to depict which file actually provided the value which was overridden through inheritance principle. Having a single "suffix" then would not suffice.
hold on -- I am not trying to get rid of it entirely! Just from the files where there is no
suffix
anyhow associated with, e.g.participants.tsv
Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding
_
and we can move on. :p
ah, ok - sorry, carry on, I will stop ;)
hold on -- I am not trying to get rid of it entirely! Just from the files where there is no
suffix
anyhow associated with, e.g.participants.tsv
Dude I already agreed with on you this like 6 comments back. If that's all that's at issue, let's just agree to change the pattern for suffixes to require a preceding
_
and we can move on. :p
you have tricked me!!!
(git-annex)hopa:~/datalad/openfmri/ds000001[master]git
$> python -c 'import bids as b; print(b.__version__); print(b.BIDSLayout(".").get_collections(level="dataset")[0].to_df().to_dict(orient="records")[0])'
0.8.0
{'age': 26, 'sex': 'F', u'suffix': 'participants', 'subject': '01'}
II didn't say it was in 0.8, just that I'm okay making that change. I tried it out locally and it broke some things I'll need to fix, and I didn't want to stall the release for it. It will happen soon (and sooner still if you submit a PR). ;)
didn't check what would happen if I add a
suffix
column toparticipants.tsv
but imho smells like a regression which causedsuffix
to be added where it shouldn't be... please confirm or clarify:now with 0.7.1 -- see "suffix" to appear: