Open ericearl opened 3 years ago
I'm not very familiar with phenotype data, but at first glance it seems like Option B makes the most sense. If we consider the participants.tsv
the "master key" of the dataset, then having participants in that file who have phenotypic, but not imaging, data seems completely reasonable. I'll have to review that section of the specification to be sure though.
Thank you @tsalo for the response!
For those who come to this Issue and aren't sure what phenotype/
TSV data is like, you can imagine it as form or survey responses where the first column is participant_id
and the other columns are the responses or results of a particular survey, or blood test, or questionnaire.
To @tsalo's point though: I don't want to require a participants.tsv
of every dataset (though that would be wonderful) because the standard now says participants.tsv
is a RECOMMENDED file. I think the implication should be that if you have gone through the effort to make phenotype/
TSV data that should imply these participants are part of the dataset whether or not imaging data are present. Perhaps a strong WARNING about a missing participants.tsv
file is in order when phenotype/
TSV data is prepared, but I don't feel it needs to be a requirement.
Thanks for raising this issue, I think we should solve it. Let me make an alternative proposal:
I thought that subject data (sub-xx
folders in the BIDS dataset) HAD to correspond to the entries in participants.tsv
(if present), that is: I thought the validator would raise an error if I had a row sub-99
in my participants.tsv
, but no corresponding sub-99
folder.
It turns out that (1) there is no validator error, and (2) the spec does not prescribe what I thought above (see Participants)
Having said that, I find it weird to have information about participants in participants.tsv
for whom there is no other data. My suggestion in case of conflict as described by Eric right now would be:
/phenotype/<measurement_tool>.tsv
called included
(or something else)included
has the standardized levels "yes"
and "no"
"yes"
as a value, then we MUST be able to find a corresponding sub-<label>
folder in the BIDS data"no"
, there MUST NOT be such a folderinclude
column is properly declared, the bids-validator does not raise an error anymore (as described in the original post here)I am not married to the name "include" or the levels "yes"/"no", but I like this principle, because it would also allow us to add this check for the optional participants.tsv
(albeit only with a "warning" level of severity, for backwards compatibility - to be triggered when participants.tsv
contains (1) sub-XX rows that are not present in the data, and (2) no include
column is present)
Finally, the include
column could get "custom" levels like no_reasonA
, where these levels can then be equipped with a proper description for why no inclusion happened in the corresponding JSON file.
Trying to wrap my head around this.
first I think that the validator should throw a warning if there is a participants.tsv
and that some sub-*
that have a folder are not listed in it.
then I like the idea of the list of participants in participants.tsv
to be a subset of those in phenotype.tsv
for phenotype.tsv
to track ALL participants whether they have their own sub-*
folder or are "just" a line in a TSV in the phenotype folder.
I think that this is option A.
Not a big fan of option D because it is too flexible and adds more columns to deal with and I have seen too many spreadsheets in my life.
first I think that the validator should throw a warning if there is a participants.tsv and that some sub-* that have a folder are not listed in it.
and how about the reverse case? That is, when participants.tsv got more rows than there are corresponding sub-xx folders? --> I'd warn for that as well (as described in my option D)
first I think that the validator should throw a warning if there is a participants.tsv and that some sub-* that have a folder are not listed in it.
and how about the reverse case? That is, when participants.tsv got more rows than there are corresponding sub-xx folders? --> I'd warn for that as well (as described in my option D)
agreed
So all of this could potentially be tackled in different "PRs":
Am I reading this right @ericearl ?
@sappelhoff and @Remi-Gau , thank you for the discussion on this issue. I will attempt to write the conclusion of this discussion so far and please chime in if I am incorrect in some way.
I think @Remi-Gau is saying to do three Pull Requests (PRs):
- one [PR] that is purely related to participants.tsv (update spec)
- one [PR] that is purely related to participants.tsv (update validator)
- one [PR] that is more related to phenotype
I don't think three PRs are necessary because the issue is inter-related between the BIDS dataset, the participants.tsv
file, and the phenotype/
folder TSV files. For these two PRs, I need some concrete clarification before I or someone else goes typing out a solution in a PR:
Do a PR for the BIDS Spec to replace...
the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset and
participants.tsv
file
... with...
the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset unioned with (inclusive or) the
participants.tsv
file
... when describing the phenotype/
TSV files' participant_id
columns.
Do a PR for the BIDS Validator to:
participants.tsv
; andparticipant_id
mismatch between the BIDS dataset, the participants.tsv
file, and any phenotype/
folder TSV file.Thanks!
Sorry, I'm having trouble assimilating all the text. Here's a truth table of my understanding.
Assuming that participants.tsv and a phenotype TSV exists:
Phenotype | participants.tsv |
Subject directory | Result |
---|---|---|---|
✅ | ❌ | ❌ | Error: Subject in phenotype TSV not in participants.tsv |
❌ | ✅ | ❌ | Error: Subject in participants.tsv has no associated data |
✅ | ✅ | ❌ | Warn: Subject in participants.tsv has no subject directory |
❌ | ❌ | ✅ | Error: Subject not listed in participants.tsv |
✅ | ❌ | ✅ | Error: Subject not listed in participants.tsv |
❌ | ✅ | ✅ | Okay |
✅ | ✅ | ✅ | Okay |
If participants.tsv doesn't exist:
Phenotype | Subject directory | Result |
---|---|---|
✅ | ❌ | Error: Subject in phenotype TSV not in participants.tsv and missing data directory |
❌ | ✅ | Okay |
✅ | ✅ | Okay |
@effigies This truth table perspective is perfectly timed and well curated for this conversation, thanks!
Phenotype | participants.tsv |
Subject directory | Result |
---|---|---|---|
✅ | ❌ | ✅ | Error: Subject not listed in participants.tsv |
I feel this one should be permitted as just a warning, but this is part of the remaining debates I think.
Phenotype | Subject directory | Result |
---|---|---|
✅ | ❌ | Error: Subject not listed in participants.tsv |
This is the other I am trying to say should not be an error, but instead be a warning.
At the risk of making this topic bigger than it needs to be... perhaps I could ask this question:
Why should a subject listed in any
phenotype/
TSV file be required to be present in either theparticipants.tsv
or the Subject directories?
I would argue the inclusion of the phenotype/
TSV data is better than strictly containing it within the Subject directories or participants.tsv
frameworks. To some degree the three (phenotype, participants, and subject directories) could be a three-circle Venn diagram set therefore with seven possible groups.
Okay, I think I've digested most of this thread. I would propose as a coherent, comprehensive rule:
1) If participants.tsv
exists, then the participant_label
entries are a superset of subject directories and participant labels found in phenotype/
.
I'm not sold on the "If participants.tsv
does not exist, anything goes" proposal in @ericearl's previous comment. This is less a backwards compatibility problem (we can always relax) and more of a "Why was it originally this way? Has that motivation changed?" problem.
If the fix to
Phenotype | Subject directory | Result |
---|---|---|
✅ | ❌ | Error: Subject not listed in participants.tsv |
Is just to add a participants.tsv
that is a superset, would that not be preferable?
Why should a subject listed in any
phenotype/
TSV file be required to be present in either theparticipants.tsv
or the Subject directories?
I guess this is the "Why was it originally this way?" question. It seems this was the original thread. And the idea was that phenotype/
was a way of modularizing participants.tsv
.
It seems to me that requiring a valid participants.tsv
to have phenotype/
data would be the simplest thing matching this original intent. Would that invalidate existing datasets though?
@effigies Thank you for the complete response. I feel you brought up three points above I will try to summarize here:
I love your comprehensive rule proposal. That satisfies my use case and need. So do I make a bids-specification PR and then a bids-validator PR from here?
And the language above has a typo: participant_label
should be written as participant_id
. I think you may be thinking of BIDS Apps where the argument is always supposed to be participant_label
when feeding an App a participant argument.
participants.tsv
does not exist, anything goes" proposalI agree to prefer the participants.tsv is the right way.
I could crawl existing datasets with phenotype data to answer that question, but this is less important than just getting the comprehensive superset rule in there for now.
So this is our resolution for now (right?):
If
participants.tsv
exists, then theparticipant_id
entries MUST be the union of all subject directories and allparticipant_id
entries found among phenotypic and assessment data in thephenotype/
folder.
Introduction
In this section about the modality agnostic phenotypic data, there is a sentence that I don't agree with (after preparing A LOT of phenotypic data):
In our case there are more subjects in the
phenotype/
TSV data who were screened out (but can still be shared) before MRI or MEG data collection than there are MEG and MRI participant data combined. And ourparticipants.tsv
file contained every subject across allphenotype/
TSV data so I would think perhaps from the above sentence that we could be in compliance aside from the "BIDS dataset andparticipants.tsv
file" part of that sentence.I discovered this behavior by way of running the validator, but I don't think my proposal is a particularly difficult one. I propose one of three changes. They are listed in my order of preference, so I would prefer Option A over the others.
Option A: Inclusive OR/Union
Change the sentence explicitly to accept a union (an inclusive OR) of both the BIDS dataset and the
participants.tsv
file:This way
phenotype/
TSV data can correspond to either subjects in theparticipants.tsv
or the BIDS dataset all-inclusive.Option B: Exclusive
participants.tsv
Change the sentence to accept only the
participants.tsv
and ignore the BIDS dataset:This is a little stricter to really require that if
phenotype/
TSV data is prepared, then a validparticipants.tsv
is prepared as well.Option C: Soften requirement
Change the sentence to make it a soft and optional thing rather than a hard and rigid rule:
I don't like this option because it is less prescriptive and could be more confusing to someone receiving a BIDS dataset, but it is an option. This would imply
phenotype/
TSV data could be asynchronous with the BIDS dataset andparticipants.tsv
, but the validator would only WARN about this being a problem instead of giving an ERROR.References
For more on this conversation, you can see my NeuroStars post about this. Thank you all for reading, your consideration, and the great work being done here!