bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification

https://bids-specification.readthedocs.io/

Creative Commons Attribution 4.0 International

281 stars 165 forks source link

Extra participants in phenotype TSVs, but not in data set folders should not be an ERROR in validation #914

Open ericearl opened 3 years ago

ericearl commented 3 years ago

Introduction

In this section about the modality agnostic phenotypic data, there is a sentence that I don't agree with (after preparing A LOT of phenotypic data):

... the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset and participants.tsv file.

In our case there are more subjects in the phenotype/ TSV data who were screened out (but can still be shared) before MRI or MEG data collection than there are MEG and MRI participant data combined. And our participants.tsv file contained every subject across all phenotype/ TSV data so I would think perhaps from the above sentence that we could be in compliance aside from the "BIDS dataset and participants.tsv file" part of that sentence.

I discovered this behavior by way of running the validator, but I don't think my proposal is a particularly difficult one. I propose one of three changes. They are listed in my order of preference, so I would prefer Option A over the others.

Option A: Inclusive OR/Union

Change the sentence explicitly to accept a union (an inclusive OR) of both the BIDS dataset and the participants.tsv file:

... the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset unioned with (inclusive or) the participants.tsv file.

This way phenotype/ TSV data can correspond to either subjects in the participants.tsv or the BIDS dataset all-inclusive.

Option B: Exclusive `participants.tsv`

Change the sentence to accept only the participants.tsv and ignore the BIDS dataset:

... the entries of [the participant_id] column MUST correspond to the subjects in the ~~BIDS dataset and~~ participants.tsv file.

This is a little stricter to really require that if phenotype/ TSV data is prepared, then a valid participants.tsv is prepared as well.

Option C: Soften requirement

Change the sentence to make it a soft and optional thing rather than a hard and rigid rule:

... the entries of [the participant_id] column SHOULD correspond to the subjects in the BIDS dataset and participants.tsv file.

I don't like this option because it is less prescriptive and could be more confusing to someone receiving a BIDS dataset, but it is an option. This would imply phenotype/ TSV data could be asynchronous with the BIDS dataset and participants.tsv, but the validator would only WARN about this being a problem instead of giving an ERROR.

References

For more on this conversation, you can see my NeuroStars post about this. Thank you all for reading, your consideration, and the great work being done here!

tsalo commented 3 years ago

I'm not very familiar with phenotype data, but at first glance it seems like Option B makes the most sense. If we consider the participants.tsv the "master key" of the dataset, then having participants in that file who have phenotypic, but not imaging, data seems completely reasonable. I'll have to review that section of the specification to be sure though.

ericearl commented 3 years ago

Thank you @tsalo for the response!

For those who come to this Issue and aren't sure what phenotype/ TSV data is like, you can imagine it as form or survey responses where the first column is participant_id and the other columns are the responses or results of a particular survey, or blood test, or questionnaire.

To @tsalo's point though: I don't want to require a participants.tsv of every dataset (though that would be wonderful) because the standard now says participants.tsv is a RECOMMENDED file. I think the implication should be that if you have gone through the effort to make phenotype/ TSV data that should imply these participants are part of the dataset whether or not imaging data are present. Perhaps a strong WARNING about a missing participants.tsv file is in order when phenotype/ TSV data is prepared, but I don't feel it needs to be a requirement.

sappelhoff commented 3 years ago

Thanks for raising this issue, I think we should solve it. Let me make an alternative proposal:

Option D: accept non-included data when made explicit

I thought that subject data (sub-xx folders in the BIDS dataset) HAD to correspond to the entries in participants.tsv (if present), that is: I thought the validator would raise an error if I had a row sub-99 in my participants.tsv, but no corresponding sub-99 folder.

It turns out that (1) there is no validator error, and (2) the spec does not prescribe what I thought above (see Participants)

Having said that, I find it weird to have information about participants in participants.tsv for whom there is no other data. My suggestion in case of conflict as described by Eric right now would be:

require some new, standardized column in /phenotype/<measurement_tool>.tsv called included (or something else)
included has the standardized levels "yes" and "no"
If a participant/subject row has "yes" as a value, then we MUST be able to find a corresponding sub-<label> folder in the BIDS data
If there is a "no", there MUST NOT be such a folder
If an include column is properly declared, the bids-validator does not raise an error anymore (as described in the original post here)

I am not married to the name "include" or the levels "yes"/"no", but I like this principle, because it would also allow us to add this check for the optional participants.tsv (albeit only with a "warning" level of severity, for backwards compatibility - to be triggered when participants.tsv contains (1) sub-XX rows that are not present in the data, and (2) no include column is present)

Finally, the include column could get "custom" levels like no_reasonA, where these levels can then be equipped with a proper description for why no inclusion happened in the corresponding JSON file.

Remi-Gau commented 3 years ago

Trying to wrap my head around this.

first I think that the validator should throw a warning if there is a participants.tsv and that some sub-* that have a folder are not listed in it.
then I like the idea of the list of participants in participants.tsv to be a subset of those in phenotype.tsv for phenotype.tsv to track ALL participants whether they have their own sub-* folder or are "just" a line in a TSV in the phenotype folder.

I think that this is option A.

Not a big fan of option D because it is too flexible and adds more columns to deal with and I have seen too many spreadsheets in my life.

sappelhoff commented 3 years ago

first I think that the validator should throw a warning if there is a participants.tsv and that some sub-* that have a folder are not listed in it.

and how about the reverse case? That is, when participants.tsv got more rows than there are corresponding sub-xx folders? --> I'd warn for that as well (as described in my option D)

Remi-Gau commented 3 years ago

first I think that the validator should throw a warning if there is a participants.tsv and that some sub-* that have a folder are not listed in it.

and how about the reverse case? That is, when participants.tsv got more rows than there are corresponding sub-xx folders? --> I'd warn for that as well (as described in my option D)

agreed

Remi-Gau commented 3 years ago

So all of this could potentially be tackled in different "PRs":

one that is purely related to participants.tsv (update spec and validator)
one that is more related to phenotype

Am I reading this right @ericearl ?

ericearl commented 3 years ago

@sappelhoff and @Remi-Gau , thank you for the discussion on this issue. I will attempt to write the conclusion of this discussion so far and please chime in if I am incorrect in some way.

I think @Remi-Gau is saying to do three Pull Requests (PRs):

one [PR] that is purely related to participants.tsv (update spec)

one [PR] that is purely related to participants.tsv (update validator)

one [PR] that is more related to phenotype

I don't think three PRs are necessary because the issue is inter-related between the BIDS dataset, the participants.tsv file, and the phenotype/ folder TSV files. For these two PRs, I need some concrete clarification before I or someone else goes typing out a solution in a PR:

1. BIDS Spec PR

Do a PR for the BIDS Spec to replace...

the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset and participants.tsv file

... with...

the entries of [the participant_id] column MUST correspond to the subjects in the BIDS dataset unioned with (inclusive or) the participants.tsv file

... when describing the phenotype/ TSV files' participant_id columns.

2. BIDS Validator PR

Do a PR for the BIDS Validator to:

Replace that intersection logic with union logic between the BIDS dataset and participants.tsv ; and
Use a WARNING instead of an ERROR in cases of participant_id mismatch between the BIDS dataset, the participants.tsv file, and any phenotype/ folder TSV file.

Thanks!

effigies commented 3 years ago

Sorry, I'm having trouble assimilating all the text. Here's a truth table of my understanding.

Assuming that participants.tsv and a phenotype TSV exists:

Phenotype	`participants.tsv`	Subject directory	Result
✅	❌	❌	Error: Subject in phenotype TSV not in participants.tsv
❌	✅	❌	Error: Subject in participants.tsv has no associated data
✅	✅	❌	Warn: Subject in participants.tsv has no subject directory
❌	❌	✅	Error: Subject not listed in participants.tsv
✅	❌	✅	Error: Subject not listed in participants.tsv
❌	✅	✅	Okay
✅	✅	✅	Okay

If participants.tsv doesn't exist:

Phenotype	Subject directory	Result
✅	❌	Error: Subject in phenotype TSV not in participants.tsv and missing data directory
❌	✅	Okay
✅	✅	Okay

ericearl commented 3 years ago

@effigies This truth table perspective is perfectly timed and well curated for this conversation, thanks!

On the 5th row of table 1

Phenotype	`participants.tsv`	Subject directory	Result
✅	❌	✅	Error: Subject not listed in `participants.tsv`

I feel this one should be permitted as just a warning, but this is part of the remaining debates I think.

On the 1st row of table 2

Phenotype	Subject directory	Result
✅	❌	Error: Subject not listed in `participants.tsv`

This is the other I am trying to say should not be an error, but instead be a warning.

At the risk of making this topic bigger than it needs to be... perhaps I could ask this question:

Why should a subject listed in any phenotype/ TSV file be required to be present in either the participants.tsv or the Subject directories?

I would argue the inclusion of the phenotype/ TSV data is better than strictly containing it within the Subject directories or participants.tsv frameworks. To some degree the three (phenotype, participants, and subject directories) could be a three-circle Venn diagram set therefore with seven possible groups.

3-circle Venn Diagram

effigies commented 2 years ago

Okay, I think I've digested most of this thread. I would propose as a coherent, comprehensive rule:

1) If participants.tsv exists, then the participant_label entries are a superset of subject directories and participant labels found in phenotype/.

I'm not sold on the "If participants.tsv does not exist, anything goes" proposal in @ericearl's previous comment. This is less a backwards compatibility problem (we can always relax) and more of a "Why was it originally this way? Has that motivation changed?" problem.

If the fix to

Phenotype	Subject directory	Result
✅	❌	Error: Subject not listed in `participants.tsv`

Is just to add a participants.tsv that is a superset, would that not be preferable?

Why should a subject listed in any phenotype/ TSV file be required to be present in either the participants.tsv or the Subject directories?

I guess this is the "Why was it originally this way?" question. It seems this was the original thread. And the idea was that phenotype/ was a way of modularizing participants.tsv.

It seems to me that requiring a valid participants.tsv to have phenotype/ data would be the simplest thing matching this original intent. Would that invalidate existing datasets though?

ericearl commented 2 years ago

@effigies Thank you for the complete response. I feel you brought up three points above I will try to summarize here:

1. The comprehensive superset rule

I love your comprehensive rule proposal. That satisfies my use case and need. So do I make a bids-specification PR and then a bids-validator PR from here?

And the language above has a typo: participant_label should be written as participant_id. I think you may be thinking of BIDS Apps where the argument is always supposed to be participant_label when feeding an App a participant argument.

2. The "If `participants.tsv` does not exist, anything goes" proposal

I agree to prefer the participants.tsv is the right way.

3. The "why was it originally this way?" question

I could crawl existing datasets with phenotype data to answer that question, but this is less important than just getting the comprehensive superset rule in there for now.

So this is our resolution for now (right?):

If participants.tsv exists, then the participant_id entries MUST be the union of all subject directories and all participant_id entries found among phenotypic and assessment data in the phenotype/ folder.

bids-standard / bids-specification

Extra participants in phenotype TSVs, but not in data set folders should not be an ERROR in validation #914

Introduction

Option A: Inclusive OR/Union

Option B: Exclusive participants.tsv

Option C: Soften requirement

References

Option D: accept non-included data when made explicit

1. BIDS Spec PR

2. BIDS Validator PR

On the 5th row of table 1

On the 1st row of table 2

1. The comprehensive superset rule

2. The "If participants.tsv does not exist, anything goes" proposal

3. The "why was it originally this way?" question

Option B: Exclusive `participants.tsv`

2. The "If `participants.tsv` does not exist, anything goes" proposal