bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International
273 stars 156 forks source link

Improve spec/validation of participants.tsv #458

Open sappelhoff opened 4 years ago

sappelhoff commented 4 years ago

TLDR

In the participants.tsv file, the age and sex columns are sometimes not well defined, and this leads to (unnecessary) issues on the side of tool developers (and thus eventually the users). We should improve either the spec or the validator, or both.

cc @jasmainak @agramfort @adam2392 @hoechenberger

came up in: https://github.com/mne-tools/mne-bids/issues/396

Intro

The specification says the following about the participants.tsv file:

In case of single session studies this file has one compulsory column participant_id that consists of sub-

so strictly speaking, all columns that are not participant_id are OPTIONAL, and thus SHOULD be described in an accompanying participants.json.

For optional columns that are not described, the validator currently emits a warning such as this:

1: [WARN] Tabular file contains custom columns not described in a data dictionary (code: 82 - CUSTOM_COLUMN_WITHOUT_DESCRIPTION)
  ./participants.tsv
    Evidence: Columns: group not defined, please define in: /participants.json

Yet, the validator treats some "optional" columns differently, i.e., these columns are accepted WITHOUT warning. Examples of these are:

However, the specification does not cover that these two variables are "expected optional columns". The expected behavior would be to raise a warning also for age and sex.

I could not pin down the exact part of the validator that is responsible for this behavior, but it may be this line:

https://github.com/bids-standard/bids-validator/blob/dfabbfb058daca406ed1d0897c3a25be059a5ad6/bids-validator/utils/summary/collectSubjectMetadata.js#L31

perhaps @nellh or @rwblair can help

The problem

The issue that arises from this (apart from inconsistency) is that users define their own levels for the sex column, and are NOT reminded by the validator to please define their levels further in a participant.json.

As a result, these values are hard (or impossible) to parse by software.

E.g., we may have the following participants.json:

participant_id  age sex
sub-05  25  fem
sub-06  30  ma
sub-07  26  ma

what's fem? what's ma?

How to fix?

I think we should do one of the following:

  1. fix the validator so that it emits a warning if age and sex are columns in participants.tsv but have no description in an accompanying participants.json

OR

  1. Amend the participants.tsv part of specification and explicitly say that age and sex are "to-be-expected" columns ... and then also define the expected inputs:
hoechenberger commented 4 years ago
  • sex MUST be a string (here we need to discuss, which strings we accept. Most straight forward would perhaps be "male, female, undefined", but I would like somebody with a bit more experience in inclusive language to make a suggestion here.

Typically a third option is other or diverseundefined could potentially be perceived as disrespectful by some. diverse could potentially lead to confusion with gender (however it's the official "third sex / gender" recognized by the German gov't). So I guess, at this time, I'd be in favor of other. The latter would also capture inter, in case we decide not to add that option too.

nicholst commented 4 years ago

age MUST be an integer (years since birth)

I don’t understand the prohibition against decimal ages. It is not uncommon to have exact birthdate and scan date (before anonymisation) with which to compute an age with greater than integer precision.

sappelhoff commented 4 years ago

Thanks @hoechenberger and @nicholst I updated the post with your sensible suggestions

satra commented 4 years ago

@sappelhoff and @hoechenberger - i would suggest making a clearer difference between sex and gender. in bids, sex was adopted in the early days to be chromosomal, while datasets could easily add columns on gender to include and potentially differentiate preference. the language can definitely be improved to both clarify and provide support for including gender in addition to sex, when such information is collected in a study.

more broadly, #423 has some general conventions about augmenting BIDS keys. my vote would be in preference of the validator suggesting that age/sex have to have corresponding entries in participants.json. to map on to known units and levels (as is done in other parts of bids). this allows greater flexibility downstream, especially if BIDS is used for datasets containing in utero participants or non-human species where other age units are more appropriate.

hoechenberger commented 4 years ago

Hello @satra, thanks for your comment! I fully agree that having a separate gender key could be beneficial in certain situations.

I could imagine sth like:

sappelhoff commented 4 years ago

I fully agree that having a separate gender key could be beneficial in certain situations.

Currently we are limiting this discussion to sex and age, which have an undocumented "special role" in the validator (see my OP). I'd rather not extend this discussion to gender. If users want to specify gender, they can already do so by adding a column to TSV and describing it in JSON.

Note: for gender, the validator WILL WARN if no description is in JSON ... while for age and sex, it will not.

my vote would be in preference of the validator suggesting that age/sex have to have corresponding entries in participants.json. to map on to known units and levels (as is done in other parts of bids). this allows greater flexibility downstream

If I understand correctly, this preference maps to my suggestion 1 in the "how to fix" part of the original post.

I think that allowing downstream flexibility is a good point.

Another point is, that there are many datasets already out there ... and if we clarify now that we want age to be a float and sex to be one of (female, male, other [or whatever we would decide]) that would break several datasets.

Therefore I think it'd be best to:

  1. Clarify the specification what we mean by sex and age as well as RECOMMEND (i.e., not REQUIRE) factor levels for sex
  2. Amend the bids-validator to explicitly WARN if either age or sex are NOT described in an accompanying JSON file

--> This will not break existing datasets --> This will hopefully result in more "sane" datasets in the future

HOWEVER

--> existing datasets with age and sex not described in JSON are still "broken" (i.e., cannot be parsed automatically in most cases)

--> users who create new datasets may simply ignore a bids-validator warning ... and hence produce new datasets where age or sex cannot be parsed automatically

I think that's a fair tradeoff and a pragmatic solution.

satra commented 4 years ago

@sappelhoff - yes i was voting for 1. and my reference to gender was simply that the bids key sex should not be confused with gender.

regarding validator and specification interaction, each dataset has a specification version, so a validator should always be checking against the spec rules of that version. this means that for datasets released prior to any change we make to the spec, it should simply accept it as valid (may not be useful, but that's not the validator's responsibility). once the age/sex descriptions are upgraded, current validator can conform to it.

the validator should always be able to make good recommendations at any time. and a new validator could always recommend how to upgrade the dataset from version X to version Y of the spec.

hoechenberger commented 4 years ago

@sappelhoff These suggestions sound very sane to me.

jasmainak commented 4 years ago

hi guys, I would actually suggest going for error. It may sound a bit radical but I think there are parts of the BIDS specification that are too under-specified and flexible which makes it a nightmare to write software for it.

At least regarding the sex/gender issue, I think there is no good reason to leave it as optional. IMO it should be a compulsory field and if the researcher feels really compelled to leave it out, then they should fill it up as 'n/a'. I would also not introduce a new mechanism for unknown/others. We should try to be consistent across the specification.

jasmainak commented 4 years ago

In that sense, this should be considered a bug-fix ... the datasets that break should be retroactively fixed, otherwise we are stuck with a specification that is too flexible.

hoechenberger commented 4 years ago

I would also not introduce a new mechanism for unknown/others.

But there are cases where the biological sex does not clearly fall into the male/female dichotomy, and these are different from unknown (which would be n/a to me). Designing a sex key limited to only two values would be a little like having a handedness key without the option for left-handed participants…

jasmainak commented 4 years ago

yes sure! Sorry I read too quickly, it's fair to have options for sex/genders other than male and female of course.

rwblair commented 4 years ago

For some context presently in bids-examples the following show up in the sex field:

ds102 being the sole dataset using D.

Ages are mostly numeric. The outliers being ieeg_epilepsy and ieeg_epilepsy_ecog which have XX as the age for their sole subject. genetics_ukbb uses 89+ for 4 subjects.

EDIT: commands used to generate above. csvcut was from the csvkit package.

for file in $(find . -name participants.tsv -exec grep -l sex {} \;); do csvcut -c sex -t $file; done | sort | uniq
for file in $(find . -name participants.tsv -exec grep -l age {} \;); do csvcut -c age -t $file; done | sort | uniq
rwblair commented 4 years ago

Also 6 datasets don't have an age column. 8 don't have a sex column, 2 of which use gender instead. Both the datasets using gender have participants.json defining the column. Posting one liners I used in case I made a mistake in them:

~/projects/bids-examples $ find . -name participants.tsv -exec grep -lL age {} \; | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_visual
ds114
eeg_face13

~/projects/bids-examples $ for file in $(find . -name participants.tsv -exec grep -lL sex {} \;); do grep gender -lL $file; done | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_visual
ds114
eeg_face13

~/projects/bids-examples $ for file in $(find . -name participants.tsv -exec grep -lL sex {} \;); do grep gender -l $file; done | cut -d '/' -f 2
ieeg_filtered_speech
eeg_rishikesh

~/projects/bids-examples $ find . -name participants.tsv -exec grep -lL sex {} \; | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_filtered_speech
eeg_rishikesh
ieeg_visual
ds114
eeg_face13
sappelhoff commented 4 years ago

Thanks @rwblair that's really helpful. Can you get the same info also for those datasets on OpenNeuro? ~And would it be possible to share the script you may have used for extraction for us to try as well?~ [edit: just saw your follow up post, thanks!]

re: @jasmainak's proposal to go one step further and make bids-validator return an error instead of a warning:

If we were to REQUIRE a set of inputs for sex and age, then we would probably be breaking only very few existing datasets if we allowed:

regarding "downstream flexibility" --> if in the future we see more and more need for additional "levels", we could add them. Else, there is once more the option for a custom column.

jasmainak commented 4 years ago

thanks a lot @rwblair for jumping in. This is great and @sappelhoff's proposal is very concrete and simple. I'm not sure about int+ either but agree with the rest of it.

hoechenberger commented 4 years ago

@sappelhoff

sex [chromosomal, nothing to do with gender]:

Wonder if it shouldn't be "phenotypical" instead of "chromosomal", as there's people with XY chromosomes with a female phenotype, because their androgen receptors are non-functional :-) just my 2 ct

rwblair commented 4 years ago

Published OpenNeuro participants.tsv files are a little bit noisier but make a case for a stronger validator warning/error

Outliers in age: ds002345 has a list of ages for each participant in the age column, like 28,28,28,31,31. No participants.json. ds002000 has NaN for all its ages. ds001365 has positive and negative floating point numbers like -0.994. No participants.json, but the next column is mean reaction time, so maybe they forgot to add ages in?

Here are the unique values for sex for every public dataset: https://gist.github.com/rwblair/b00cf41d6693b65d67e3c8c1fe73608c

ds001510, ds002655, ds000121, ds000119 Seem to have their age and sex columns swapped.

The "" result from ds002221 comes from the file not being properly terminated.

How I copied participants files: ~/datasets/participants $ aws s3 cp s3://openneuro.org . --recursive --exclude "*" --include "*participants.tsv" --no-sign-request

rwblair commented 4 years ago

Age values by OpenNeuro dataset: https://gist.github.com/rwblair/deabd683a4229277dd69aa54f39adb1c

effigies commented 4 years ago

It looks like age and sex were added as non-custom fields in https://github.com/bids-standard/bids-validator/pull/905. I don't see any justification for special status, but I would assume it's because they appear in the example, but as noted above there's no text describing their meaning or restricting their values.

So I agree with @sappelhoff and @satra that we should revert the validator to warning when they do not have metadata in participants.json.


In the long term, I think adding BIDS terms specifying an official default interpretation (#423) is a good idea, and that these terms should be linked to from the spec itself.

rwblair commented 4 years ago

I'll go ahead and make a PR to revert those changes in the validator.

I was curious to see how many datasets would start issuing warnings, out of 194 datasets that have age in their participants.tsv, age only appears in 48 of their participants.json. For sex the numbers used in participants.tsv are 175 in and 48 in participants.json. Over all I'm seeing participants.tsv in 244 datasets and participants.json in 66.