Open sappelhoff opened 4 years ago
- sex MUST be a string (here we need to discuss, which strings we accept. Most straight forward would perhaps be "male, female, undefined", but I would like somebody with a bit more experience in inclusive language to make a suggestion here.
Typically a third option is other
or diverse
– undefined
could potentially be perceived as disrespectful by some. diverse
could potentially lead to confusion with gender (however it's the official "third sex / gender" recognized by the German gov't). So I guess, at this time, I'd be in favor of other
. The latter would also capture inter
, in case we decide not to add that option too.
age MUST be an integer (years since birth)
I don’t understand the prohibition against decimal ages. It is not uncommon to have exact birthdate and scan date (before anonymisation) with which to compute an age with greater than integer precision.
Thanks @hoechenberger and @nicholst I updated the post with your sensible suggestions
@sappelhoff and @hoechenberger - i would suggest making a clearer difference between sex and gender. in bids, sex was adopted in the early days to be chromosomal, while datasets could easily add columns on gender to include and potentially differentiate preference. the language can definitely be improved to both clarify and provide support for including gender in addition to sex, when such information is collected in a study.
more broadly, #423 has some general conventions about augmenting BIDS keys. my vote would be in preference of the validator suggesting that age/sex
have to have corresponding entries in participants.json
. to map on to known units and levels (as is done in other parts of bids). this allows greater flexibility downstream, especially if BIDS is used for datasets containing in utero participants or non-human species where other age units are more appropriate.
Hello @satra, thanks for your comment! I fully agree that having a separate gender
key could be beneficial in certain situations.
I could imagine sth like:
I fully agree that having a separate gender key could be beneficial in certain situations.
Currently we are limiting this discussion to sex
and age
, which have an undocumented "special role" in the validator (see my OP). I'd rather not extend this discussion to gender
. If users want to specify gender
, they can already do so by adding a column to TSV and describing it in JSON.
Note: for gender
, the validator WILL WARN if no description is in JSON ... while for age
and sex
, it will not.
my vote would be in preference of the validator suggesting that age/sex have to have corresponding entries in participants.json. to map on to known units and levels (as is done in other parts of bids). this allows greater flexibility downstream
If I understand correctly, this preference maps to my suggestion 1 in the "how to fix" part of the original post.
I think that allowing downstream flexibility is a good point.
Another point is, that there are many datasets already out there ... and if we clarify now that we want age
to be a float and sex
to be one of (female, male, other [or whatever we would decide]) that would break several datasets.
Therefore I think it'd be best to:
sex
and age
as well as RECOMMEND (i.e., not REQUIRE) factor levels for sex
bids-validator
to explicitly WARN if either age
or sex
are NOT described in an accompanying JSON file--> This will not break existing datasets --> This will hopefully result in more "sane" datasets in the future
HOWEVER
--> existing datasets with age
and sex
not described in JSON are still "broken" (i.e., cannot be parsed automatically in most cases)
--> users who create new datasets may simply ignore a bids-validator
warning ... and hence produce new datasets where age
or sex
cannot be parsed automatically
I think that's a fair tradeoff and a pragmatic solution.
@sappelhoff - yes i was voting for 1. and my reference to gender was simply that the bids key sex should not be confused with gender.
regarding validator and specification interaction, each dataset has a specification version, so a validator should always be checking against the spec rules of that version. this means that for datasets released prior to any change we make to the spec, it should simply accept it as valid (may not be useful, but that's not the validator's responsibility). once the age/sex descriptions are upgraded, current validator can conform to it.
the validator should always be able to make good recommendations at any time. and a new validator could always recommend how to upgrade the dataset from version X to version Y of the spec.
@sappelhoff These suggestions sound very sane to me.
hi guys, I would actually suggest going for error. It may sound a bit radical but I think there are parts of the BIDS specification that are too under-specified and flexible which makes it a nightmare to write software for it.
At least regarding the sex/gender issue, I think there is no good reason to leave it as optional. IMO it should be a compulsory field and if the researcher feels really compelled to leave it out, then they should fill it up as 'n/a'
. I would also not introduce a new mechanism for unknown/others. We should try to be consistent across the specification.
In that sense, this should be considered a bug-fix ... the datasets that break should be retroactively fixed, otherwise we are stuck with a specification that is too flexible.
I would also not introduce a new mechanism for unknown/others.
But there are cases where the biological sex does not clearly fall into the male
/female
dichotomy, and these are different from unknown
(which would be n/a
to me). Designing a sex
key limited to only two values would be a little like having a handedness key without the option for left-handed participants…
yes sure! Sorry I read too quickly, it's fair to have options for sex/genders other than male
and female
of course.
For some context presently in bids-examples the following show up in the sex field:
ds102 being the sole dataset using D
.
Ages are mostly numeric. The outliers being ieeg_epilepsy and ieeg_epilepsy_ecog which have XX
as the age for their sole subject. genetics_ukbb uses 89+
for 4 subjects.
EDIT: commands used to generate above. csvcut was from the csvkit package.
for file in $(find . -name participants.tsv -exec grep -l sex {} \;); do csvcut -c sex -t $file; done | sort | uniq
for file in $(find . -name participants.tsv -exec grep -l age {} \;); do csvcut -c age -t $file; done | sort | uniq
Also 6 datasets don't have an age
column. 8 don't have a sex
column, 2 of which use gender
instead. Both the datasets using gender have participants.json defining the column. Posting one liners I used in case I made a mistake in them:
~/projects/bids-examples $ find . -name participants.tsv -exec grep -lL age {} \; | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_visual
ds114
eeg_face13
~/projects/bids-examples $ for file in $(find . -name participants.tsv -exec grep -lL sex {} \;); do grep gender -lL $file; done | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_visual
ds114
eeg_face13
~/projects/bids-examples $ for file in $(find . -name participants.tsv -exec grep -lL sex {} \;); do grep gender -l $file; done | cut -d '/' -f 2
ieeg_filtered_speech
eeg_rishikesh
~/projects/bids-examples $ find . -name participants.tsv -exec grep -lL sex {} \; | cut -d '/' -f 2
ieeg_motorMiller2007
ds009
eeg_rest_fmri
ieeg_filtered_speech
eeg_rishikesh
ieeg_visual
ds114
eeg_face13
Thanks @rwblair that's really helpful. Can you get the same info also for those datasets on OpenNeuro? ~And would it be possible to share the script you may have used for extraction for us to try as well?~ [edit: just saw your follow up post, thanks!]
re: @jasmainak's proposal to go one step further and make bids-validator
return an error instead of a warning:
If we were to REQUIRE a set of inputs for sex
and age
, then we would probably be breaking only very few existing datasets if we allowed:
sex
[chromosomal, nothing to do with gender]:
female
, f
, F
, FEMALE
, Female
male
, m
, M
, MALE
, Male
other
, o
, O
, OTHER
, Other
n/a
(which is always allowed)age
[always in years]:
n/a
(as usual)int-int
, e.g., 80-90
, reflecting age between 80 and 90 and int+
reflecting "older than" --> but I am not sure whether this is a nice feature or an unnecessary burdenregarding "downstream flexibility" --> if in the future we see more and more need for additional "levels", we could add them. Else, there is once more the option for a custom column.
thanks a lot @rwblair for jumping in. This is great and @sappelhoff's proposal is very concrete and simple. I'm not sure about int+
either but agree with the rest of it.
@sappelhoff
sex
[chromosomal, nothing to do with gender]:
Wonder if it shouldn't be "phenotypical" instead of "chromosomal", as there's people with XY chromosomes with a female phenotype, because their androgen receptors are non-functional :-) just my 2 ct
Published OpenNeuro participants.tsv files are a little bit noisier but make a case for a stronger validator warning/error
Outliers in age:
ds002345 has a list of ages for each participant in the age column, like 28,28,28,31,31
. No participants.json.
ds002000 has NaN
for all its ages.
ds001365 has positive and negative floating point numbers like -0.994
. No participants.json, but the next column is mean reaction time, so maybe they forgot to add ages in?
Here are the unique values for sex for every public dataset: https://gist.github.com/rwblair/b00cf41d6693b65d67e3c8c1fe73608c
ds001510, ds002655, ds000121, ds000119 Seem to have their age and sex columns swapped.
The ""
result from ds002221 comes from the file not being properly terminated.
How I copied participants files:
~/datasets/participants $ aws s3 cp s3://openneuro.org . --recursive --exclude "*" --include "*participants.tsv" --no-sign-request
Age values by OpenNeuro dataset: https://gist.github.com/rwblair/deabd683a4229277dd69aa54f39adb1c
It looks like age and sex were added as non-custom fields in https://github.com/bids-standard/bids-validator/pull/905. I don't see any justification for special status, but I would assume it's because they appear in the example, but as noted above there's no text describing their meaning or restricting their values.
So I agree with @sappelhoff and @satra that we should revert the validator to warning when they do not have metadata in participants.json
.
In the long term, I think adding BIDS terms specifying an official default interpretation (#423) is a good idea, and that these terms should be linked to from the spec itself.
I'll go ahead and make a PR to revert those changes in the validator.
I was curious to see how many datasets would start issuing warnings, out of 194 datasets that have age in their participants.tsv, age only appears in 48 of their participants.json. For sex the numbers used in participants.tsv are 175 in and 48 in participants.json. Over all I'm seeing participants.tsv in 244 datasets and participants.json in 66.
TLDR
In the
participants.tsv
file, theage
andsex
columns are sometimes not well defined, and this leads to (unnecessary) issues on the side of tool developers (and thus eventually the users). We should improve either the spec or the validator, or both.cc @jasmainak @agramfort @adam2392 @hoechenberger
came up in: https://github.com/mne-tools/mne-bids/issues/396
Intro
The specification says the following about the participants.tsv file:
so strictly speaking, all columns that are not
participant_id
are OPTIONAL, and thus SHOULD be described in an accompanyingparticipants.json
.For optional columns that are not described, the validator currently emits a warning such as this:
Yet, the validator treats some "optional" columns differently, i.e., these columns are accepted WITHOUT warning. Examples of these are:
However, the specification does not cover that these two variables are "expected optional columns". The expected behavior would be to raise a warning also for age and sex.
I could not pin down the exact part of the validator that is responsible for this behavior, but it may be this line:
https://github.com/bids-standard/bids-validator/blob/dfabbfb058daca406ed1d0897c3a25be059a5ad6/bids-validator/utils/summary/collectSubjectMetadata.js#L31
perhaps @nellh or @rwblair can help
The problem
The issue that arises from this (apart from inconsistency) is that users define their own levels for the
sex
column, and are NOT reminded by the validator to please define their levels further in aparticipant.json
.As a result, these values are hard (or impossible) to parse by software.
E.g., we may have the following
participants.json
:what's
fem
? what'sma
?How to fix?
I think we should do one of the following:
participants.tsv
but have no description in an accompanyingparticipants.json
OR
participants.tsv
part of specification and explicitly say that age and sex are "to-be-expected" columns ... and then also define the expected inputs:age_in_months