bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

Classify "phenotype/" as a datatype directory with no subject/session parent #1828

Open effigies opened 1 month ago

effigies commented 1 month ago

phenotype/ is a bit of an outlier in BIDS terms. Other folders at the top level are either entities (sub-<label>) or their contents are opaque to BIDS. phenotype/, on the other hand, is a collection of .tsv/.json files that are to be validated on the same terms as participants.tsv/participants.json.

I would suggest classifying phenotype as a datatype, distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply. BEP036 seems to go some way in a similar direction, permitting a pheno/ datatype within subjects/sessions.

In the (unmerged) PR #1672, I suggest using phenotype as a datatype for the purposes of filename validation, and then carve out some exceptions that allow us to use it that way without it being an official datatype. If we make it a datatype, then the exception can be removed. That it fits with very little modification to the schema and validation (https://github.com/bids-standard/bids-validator/pull/1957), seems to me to be an argument for this classification.

The alternative, as I see it, is to consider phenotype a completely unique category of thing, and all implementations will need to have special code for handling it.
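As an illustration of the classification proposed above, here is a minimal sketch of a path classifier in which phenotype is an ordinary datatype that simply has no subject/session parent. The datatype sets and the `classify` function are hypothetical names for this example, not the BIDS schema or the validator's actual code.

```python
from pathlib import PurePosixPath

# Assumption: small illustrative subsets; the authoritative list lives in the BIDS schema.
SUBJECT_DATATYPES = {"anat", "func", "dwi", "fmap", "beh"}
ROOT_DATATYPES = {"phenotype"}  # datatype that spans subjects, so no sub-/ses- parent


def classify(path):
    """Return subject/session/datatype for a dataset-relative path, or None."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        return None  # a file directly at the dataset root has no datatype directory
    dirs = parts[:-1]
    subject = session = None
    depth = 0
    if dirs[0].startswith("sub-"):
        subject, depth = dirs[0], 1
        if len(dirs) > 1 and dirs[1].startswith("ses-"):
            session, depth = dirs[1], 2
    if len(dirs) != depth + 1:
        return None  # expect exactly one datatype directory after the entities
    datatype = dirs[depth]
    allowed = SUBJECT_DATATYPES if subject else ROOT_DATATYPES
    if datatype not in allowed:
        return None
    return {"subject": subject, "session": session, "datatype": datatype}
```

Under these assumptions, `phenotype/demographics.tsv` goes through the same code path as `sub-01/anat/...`, and the only difference is which set of datatypes is allowed at that level. Allowing phenotype inside subject directories later would amount to adding it to the second set.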

effigies commented 1 month ago

@ericearl @surchs I would appreciate your opinions on this. I think this would complement BEP036, in that subject- or session-specific phenotypes would naturally go inside a sub-<label>/[ses-<label>/]phenotype/ directory.

surchs commented 1 month ago

Thanks @effigies for the ping!

I would suggest classifying phenotype as a datatype, distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply. BEP036 seems to go some way in a similar direction, permitting a pheno/ datatype within subjects/sessions.

Yes, we did discuss this option as the "segregated" representation, i.e. "put pheno data in the leaves of the tree" or, as you say, "treat pheno as any other data type".

In the current version of BEP036 I am (we are?) leaning towards excluding the "segregated" option in favour of the "aggregated" option of a root-level /phenotype directory. For me there are three arguments for going with "aggregated" over "segregated":

I think this would complement BEP036, in that subject- or session-specific phenotypes would naturally go inside a sub-

I can see how treating phenotype like any other datatype would make things easier from a BIDS perspective. But at the moment, /phenotype at the root level is allowed in the BIDS spec. Should we allow both the root-level phenotype directory and the sub-<label>/[ses-<label>/]phenotype/ directory in the same dataset? I guess not many BIDS datasets (that I have seen) make use of the root-level phenotype directory yet. So if we only allow sub-<label>/[ses-<label>/]phenotype/, that might work. But I think that'll be quite the hill to climb for users who want to organize their data in BIDS, and also for users who want to later do analysis on someone else's data and first have to aggregate things again.

Maybe I'm overcautious here - but in my experience phenotypic data can be the messiest part of a dataset and are often acquired / handled by non-technical people in a research team. So I'm a bit concerned about transforming data and storing them in a way that makes it easy for hard-to-detect inconsistencies to sneak in.
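To make the re-aggregation concern concrete, here is a rough sketch of what a downstream user would have to do with "segregated" files: merge per-subject single-row tables and look for inconsistencies along the way. The function name, the in-memory input format, and the specific checks are assumptions for illustration only.

```python
import csv
import io


def aggregate(per_subject_tsv):
    """Merge per-subject TSV texts (header plus row) into one table.

    per_subject_tsv maps a participant label to TSV text. Returns the
    reference header, the merged rows, and a list of detected problems.
    """
    expected = None
    rows, problems = [], []
    for pid in sorted(per_subject_tsv):
        reader = csv.DictReader(io.StringIO(per_subject_tsv[pid]), delimiter="\t")
        cols = reader.fieldnames
        if expected is None:
            expected = cols  # first file defines the reference header
        elif cols != expected:
            # e.g. "Age" vs "age": easy to introduce when files are edited one by one
            problems.append(f"{pid}: columns {cols} differ from {expected}")
            continue
        for row in reader:
            if row.get("participant_id") != pid:
                problems.append(f"{pid}: row labelled {row.get('participant_id')!r}")
            rows.append(row)
    return expected, rows, problems
```

In the aggregated representation a header mismatch of this kind cannot occur, because there is only one header.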

I think @barbarastrasser, you also commented on BEP036 about this topic because of your use cases, maybe you could add your thoughts too.

effigies commented 1 month ago

I am not presently trying to make phenotype valid within subject, just to determine if it is a datatype. That it allows us the possibility of enabling it at lower levels if the use cases are compelling seems like an argument that it is that kind of thing. Saying so would not obligate us to define files with this datatype that show up in subject/session directories.

We cannot disable it at the root level in BIDS 1.x, in any case.

surchs commented 1 month ago

Ah OK, guess I misread your question.

distinct from other datatypes only in that it spans multiple subjects, and so the subject and session entities do not apply

So you are proposing to turn phenotype into a BIDS datatype, but a special one that (for now) only exists at the directory root - unlike other BIDS datatypes that only exist in the <sub-<label>/[ses-<label>, yes?

If we make it a datatype, then the exception can be removed. That it fits with very little modification to the schema and validation (https://github.com/bids-standard/bids-validator/pull/1957), seems to me to be an argument for this classification.

I'm not very familiar with the BIDS schema or what the implication of such a change would be. From looking at your PR, my limited understanding is that the proposed change allows you to do some general checks for .tsv and .json files in a root level /phenotype directory. That makes sense to me, especially if it reduces special cases.
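For readers unfamiliar with what "general checks" could mean here, the following is a small sketch of the kind of checks one might run on the .tsv and .json files in a root-level phenotype directory, in the spirit of how participants.tsv/participants.json are handled. The function name, the in-memory input, and the exact rules are assumptions for this example; they are not taken from the linked PR or from the validator.

```python
import csv
import io
import json


def check_phenotype_files(files):
    """files maps file names inside a phenotype/ directory to their text content."""
    issues = []
    for name in sorted(files):
        if not name.endswith(".tsv"):
            continue
        header = next(csv.reader(io.StringIO(files[name]), delimiter="\t"), [])
        if "participant_id" not in header:
            issues.append(f"{name}: missing participant_id column")
        sidecar = name[: -len(".tsv")] + ".json"
        if sidecar not in files:
            issues.append(f"{name}: no sidecar {sidecar}")
            continue
        described = json.loads(files[sidecar])
        for col in header:
            # Assumed rule: every non-identifier column has an entry in the sidecar.
            if col != "participant_id" and col not in described:
                issues.append(f"{name}: column {col!r} not described in {sidecar}")
    return issues
```

The point of the sketch is that none of these checks depends on where the directory sits in the tree, which is why a generic datatype rule could cover them.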

Maybe @ericearl would be better here to give feedback.

effigies commented 1 month ago

So you are proposing to turn phenotype into a BIDS datatype, but a special one that (for now) only exists at the directory root - unlike other BIDS datatypes that only exist in the <sub-<label>/[ses-<label>, yes?

Correct.

barbarastrasser commented 1 month ago

Hi everyone,

Maybe first some high-level thoughts on the aggregated vs. segregated approach:

I think it depends a bit on how you look at the data. Is the aim to describe a participant in depth (maybe also interesting when looking up imaging and pheno data across datasets), or is the aim to describe a dataset in depth? For the former, it might be easier if everything that is collected is structured in the same segregated way, especially when thinking about designing software for automatic querying etc. For the latter, a phenotype directory in the dataset root should be sufficient, in my impression.

But I also understand the user perspective. I agree that the aggregated format is the way people acquire phenotype data most of the time, and that it might be easier for them to handle than storing single rows, which is error-prone.

However, an issue I have witnessed with the current specification is that it is not flexible enough to satisfy the needs of researchers. I know that there are individual efforts going on to split the aggregated data row-wise and store each single line in the sub-<label>/[ses-<label>/]beh directory, since there is no other way to store the data in a BIDS-compliant way.
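The row-wise splitting described above can be sketched roughly as follows. This is only an illustration of the workaround, assuming one row per participant; the function name is hypothetical and no particular output file name inside the beh directory is implied.

```python
import csv
import io


def split_rows(aggregated_tsv):
    """Split an aggregated TSV into one single-row TSV text per participant."""
    reader = csv.DictReader(io.StringIO(aggregated_tsv), delimiter="\t")
    out = {}
    for row in reader:
        pid = row["participant_id"]
        if pid in out:
            # Assumption: one row per participant; repeated measures need another key.
            raise ValueError(f"duplicate row for {pid}")
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()  # every fragment repeats the full header
        writer.writerow(row)
        out[pid] = buf.getvalue()
    return out
```

Each fragment carries its own copy of the header, so the column definitions are duplicated once per participant.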

The specific problems we encountered were that the way validation is currently handled does not allow for

I think people will use the phenotype directory more, whether it is in the subject directory or in the root directory, as long as there is flexibility to deal with cases like the ones described above.