How do I find Healthy Controls using AIRR fields?

bcorrie commented 3 years ago

Since Healthy isn't a disease, there is no DOID entry for it. So you can't say diagnosis.disease_diagnosis = Healthy.

It isn't safe to assume that diagnosis.disease_diagnosis == null means that the subject is healthy at this time point, because the data might simply be missing.

iReceptor currently uses an internal controlled vocabulary for study.study_description to capture which study arm a subject is in. Obviously not ideal as this is not consistent. I think this needs to be sorted out, as this is a pretty important distinction to be able to make and we have had several users ask about how to determine whether a given repertoire from a given individual is a healthy control data set or not... Maybe as part of the OntoVoc sprint?

bussec commented 3 years ago

This is the use case for which we included the diagnosis.study_group_description field in MiAIRR. Of course, it does not solve the problem of which vocabulary to use, and I am having second thoughts about whether it should be part of Diagnosis rather than part of Subject.

Regarding the term, would this section of NCIT do: [NCIT:C70665]? This is specific to human subjects, but in our current schema this would be OK, as DOID is also human-specific, which makes Diagnosis as a whole human-specific...

bcorrie commented 3 years ago

We use an internal controlled vocabulary in this field for this purpose, but it is relevant only to studies that are in iReceptor repositories. That means you can't use it to compare data between repositories, which is why it would be nice to have something structured. I am not sure the NCIT ontology works. I think we have Case and Control, but controls are not always healthy, and Healthy is the only "Control" category in the ontology (that I can find). For example, what if you have a study of people with COVID-19, some of whom receive a vaccine and some of whom don't? Your controls are not healthy in this case.

We would use Case or Control and then decorate that in parentheses with more information. So a healthy control would be Control (Healthy) whereas in the COVID-19 vaccination study we would have something like Case (Immunized) and Control (Not Immunized)
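To illustrate the convention described above, here is a minimal sketch of how a consumer might split such a decorated value back into its two parts. The function name `parse_study_group` and the `StudyGroup` type are hypothetical, and the pattern assumes the only base terms are "Case" and "Control" with an optional qualifier in parentheses; it is not part of iReceptor or the AIRR spec.

```python
import re
from typing import NamedTuple, Optional


class StudyGroup(NamedTuple):
    group: str                # base term: "Case" or "Control"
    qualifier: Optional[str]  # e.g. "Healthy", "Not Immunized"; None if absent


# Assumed format: base term, optionally followed by "(qualifier)".
_PATTERN = re.compile(r"^\s*(Case|Control)\s*(?:\(\s*(.+?)\s*\))?\s*$")


def parse_study_group(value: str) -> Optional[StudyGroup]:
    """Split a decorated study group string; return None if it does not match."""
    match = _PATTERN.match(value)
    if match is None:
        return None
    return StudyGroup(match.group(1), match.group(2))


print(parse_study_group("Control (Healthy)"))
print(parse_study_group("Case (Immunized)"))
```

Anything that does not follow the assumed pattern comes back as `None`, which makes the fragility of a free-text convention visible to the caller.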

bussec commented 3 years ago

That's an important point, we have also struggled with this in the past (didn't really resolve it). Looking at NCIT again, a lot of these terms seem to come from CDISC, and we are potentially not the first ones who would like to have a light-weight version of this. OBI would have some encoding here: [OBI:0000097].

And I would prefer multiple fields instead of putting part of the information as a qualifier in parentheses... seems hard to parse in the long term.

bcorrie commented 3 years ago

And I would prefer multiple fields instead of putting part of the information as a qualifier in parentheses... seems hard to parse in the long term.

Absolutely - this just solved the immediate problem that we had with users trying to solve the problem with the fields we have currently... 8-)

If you look at the data handling for this paper: https://doi.org/10.1016/j.immuni.2020.12.011 in the section "Single-Cell Immune Repertoire Analysis" they describe how they found the data using iReceptor and the ADC API. They searched for subject.diagnosis.disease_diagnosis.id == DOID:0080600 to find repertoires from subjects that had COVID-19 - which is exactly what you want. Precise, searching on an ontology ID. But to find healthy controls, they needed to do a search for subject.diagnosis.study_group_description == "Control (Healthy)", which only works on our repositories (because it isn't in the spec and uses our curation process). The only way they knew to do this was to actually look at the data and infer that this was what this field means (in actual fact they consulted with us before making this decision). 8-)
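For readers who want to reproduce the two searches, here is a hedged sketch of the request bodies. It assumes the ADC API filter format (`op` / `content` / `field` / `value`) and the AIRR schema field paths under `subject.diagnosis`; the helper names `adc_filter` and `repertoire_query` are made up for illustration. The resulting JSON would be sent as a POST body to a repository's repertoire endpoint, which is left out here.

```python
import json


def adc_filter(field, op, value):
    """Build a single ADC API filter clause (hypothetical helper)."""
    return {"op": op, "content": {"field": field, "value": value}}


def repertoire_query(*conditions):
    """Wrap one or more clauses in a query body, AND-ing them if several."""
    if len(conditions) == 1:
        filters = conditions[0]
    else:
        filters = {"op": "and", "content": list(conditions)}
    return {"filters": filters}


# Precise: matches on an ontology ID, works on any compliant repository.
covid_query = repertoire_query(
    adc_filter("subject.diagnosis.disease_diagnosis.id", "=", "DOID:0080600")
)

# Fragile: relies on iReceptor's internal curation vocabulary, so it only
# finds healthy controls in repositories that follow that convention.
healthy_control_query = repertoire_query(
    adc_filter("subject.diagnosis.study_group_description", "=", "Control (Healthy)")
)

print(json.dumps(covid_query, indent=2))
print(json.dumps(healthy_control_query, indent=2))
```

The contrast between the two queries is the point of this issue: the first is portable, the second is a string match against one group's curation practice.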

We had another user ask how to do this this week (which drove me to ask this question) - so this is clearly something that is pretty important. People are using this to find data and we need to make sure this is described accurately so they aren't finding data that is not what they think it is!!!

schristley commented 3 years ago

There's also longitudinal data to consider, in that a subject may be Healthy at one time point but not Healthy at a later time point, or vice versa. We have this use case right now with our cervical cancer study. This would suggest that a single Subject tag is not sufficient. We put multiple entries in diagnosis, but the link to the appropriate sample entries is tenuous.

bcorrie commented 3 years ago

Yes, the array of diagnoses per subject makes sense, but there is no real link to the sample processing. A single diagnosis might more appropriately be tied to the sample array from a subject. That way you still get an array of diagnoses per subject but you have the ability to capture a different diagnosis for each sample. The samples are gathered at time points...

Do we also not have the use case where the diagnosis is actually tied to the tissue? That is some sequence data might be tied to healthy tissue but other data might be tied to diseased tissue, all at the same time point???

schristley commented 3 years ago

Do we also not have the use case where the diagnosis is actually tied to the tissue? That is some sequence data might be tied to healthy tissue but other data might be tied to diseased tissue, all at the same time point???

disease_state_sample is used for that.

schristley commented 2 years ago

IEDB has the same issue and they've created an internal identifier:

we use an internal identifier that we coined for healthy, ONTIE [ONTIE:0003423]. We use "host health status" as the highest node and integrate disease ontology terms, healthy, infection without disease, and animal models of disease into a single OWL file/tree view.

bcorrie commented 10 months ago

Although the end user is trying to find this using the ADC API, the fundamental issue is an AIRR Standards issue - how to identify when a Repertoire is from a Healthy subject. So I think this is an AIRR 2.0 issue.

FWIW, I had another user ask this same question yesterday and I had to provide my "you can use the iReceptor internal curation protocol and its controlled vocabulary" answer, which finds some (but not all) Healthy Control repertoires in the ADC.

bcorrie commented 7 months ago

Are we considering this for resolution in v2.0? No real action on it since Aug 2021?

bussec commented 7 months ago

Going through this discussion again, there are clearly two pieces of information that are distinct from each other:

The diagnoses that are documented for a subject
The arm of the study a subject was placed in

As already discussed, a subject in the "control" group would often be without the disease in question (for an observational trial), but that does not exclude the possibility that the subject suffers from diseases that were considered irrelevant for the purposes of the study. In addition, diagnoses might be unreported, but there is very little we could do to handle this.

So, given that we allow for multiple timestamped Diagnosis records (per #749), if we moved study_group_description up from Diagnosis to Subject and made it ontology controlled, would that solve the issue?

bcorrie commented 1 month ago

I have moved study_group_description to Subject, and changed Diagnosis.study_group_description to Diagnosis.diagnosis_description.

I also created a Subject.study_group as a controlled vocabulary containing currently "Case", "Control", "Control (Healthy)" as a placeholder for capturing this. So we might have:

Subject.study_group = Control (Healthy)
Subject.study_group_description = Healthy control (no T1D diagnosis at enrollment)
Diagnosis.diagnosis_description = Capture of T1D status at subject enrollment
Diagnosis.disease_diagnosis = null

Thoughts?

I suppose in the healthy control condition, there could be no diagnosis at all.
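As a rough sketch of how a consumer could use the proposed field, the function below classifies a subject from `study_group` alone and deliberately does not treat a null or missing diagnosis as evidence of health. The field names mirror the placeholder proposal above and are not a released part of the AIRR schema; the function name `health_status` and its return labels are hypothetical.

```python
def health_status(subject: dict) -> str:
    """Classify a subject using only the proposed study_group vocabulary.

    A null disease_diagnosis is ignored on purpose: it may mean "healthy"
    or simply "not recorded", so it cannot be used to infer health.
    """
    group = subject.get("study_group")
    if group == "Control (Healthy)":
        return "healthy_control"
    if group == "Control":
        return "control"
    if group == "Case":
        return "case"
    return "unknown"


# The example from the comment above, written as a dict.
t1d_control = {
    "study_group": "Control (Healthy)",
    "study_group_description": "Healthy control (no T1D diagnosis at enrollment)",
    "diagnosis": [
        {
            "diagnosis_description": "Capture of T1D status at subject enrollment",
            "disease_diagnosis": None,
        }
    ],
}

# No study group recorded and a null diagnosis: status stays unknown.
unlabelled = {"diagnosis": [{"disease_diagnosis": None}]}

print(health_status(t1d_control))
print(health_status(unlabelled))
```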

bcorrie commented 1 month ago

I used a controlled vocabulary for now. Is there an ontology that captures the study arm that we could use here instead?

airr-community / airr-standards

How do I find Healthy Controls using AIRR fields? #516