EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Manual curation for 2024.06 release #423

Closed: apriltuesday closed 2 months ago

apriltuesday commented 2 months ago

Refer to documentation for full description of steps.

Checklist:

apriltuesday commented 2 months ago

@Dona094 @tcezard The curation spreadsheet is now ready here.

There are a ton of terms like ttn-related condition; these seem to come from a single large submission in March (example)... I'm not sure there's anything we should do besides curate these as usual, but let me know if you have a suggestion.

tcezard commented 2 months ago

I've performed the curation with limited success:

I have some reservations about curating the gene-related conditions:

  1. There are 5754 traits that refer to a gene-associated/gene-related condition
  2. Only 2 have a MedGen concept associated
  3. When we can map to something useful, it is a complicated process and the resulting mapping carries little information (see example below)

I've looked in detail at the first of these conditions (ttn-related condition) to get an idea of how to annotate this type of trait. ttn-related condition means that we're looking at any condition related to the TTN gene. Looking at the TTN gene on Medline, it seems that the associated conditions are all myopathies or dystrophies. TTN-related myopathy could be a good term to map to, but it only refers to myopathy, although some of its children are dystrophies as well. This term also does not include all the TTN-related conditions among its children. For example, Medline mentions "Hereditary myopathy with early respiratory failure", which looks more like congenital myopathy 21 with early respiratory failure, a sibling of TTN-related myopathy.

The next one, rai1-related condition, is even more complicated: Medline indicates (https://medlineplus.gov/genetics/gene/rai1/) three conditions related to this gene.

We could create a new term that includes the 3 conditions, but again we can't be sure we have captured all the potential conditions associated with that gene.

The process could also be somewhat automated:

Stepping back, I'm also not convinced of the value that these annotations will bring to Open Targets. The point of ClinVar is to associate variants with conditions/diseases. Here we only have an association between variants and a gene, which we would have anyway using VEP. Finding or creating the right ontology term will associate the variant with a very high-level ontology term, providing very little information to Open Targets.

I suggest that we review this with Open Targets and potentially with EFO.

M-casado commented 2 months ago

@apriltuesday & @tcezard & @Dona094 - I have taken a look at these gene-related labels, and I believe it all comes down to poor term curation at the source and the lack of a parent ontology term.

In essence, these labels are a combination of (e.g.) CFTR mutation carrier status (EFO:0021794) and disease (EFO:0000408). They simply tell us that there is a phenotype and a gene variant, but not the relationships between them nor what the disease is. And without this part, I think the value of these annotations drops dramatically.

We cannot, and should not, infer subtypes from parent terms. In terms of knowledge representation, as @tcezard already mentioned above, we cannot be sure whether the modifier "related" implies anything beyond the presence of the mutation. The problem here is that we are missing the source's point of view on when "related" is used. Why would a disease be "related" to a gene, unless they know that the gene variation is causing the phenotype? How do they assess that without a very thorough examination and sequencing? The easiest route would be to see a "common consequence of a gene variant" (e.g. myopathy) alongside a mutation in that gene (e.g. Titin), but if you (the source) know the phenotype, it's no longer a plain "condition". I feel like somewhere along the curation of the data these bits are lost and not submitted to ClinVar. There is a convergence point that just groups everything as "condition" and "related", and dumps everything into ClinVar.

The way I see it, we have two alternatives:

  1. Create parent terms for the conditions. We could create "placeholder" terms in MONDO/EFO (if they allow for these), where anything "related" to a gene is allowed. I would advise against this, since we would not be solving the issue, just letting it continue to the next layer. The created term would "truly" represent the labels, but the labels would be barely informative.
  2. Don't propagate ambiguous annotations. There are lots of annotations in ClinVar that we have yet to curate and propagate to OT, and these would be added to that list. I would follow this route and perhaps, given the huge number of these terms, contact the source so they can clarify and properly curate their submissions. Or, at least, provide us with their reasoning on how to curate them.

There is a third option, which I am strongly against: mapping parent terms to subtypes. I noticed this was the case with ttn-related condition and cftr-related disorders. We should be careful about creating patches that could lead to semantic errors in the future, and that may slip unnoticed through future manual curation rounds because they were once accepted by a curator.

Below there are some terms that I took a look at specifically, but the trend is easy to find in all others.

ttn-related condition

cftr-related disorders

pkd1-related condition

Interestingly, this one is a particular case: the gene symbol is PKD1, and the name comes from polycystic kidney disease 1, which possibly hints at the disease. I wouldn't assume it has to be that disease alone, and thus I would advise again that "not doing something" is better than "doing something wrong".

tcezard commented 2 months ago

Thank you @M-casado. What you're saying is concordant with our conclusions. Our current course of action is to

We might actively filter those traits out in the future, and OT will investigate whether these associations can be used anywhere.

I've also changed cftr-related disorders to SKIP.
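
For reference, filtering out these traits could be sketched roughly as follows. This is only a minimal illustration, not existing CMAT code: the pattern, the function names `is_gene_related_placeholder` and `split_traits`, and the list of generic nouns are all assumptions made for the example. It also does not check that the first token is a real gene symbol, so a production filter would likely need to validate it against a gene list.

```python
import re

# Hypothetical pattern (not taken from CMAT): a single gene-symbol-like token,
# followed by "-related" or "-associated" and a generic noun such as
# "condition" or "disorder(s)". The token is NOT validated as a real gene.
GENE_RELATED_PATTERN = re.compile(
    r"^[a-z0-9][a-z0-9-]*[- ](?:related|associated) "
    r"(?:condition|disorder|disease)s?$",
    re.IGNORECASE,
)


def is_gene_related_placeholder(trait_name: str) -> bool:
    """Return True if the trait name looks like '<gene>-related condition'."""
    return bool(GENE_RELATED_PATTERN.match(trait_name.strip()))


def split_traits(trait_names):
    """Split trait names into (kept, skipped) lists, preserving input order."""
    kept, skipped = [], []
    for name in trait_names:
        if is_gene_related_placeholder(name):
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped


if __name__ == "__main__":
    kept, skipped = split_traits([
        "ttn-related condition",
        "cftr-related disorders",
        "TTN-related myopathy",
    ])
    print("kept:", kept)
    print("skipped:", skipped)
```

Note that a specific term such as TTN-related myopathy is deliberately kept, since only the generic "condition/disorder/disease" labels are the ones discussed in this thread.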

apriltuesday commented 2 months ago

Thanks all, I'll email ClinVar about these, but for this round we'll ignore them.

apriltuesday commented 2 months ago

Export done and EFO issue created.