EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Filter generic gene-related condition terms from manual curation and evidence string generation #435

Closed apriltuesday closed 2 months ago

apriltuesday commented 4 months ago

Recent submissions to ClinVar have included a large number of gene-related conditions as trait names (e.g. TTN-related condition). These are less informative for our purposes, not likely to be mappable to good EFO terms and time-consuming for curators to sift through. As 99% of targets associated with these terms are already covered by other ClinVar records, we've decided to filter them out as uninformative.

Tasks:

apriltuesday commented 3 months ago

I'm not sure this requires a notebook to review, so here's what I've done.

First I confirmed that for the large recent submission, gene-related condition terms have been replaced with gene-related disorder terms. There's still a question about whether we should do anything about the "condition" terms, but for now I used the following regex: ^\S+-related disorder$
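To make the behaviour of that regex concrete, here is a minimal sketch in Python. Only the pattern comes from the comment above; the helper name `is_gene_related_disorder` and the example strings are illustrative assumptions, not code from CMAT.

```python
import re

# Pattern quoted in the comment above. The helper below is hypothetical.
GENE_RELATED_DISORDER = re.compile(r"^\S+-related disorder$")


def is_gene_related_disorder(trait_name: str) -> bool:
    """Return True if the whole trait name is '<token>-related disorder'."""
    return GENE_RELATED_DISORDER.match(trait_name) is not None


# Illustrative trait names (not taken from a real ClinVar release).
examples = [
    "TTN-related disorder",       # matches
    "TTN-related condition",      # "condition" terms are not covered
    "LMNA-related disorders",     # plural does not match because of the $ anchor
    "Autosomal TTN-related disorder",  # \S+ cannot span the space
    "TTN-Related Disorder",       # the pattern is case sensitive
]

for name in examples:
    print(f"{name!r}: {is_gene_related_disorder(name)}")
```

Note that the pattern is anchored and case sensitive, so plural forms, prefixed names, and the older "condition" wording all fall outside it.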

In the most recent ClinVar release, there are 122,432 records with preferred trait name matching this pattern but only 5,788 unique trait names. I think this makes sense given how broad the trait is (i.e. associated with many variants).
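The two figures above (matching records versus unique trait names) can be computed along these lines. This is a sketch only: it assumes the preferred trait names have already been extracted into a plain list of strings, the function name `count_matching` is hypothetical, and the toy list is invented for illustration.

```python
import re
from collections import Counter

# Same pattern as quoted earlier in this thread.
PATTERN = re.compile(r"^\S+-related disorder$")


def count_matching(trait_names):
    """Return (number of matching records, number of unique matching names)."""
    counts = Counter(name for name in trait_names if PATTERN.match(name))
    return sum(counts.values()), len(counts)


# Toy input: one preferred trait name per record (invented data).
toy_names = [
    "TTN-related disorder",
    "TTN-related disorder",
    "TTN-related disorder",
    "CBL-related disorder",
    "Rasopathy",
    "TTN-related condition",
]

records, unique_names = count_matching(toy_names)
print(records, unique_names)  # 4 2
```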

Of these trait names, only 1.4% have a MedGen ID within ClinVar, and only 0.1% have an exact EFO match. Here are all the EFO terms:

| Trait name | EFO term |
| --- | --- |
| CBL-related disorder | http://purl.obolibrary.org/obo/MONDO_0013308 |
| CLCN4-related disorder | http://www.ebi.ac.uk/efo/EFO_0009066 |
| ATP6AP2-related disorder | http://purl.obolibrary.org/obo/MONDO_0100146 |
| STAG1-related disorder | http://www.ebi.ac.uk/efo/EFO_0009078 |
| COL4A1-related disorder | http://purl.obolibrary.org/obo/MONDO_0800461 |
| DKC1-related disorder | http://purl.obolibrary.org/obo/MONDO_0100152 |
| CTSC-related disorder | http://purl.obolibrary.org/obo/MONDO_0800465 |

Some of these may be of debatable utility in EFO, but several do look legitimate, so I'm not sure about the decision to exclude this pattern entirely.

@tcezard Any thoughts? I was thinking of checking whether the variants involved in these records are associated with more specific traits as well (OT reports 99% of gene targets are covered by other evidence, but I don't think we know about variants). Is it worth doing this or should we be thinking of other strategies?

apriltuesday commented 3 months ago

In case it's useful, I went ahead and checked whether variants in these "gene-related disorder" records are associated with other traits. For simplicity, I identified variants by VCV, which is ClinVar's variant identifier; this might not be 1:1 with chr_pos_ref_alt but it shouldn't matter too much for these counts.
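A rough sketch of that check, assuming the records are available as (VCV, trait name) pairs. The function name `split_by_other_traits` and the VCV accessions in the toy data are placeholders invented for illustration, not real ClinVar identifiers or CMAT code.

```python
import re
from collections import defaultdict

# Same pattern as quoted earlier in this thread.
PATTERN = re.compile(r"^\S+-related disorder$")


def split_by_other_traits(records):
    """Split VCVs that have a gene-related disorder trait into two sets:
    those with only such traits, and those also linked to other traits."""
    traits_by_vcv = defaultdict(set)
    for vcv, trait in records:
        traits_by_vcv[vcv].add(trait)

    only_generic, with_other = set(), set()
    for vcv, traits in traits_by_vcv.items():
        generic = {t for t in traits if PATTERN.match(t)}
        if not generic:
            continue  # this variant has no gene-related disorder record
        if traits - generic:
            with_other.add(vcv)
        else:
            only_generic.add(vcv)
    return only_generic, with_other


# Toy (VCV, trait) pairs; the accessions are made-up placeholders.
toy_records = [
    ("VCV_A", "CBL-related disorder"),
    ("VCV_A", "Rasopathy"),
    ("VCV_B", "TTN-related disorder"),
    ("VCV_C", "Rasopathy"),
]

only_generic, with_other = split_by_other_traits(toy_records)
print(sorted(only_generic), sorted(with_other))  # ['VCV_B'] ['VCV_A']
```

Variants in the first set would lose all evidence if the pattern were filtered out. This grouping says nothing about whether the other traits are more specific than the gene-related one.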

So while target genes might be overwhelmingly covered by other evidence, this is certainly not true for variants. Furthermore, when variants are associated with multiple traits, the other traits won't always be more specific than the gene-related disorder trait. Some examples can be found in this spreadsheet, which is filtered to include only VCVs associated with one of the EFO-mapped traits listed above. This keeps the spreadsheet size manageable and makes it possible to compare terms within the EFO hierarchy (e.g. CBL-related disorder vs. rasopathy).

tcezard commented 3 months ago

At this point the options are: