apriltuesday closed 2 months ago
I'm not sure this requires a notebook to review, so here's what I've done.
First, I confirmed that in the large recent submission, gene-related condition terms have been replaced with gene-related disorder terms. There's still a question about whether we should do anything about the "condition" terms, but for now I used the following regex: `^\S+-related disorder$`
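For reference, here is a minimal sketch of how that regex behaves on a few trait names. The function name and the example list are illustrative assumptions, not code from the pipeline.

```python
import re

# Pattern quoted above; \S+ stands in for the gene symbol (any run of non-whitespace).
GENE_RELATED_DISORDER = re.compile(r'^\S+-related disorder$')


def is_gene_related_disorder(trait_name: str) -> bool:
    """Return True if the name has the form '<token>-related disorder'."""
    return GENE_RELATED_DISORDER.match(trait_name) is not None


# Illustrative names only (hypothetical input, not pulled from a ClinVar release).
examples = [
    'TTN-related disorder',
    'CBL-related disorder',
    'TTN-related condition',
    'Rasopathy',
]
for name in examples:
    print(f'{name}: {is_gene_related_disorder(name)}')
```

Note that the "condition" form is deliberately not matched by this pattern, which is the open question mentioned above.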
In the most recent ClinVar release, there are 122,432 records with preferred trait name matching this pattern but only 5,788 unique trait names. I think this makes sense given how broad the trait is (i.e. associated with many variants).
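The two counts above (matching records vs. unique matching names) could be obtained along these lines. This is only a sketch: it assumes the preferred trait names have already been extracted into a list with one entry per record, and `summarise_matches` is a hypothetical helper.

```python
import re
from collections import Counter

PATTERN = re.compile(r'^\S+-related disorder$')


def summarise_matches(preferred_trait_names):
    """Return (number of matching records, number of distinct matching names).

    Assumes one preferred trait name per record in the input iterable.
    """
    counts = Counter(name for name in preferred_trait_names if PATTERN.match(name))
    return sum(counts.values()), len(counts)


# Toy input standing in for the real release.
names = [
    'CBL-related disorder',
    'CBL-related disorder',
    'TTN-related disorder',
    'Rasopathy',
]
n_records, n_unique = summarise_matches(names)
print(f'{n_records} matching records, {n_unique} unique trait names')
```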
Of these trait names, only 1.4% have a MedGen ID within ClinVar, and only 0.1% have an exact EFO match. Here are all the EFO terms:
trait name | EFO |
---|---|
CBL-related disorder | http://purl.obolibrary.org/obo/MONDO_0013308 |
CLCN4-related disorder | http://www.ebi.ac.uk/efo/EFO_0009066 |
ATP6AP2-related disorder | http://purl.obolibrary.org/obo/MONDO_0100146 |
STAG1-related disorder | http://www.ebi.ac.uk/efo/EFO_0009078 |
COL4A1-related disorder | http://purl.obolibrary.org/obo/MONDO_0800461 |
DKC1-related disorder | http://purl.obolibrary.org/obo/MONDO_0100152 |
CTSC-related disorder | http://purl.obolibrary.org/obo/MONDO_0800465 |
Some of these may be of debatable utility in EFO, but several do look legitimate, so I'm not sure about the decision to exclude this pattern entirely.
@tcezard Any thoughts? I was thinking of checking whether the variants involved in these records are associated with more specific traits as well (OT reports 99% of gene targets are covered by other evidence, but I don't think we know about variants). Is it worth doing this or should we be thinking of other strategies?
In case it's useful, I went ahead and checked whether variants in these "gene-related disorder" records are associated with other traits. For simplicity, I identified variants by VCV, which is ClinVar's variant identifier; this might not be 1:1 with `chr_pos_ref_alt` but it shouldn't matter too much for these counts.
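Roughly, the check groups traits by VCV and then asks whether each variant has any trait outside the pattern. The sketch below assumes the data is available as `(vcv, trait_name)` pairs; the function name and the accessions in the toy input are made up for illustration.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r'^\S+-related disorder$')


def split_variants_by_coverage(vcv_trait_pairs):
    """Split VCVs seen with a gene-related disorder trait into two sets:
    those with no other trait, and those that also have some other trait.
    """
    traits_by_vcv = defaultdict(set)
    for vcv, trait in vcv_trait_pairs:
        traits_by_vcv[vcv].add(trait)

    only_gene_related, also_other = set(), set()
    for vcv, traits in traits_by_vcv.items():
        matching = {t for t in traits if PATTERN.match(t)}
        if not matching:
            continue  # variant never appears in a gene-related disorder record
        if traits - matching:
            also_other.add(vcv)
        else:
            only_gene_related.add(vcv)
    return only_gene_related, also_other


# Toy input; the VCV accessions are invented placeholders.
pairs = [
    ('VCV000000001', 'CBL-related disorder'),
    ('VCV000000001', 'Rasopathy'),
    ('VCV000000002', 'TTN-related disorder'),
    ('VCV000000003', 'Rasopathy'),
]
only_gene_related, also_other = split_variants_by_coverage(pairs)
print(len(only_gene_related), 'variants with no other trait')
print(len(also_other), 'variants with at least one other trait')
```

This only says whether another trait exists, not whether it is more specific; that comparison still needs the EFO hierarchy.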
So while target genes might be overwhelmingly covered by other evidence, this is certainly not true for variants. Furthermore, when variants are associated with multiple traits, the other traits won't always be more specific than the gene-related disorder trait. Some examples can be found in this spreadsheet, which is filtered to include only VCVs associated with one of the EFO-mapped traits listed above, both to keep the spreadsheet size manageable and to make it possible to look at terms within the EFO hierarchy (e.g. CBL-related disorder vs. rasopathy).
At this point the options are:
- Exclude everything matching `^\S+-related disorder$`. This is the simplest method but will remove existing informative evidence strings associated with 7 traits
Recent submissions to ClinVar have included a large number of gene-related conditions as trait names (e.g. `TTN-related condition`). These are less informative for our purposes, not likely to be mappable to good EFO terms and time-consuming for curators to sift through. As 99% of targets associated with these terms are already covered by other ClinVar records, we've decided to filter them out as uninformative.

Tasks:
- (With `[0-9a-zA-Z]+-related .*`, are we removing informative trait names or large numbers of records from other submissions?)
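One way to explore that question is to list the names the broader pattern would remove that the narrower `^\S+-related disorder$` pattern would not. The sketch below is an assumption-laden illustration: it applies the broad pattern with `fullmatch` (the issue does not say how it would be applied), and the names are illustrative rather than taken from ClinVar.

```python
import re

NARROW = re.compile(r'^\S+-related disorder$')
BROAD = re.compile(r'[0-9a-zA-Z]+-related .*')


def removed_only_by_broad(trait_names):
    """Return sorted names matched by the broad pattern but not the narrow one."""
    return sorted(
        {n for n in trait_names if BROAD.fullmatch(n) and not NARROW.match(n)}
    )


# Illustrative names only.
names = [
    'TTN-related condition',
    'TTN-related disorder',
    'Age-related macular degeneration',
    'Rasopathy',
]
for name in removed_only_by_broad(names):
    print(name)
```

On this toy list the broad pattern also catches a name like "Age-related macular degeneration", which is the kind of informative trait name the task is asking about.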