Closed andrewsu closed 10 months ago
Looks like the SmartAPI annotation for myvariant.info clinvar uses the omim id to identify the disease. If you look at the myvariant.info link you sent, the diseases lack an omim id but do have other id types (e.g., mondo), which are not used by the SmartAPI annotation. Here is the melanoma result from the myvariant query you sent above:
{
"accession":"RCV000442563",
"clinical_significance":"Likely pathogenic",
"conditions":{
"identifiers":{
"human_phenotype_ontology":"HP:0007474",
"medgen":"C0025202",
"mesh":"D008545",
"mondo":"MONDO:0005105"
},
"name":"Melanoma"
},
...
}
Meanwhile, other clinvar relations do seem to work in BTE when you test them (e.g., the example DBSNP:rs1193171808 -> OMIM:615592 given in the SmartAPI annotation).
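To make the diagnosis concrete, here is a minimal Python sketch of an OMIM-only lookup against the melanoma record above. The `condition_curies` helper and the `PREFIXES` table are hypothetical illustrations, not BTE's or SmartAPI's actual response-mapping code. The point is only that a mapping restricted to `omim` yields nothing for this record, while one that also reads `mondo` or `mesh` would.

```python
# Hypothetical illustration; not BTE's actual response-mapping logic.
# Identifier block copied from the melanoma RCV record above.
melanoma = {
    "identifiers": {
        "human_phenotype_ontology": "HP:0007474",
        "medgen": "C0025202",
        "mesh": "D008545",
        "mondo": "MONDO:0005105",
    },
    "name": "Melanoma",
}

# Assumed CURIE prefix per MyVariant field. None means the stored value
# already carries its own prefix (as the HP and MONDO values above do).
PREFIXES = {
    "omim": "OMIM",
    "human_phenotype_ontology": None,
    "mesh": "MESH",
    "mondo": None,
}


def condition_curies(condition, fields):
    """Return CURIEs for the requested identifier fields that are present."""
    found = []
    identifiers = condition.get("identifiers", {})
    for field in fields:
        value = identifiers.get(field)
        if value is None:
            continue  # this record has no ID in that namespace
        prefix = PREFIXES.get(field)
        found.append(value if prefix is None else f"{prefix}:{value}")
    return found


omim_only = condition_curies(melanoma, ["omim"])
with_more = condition_curies(melanoma, ["omim", "mondo", "mesh"])
print(omim_only)  # [] because the record has no omim field
print(with_more)  # ['MONDO:0005105', 'MESH:D008545']
```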
Thanks @rjawesome for this careful diagnosis. Makes sense! So in addition to the OMIM mapping in the `x-bte-response-mapping` section of the SmartAPI annotation, can we also add mappings for HPO, MESH, and MONDO so the original BRAF V600E -> melanoma example would also be retrieved by BTE?
It seems mesh and mondo are not indexed by myvariant, so I don't know if they are queryable. For now I have made a pull request to add HPO.
This can be partially addressed by adding more x-bte annotation (+ indexing fields if needed). However, this kind of "multiple prefixes/namespaces" issue is related to #656.
Added and deployed orphanet / hp operations. All operations passed manual testing, including `clinvar-gene-phenoHP-rev` and `clinvar-variant-phenoHP-rev` (affected by https://github.com/biothings/biothings_explorer/issues/756#issue-1969837588, which I described in the 2nd section of that comment).
However, this didn't address the original issue, because MyVariant's clinvar rcv entries for DBSNP:rs121913377 seem to use an HPO ID for melanoma that is wrong or outdated: `HP:0007474`. Those entries don't have omim / orphanet fields either (example in Rohan's previous comment).
I noticed multiple kinds of clinvar rcv disease IDs that seemed to be wrong / outdated, but I haven't checked [clinvar](https://www.ncbi.nlm.nih.gov/clinvar/) or OLS to see if the IDs are also wrong there (vs something going on in MyVariant parsing?).

HPO:
* [`HP:0007474` for melanoma](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0007474%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0007474&navFilter=all) brings up `HP:0002861` for melanoma instead)
* [`HP:0200130` for Primary dilated cardiomyopathy (DCM)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0200130%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0200130&navFilter=all) brings up `HP:0001644` for dilated cardiomyopathy instead)
* [`HP:0005503` for Hemolytic anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005503%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0005503&navFilter=all) brings up `HP:0001878` instead)
* [`HP:0005509` for Anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005509%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0005509&navFilter=all) brings up `HP:0001903` instead)

Orphanet:
* [ORPHANET:8378 for Autosomal recessive polycystic kidney disease (ARPKD)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:8378&fields=clinvar,dbsnp.rsid). Instead it looked like the ID should be [ORPHANET:731](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=731) and `8378` is the GARD ID?
* [ORPHANET:178330 for Heinz body anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:178330&fields=clinvar,dbsnp.rsid). The OLS search shows that [this ID was obsoleted](https://www.ebi.ac.uk/ols4/ontologies/ordo/classes?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_178330).

MedGen:
* [CN517202: name field says "not provided"](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202&fields=clinvar,dbsnp.rsid). [Able to find in MedGen as an outdated term for "not provided"](https://www.ncbi.nlm.nih.gov/medgen/?term=CN517202) (would want to filter out during queries?)
* some IDs seem to be UMLS rather than MedGen (which are numeric):
  * [C0025202 for Melanoma](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0025202&fields=clinvar,dbsnp.rsid): entry in MedGen shows it's a UMLS ID and the MedGen UID is `9944`
  * [C0005283 for beta Thalassemia (BTHAL)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0005283&fields=clinvar,dbsnp.rsid): entry in MedGen shows it's a UMLS ID and the MedGen UID is `2611`
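The MyVariant lookups linked in this comment all follow one URL pattern, so they are easy to regenerate for other IDs. Here is a small Python sketch; the helper name `rcv_identifier_query` is made up for illustration, and the field path and `fields` value are simply copied from the links above.

```python
from urllib.parse import urlencode

BASE_URL = "https://myvariant.info/v1/query"


def rcv_identifier_query(namespace, identifier, quote=False):
    """Build a MyVariant query URL for one clinvar rcv condition identifier.

    Set quote=True for values that contain a colon themselves
    (e.g. HP:0007474), so the value is sent as a quoted phrase.
    """
    value = f'"{identifier}"' if quote else identifier
    q = f"clinvar.rcv.conditions.identifiers.{namespace}:{value}"
    # Keep ':' and ',' unescaped so the URL matches the hand-written links.
    params = urlencode({"q": q, "fields": "clinvar,dbsnp.rsid"}, safe=":,")
    return f"{BASE_URL}?{params}"


print(rcv_identifier_query("orphanet", "8378"))
print(rcv_identifier_query("human_phenotype_ontology", "HP:0007474", quote=True))
```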
Another issue is that this set of operations (omim, orphanet, hp) only covers 48% of the dataset (1038239 / 2162597)
`_exists_` queries don't work...so it's unclear how much they'll improve the situation.

The mondo/mesh namespaces have now been indexed (https://github.com/biothings/myvariant.info/issues/175#issuecomment-1850837835) and I added x-bte operations to cover them (https://github.com/NCATS-Tangerine/translator-api-registry/commit/d4228a74dcba283840efe66c8aaefe3de956ac85).
Now:
send a POST request to the api-specific endpoint, MyVariant only. Like http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query.

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["DBSNP:rs121913377"],
          "categories": ["biolink:SequenceVariant"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

Response will have this edge from clinvar connecting BRAF V600E to melanoma (MONDO:0005105).

```
"2db8263141031ac84d9fea9c457ebba6": {
  "predicate": "biolink:related_to",
  "subject": "DBSNP:rs121913377",
  "object": "MONDO:0005105",
  "attributes": [],
  "sources": [
    {
      "resource_id": "infores:clinvar",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:myvariant-info",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:clinvar"
      ]
    },
    {
      "resource_id": "infores:service-provider-trapi",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:myvariant-info"
      ]
    }
  ]
},
```
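For anyone who wants to repeat this check from a script, here is a rough Python sketch using only the standard library. It assumes a BTE instance is running locally at the endpoint above and that the response follows the usual TRAPI layout (`message.knowledge_graph.edges` as a dict keyed by edge ID). The helper names `build_query`, `post_query`, and `edges_to` are made up for this example.

```python
import json
import urllib.request

# Endpoint copied from the comment above; assumes a local BTE instance.
ENDPOINT = "http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query"


def build_query(variant_curie):
    """Build the one-hop SequenceVariant -> Disease TRAPI query shown above."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "ids": [variant_curie],
                        "categories": ["biolink:SequenceVariant"],
                    },
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {"e1": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def post_query(query, endpoint=ENDPOINT, timeout=300):
    """POST the query as JSON and return the parsed response body."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def edges_to(response, object_curie):
    """Return the knowledge-graph edges whose object is the given CURIE."""
    edges = response["message"]["knowledge_graph"]["edges"]
    return {key: edge for key, edge in edges.items() if edge["object"] == object_curie}
```

With the server up, `edges_to(post_query(build_query("DBSNP:rs121913377")), "MONDO:0005105")` should contain the clinvar edge shown above if the new operations are working.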
Last thing to do before closing this issue is to investigate the odd IDs (from the previous post)
On MyVariant clinvar data's melanoma identifiers:
```
{
  "identifiers": {
    "human_phenotype_ontology": "HP:0007474",
    "medgen": "C0025202",
    "mesh": "D008545",
    "mondo": "MONDO:0005105"
  },
  "name": "Melanoma"
}
```
I was concerned with the HP ID `HP:0007474` (which SRI Node Norm didn't recognize) and the MedGen ID `C0025202`, which seems to be a UMLS ID. While I'd need to find the exact line(s) in the clinvar data file to confirm what is happening, I think these "odd" IDs come from the Clinvar data file and aren't necessarily "wrong".
* I didn't find the melanoma IDs in the [clinvar page for BRAF V600E](https://www.ncbi.nlm.nih.gov/clinvar/variation/376069/#id_second). However, after clicking on the "Conditions" tab (near "Variant details" and "Gene(s)"), I got to the "Variation/condition record" page [RCV000442563.1](https://www.ncbi.nlm.nih.gov/clinvar/RCV000442563.1/), where the melanoma IDs are:
  > MONDO: MONDO:0005105; MeSH: D008545; MedGen: [C0025202](https://www.ncbi.nlm.nih.gov/medgen/C0025202); Human Phenotype Ontology: [HP:0002861](http://www.human-phenotype-ontology.org/hpoweb/showterm?id=HP:0002861)
* That's the same MedGen ID as in MyVariant...so maybe the original clinvar data file is also using this ID. But it's still confusing because that [MedGen ID's page](https://www.ncbi.nlm.nih.gov/medgen/C0025202) says `C0025202` is the UMLS concept ID, vs the `MedGen UID: 9944`
* The HP ID is different from the one in MyVariant: `HP:0002861`. This seems to be the proper HP ID for melanoma.
* However, when I [look up this ID in BioPortal](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0002861), I see the ID in MyVariant, `HP:0007474`, as an alternative ID. But I don't know what an "alternative ID" means (perhaps that it's deprecated and shouldn't be used anymore?).
* So...maybe the original clinvar data file uses this alternative ID? (Is it possible that the clinvar data file did change at some point to use the proper ID and MyVariant didn't recognize/incorporate the change?)
MyVariant is using:
* [`HP:0200130` for Primary dilated cardiomyopathy (DCM)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0200130%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001644` / dilated cardiomyopathy](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001644)
* [`HP:0005503` for Hemolytic anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005503%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001878`](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001878)
* [`HP:0005509` for Anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005509%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001903`](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001903)
I wonder if MyVariant can map these alternative IDs to their proper/main IDs, and use the proper/main IDs instead...
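As a rough idea of what that mapping could look like, here is a Python sketch. The table holds only the four alternative -> main HP ID pairs found in this thread. In a real parser it would presumably be built from the `alt_id` tags in the HPO OBO file, but that loading step is not shown. The function names are hypothetical and this is not MyVariant's parser code.

```python
# Alternative -> main HP IDs, limited to the pairs identified in this thread.
ALT_TO_PRIMARY_HP = {
    "HP:0007474": "HP:0002861",  # melanoma
    "HP:0200130": "HP:0001644",  # dilated cardiomyopathy
    "HP:0005503": "HP:0001878",  # hemolytic anemia
    "HP:0005509": "HP:0001903",  # anemia
}


def to_primary_hp(hp_id, alt_to_primary=ALT_TO_PRIMARY_HP):
    """Return the main HP ID for an alternative ID; pass other IDs through."""
    return alt_to_primary.get(hp_id, hp_id)


def normalize_condition(condition):
    """Return a copy of a condition with its HP ID replaced by the main ID."""
    identifiers = dict(condition.get("identifiers", {}))
    hp_id = identifiers.get("human_phenotype_ontology")
    if hp_id is not None:
        identifiers["human_phenotype_ontology"] = to_primary_hp(hp_id)
    return {**condition, "identifiers": identifiers}


melanoma = {
    "identifiers": {"human_phenotype_ontology": "HP:0007474", "mondo": "MONDO:0005105"},
    "name": "Melanoma",
}
print(normalize_condition(melanoma)["identifiers"]["human_phenotype_ontology"])  # HP:0002861
```

Returning a copy keeps the original ID available in case it should be stored alongside the main one.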
Regarding the "odd" orphanet IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. But I wonder if it's possible to keep > 1 ID for a namespace, in cases where the clinvar data may provide multiple (see Example 1 where clinvar probably provides 2 IDs and one is correct).
**Example 1**: MyVariant is using `orphanet:8378` for Autosomal recessive polycystic kidney disease (ARPKD).
* This [MyVariant record for rs1201981092 and ARPKD](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:8378%20AND%20dbsnp.rsid:rs1201981092&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV002260544.1/), where there's **two** orphanet IDs: 731 (which is [correct](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=731)) and 8378 (which is wrong, and maybe the GARD ID instead).

**Example 2**: MyVariant is using `orphanet:178330` for Heinz body anemia.
* These [MyVariant records for rs41323248 and Heinz body anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:178330%20AND%20dbsnp.rsid:rs41323248&fields=clinvar,dbsnp.rsid) match these [two](https://www.ncbi.nlm.nih.gov/clinvar/RCV001420422.2/) [RCV](https://www.ncbi.nlm.nih.gov/clinvar/RCV001420421.2/) records. For both, the orphanet ID is 178330...even though the [orphanet website](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=178330) shows that this ID has been obsoleted, and perhaps ["rare hemolytic anemia"](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=98363) should be used instead.
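To illustrate the "keep > 1 ID per namespace" idea with Example 1, here is a small Python sketch. The input shape (a list of `(namespace, id)` pairs) and the function name `collect_identifiers` are assumptions for illustration. I haven't looked at how the MyVariant clinvar parser actually reads these xrefs, so the single-value behaviour shown for contrast is only a guess at the mechanism.

```python
from collections import defaultdict


def collect_identifiers(xrefs):
    """Group (namespace, id) pairs into namespace -> list of unique IDs."""
    collected = defaultdict(list)
    for namespace, identifier in xrefs:
        if identifier not in collected[namespace]:
            collected[namespace].append(identifier)  # preserve first-seen order
    return dict(collected)


# Hypothetical xref pairs for the ARPKD record in Example 1,
# where the RCV page lists two orphanet IDs.
arpkd_xrefs = [("orphanet", "731"), ("orphanet", "8378")]

multi_valued = collect_identifiers(arpkd_xrefs)
single_valued = dict(arpkd_xrefs)  # one value per namespace: the last pair wins

print(multi_valued)   # {'orphanet': ['731', '8378']}
print(single_valued)  # {'orphanet': '8378'}
```

Keeping the list would let a downstream consumer (or node normalization) pick whichever ID it recognizes.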
Finally, on the "odd" medgen IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. BTE isn't using medgen namespaces because Translator doesn't seem to support it yet (biolink-model, node norm).
But I wonder:
MyVariant is using medgen `CN517202` for the condition "not provided" [in >800k records](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202&fields=clinvar,dbsnp.rsid). This MedGen ID seems to be [outdated](https://www.ncbi.nlm.nih.gov/medgen/?term=CN517202), replaced by C3661900.
* This [MyVariant record for CLINVAR:686861 and "not provided"](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202%20AND%20clinvar.variant_id:686861&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV000849437/), which also uses `CN517202`.
* But there's lots of cases where the RCV record is using the new ID C3661900 but MyVariant's info is using the outdated `CN517202`. For example, compare this [MyVariant record](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202%20AND%20dbsnp.rsid:rs1388158885&fields=clinvar,dbsnp.rsid) to [one of its RCV pages](https://www.ncbi.nlm.nih.gov/clinvar/RCV001693296/).
* MyVariant isn't using the new ID C3661900 ([no matching records](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C3661900&fields=clinvar,dbsnp.rsid))
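If these "not provided" conditions should be filtered out on the consumer side, something like the Python sketch below could work. It assumes `clinvar.rcv` and `conditions` may each be either a single object or a list, and it treats both the old (`CN517202`) and new (`C3661900`) MedGen IDs as "not provided". The function names are made up for this example.

```python
# Both MedGen IDs mentioned above for the "not provided" condition.
NOT_PROVIDED_MEDGEN = {"CN517202", "C3661900"}


def as_list(value):
    """Wrap a single object in a list; pass lists through; None -> []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def is_not_provided(condition):
    """True if the condition is the 'not provided' placeholder."""
    medgen = condition.get("identifiers", {}).get("medgen")
    name = condition.get("name", "")
    return medgen in NOT_PROVIDED_MEDGEN or name.strip().lower() == "not provided"


def informative_rcvs(rcv_entries):
    """Keep only RCV entries that have at least one real condition."""
    kept = []
    for rcv in as_list(rcv_entries):
        conditions = [c for c in as_list(rcv.get("conditions")) if not is_not_provided(c)]
        if conditions:
            kept.append({**rcv, "conditions": conditions})
    return kept


rcvs = [
    {"accession": "RCV000442563",
     "conditions": {"identifiers": {"mondo": "MONDO:0005105"}, "name": "Melanoma"}},
    {"accession": "RCV000849437",
     "conditions": {"identifiers": {"medgen": "CN517202"}, "name": "not provided"}},
]
print([rcv["accession"] for rcv in informative_rcvs(rcvs)])  # ['RCV000442563']
```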
Found the same situation as above with the [melanoma medgen ID](https://github.com/biothings/biothings_explorer/issues/548#issuecomment-1884345434). The [MyVariant record for rs1847557333 and beta Thalassemia (BTHAL)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0005283%20AND%20clinvar.rsid:rs1847557333&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV001078308/), which is using the same MedGen ID. So maybe the original clinvar data file is also using this ID. But it's still confusing because that [MedGen ID's page](https://www.ncbi.nlm.nih.gov/medgen/C0005283) says `C0005283` is the UMLS concept ID, vs the `MedGen UID: 2611`.
So to summarize my ideas after the MyVariant clinvar disease ID analyses I did (above posts)...
I suspect the "odd" IDs are coming from the original clinvar ingest data.
But I wonder if there are parser changes that could help:
Closing this issue because the original problem has been addressed with mondo/mesh namespace coverage.
As for the "odd MyVariant clinvar disease IDs" (summary in previous post):
Clinvar contains relationships between genetic variants and diseases (e.g., BRAF V600E -> melanoma), and that relationship appears to be captured in myvariant.info (e.g., http://myvariant.info/v1/variant/rs121913377). But I can't get this relationship via BTE when querying using any of these identifiers:
Note the DBSNP query gets results based on CIViC and Disgenet, but not clinvar. It appears that the clinvar fields are captured in the myvariant.info smartAPI annotation, but I can't quite figure out why those results aren't being captured.
TRAPI Query template
```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["CLINVAR:362948"],
          "categories": ["biolink:SequenceVariant"]
        },
        "n1": {
          "categories": ["biolink:DiseaseOrPhenotypicFeature"]
        }
      },
      "edges": {
        "t_edge": {
          "object": "n1",
          "subject": "n0"
        }
      }
    }
  }
}
```