biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

investigate why BTE doesn't retrieve variant-disease relations from clinvar #548

Closed andrewsu closed 10 months ago

andrewsu commented 1 year ago

Clinvar contains relationships between genetic variants and diseases (e.g., BRAF V600E -> melanoma), and that relationship appears to be captured in myvariant.info (e.g., http://myvariant.info/v1/variant/rs121913377). But I can't get this relationship via BTE when querying using any of these identifiers:

Note the DBSNP query gets results based on CIViC and Disgenet, but not clinvar. It appears that the clinvar fields are captured in the myvariant.info smartAPI annotation, but I can't quite figure out why those results aren't being captured.

TRAPI Query template

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["CLINVAR:362948"],
          "categories": ["biolink:SequenceVariant"]
        },
        "n1": {
          "categories": ["biolink:DiseaseOrPhenotypicFeature"]
        }
      },
      "edges": {
        "t_edge": {
          "object": "n1",
          "subject": "n0"
        }
      }
    }
  }
}
```

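For trying the same one-hop query with different identifiers, the template can be generated rather than hand-edited. This is only a minimal sketch: the helper name `build_one_hop_query` is made up for illustration, and it just reproduces the shape of the template in this comment.

```python
import json


def build_one_hop_query(subject_id, subject_category, object_category):
    """Return a one-hop TRAPI query shaped like the template in this comment.

    Hypothetical helper, not part of BTE.
    """
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [subject_id], "categories": [subject_category]},
                    "n1": {"categories": [object_category]},
                },
                "edges": {"t_edge": {"subject": "n0", "object": "n1"}},
            }
        }
    }


query = build_one_hop_query(
    "CLINVAR:362948",
    "biolink:SequenceVariant",
    "biolink:DiseaseOrPhenotypicFeature",
)
print(json.dumps(query, indent=2))
```

Swapping the first argument for another variant identifier gives the equivalent query for that ID.
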
rjawesome commented 1 year ago

Looks like the SmartAPI annotation for myvariant.info clinvar uses the omim id to identify the disease. If you look at the myvariant.info link you sent, the diseases lack an omim id but do have other id types (e.g. mondo) which are not used by the SmartAPI annotation. Looking at the melanoma result from the myvariant query you sent above:

{
   "accession":"RCV000442563",
   "clinical_significance":"Likely pathogenic",
   "conditions":{
      "identifiers":{
         "human_phenotype_ontology":"HP:0007474",
         "medgen":"C0025202",
         "mesh":"D008545",
         "mondo":"MONDO:0005105"
      },
      "name":"Melanoma"
   },
   ...
}

Meanwhile, if you test other clinvar relations, they seem to work on BTE (e.g. the example DBSNP:rs1193171808 -> OMIM:615592 given in the SmartAPI annotation).
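To make the gap concrete, here is a rough sketch of the check described above. It assumes the annotation only maps the `omim` field (as observed in this comment); `MAPPED_NAMESPACES` and `mapped_disease_ids` are hypothetical names, not anything BTE defines, and the record is trimmed to the fields shown above.

```python
# Trimmed copy of the melanoma RCV entry quoted above.
melanoma_rcv = {
    "accession": "RCV000442563",
    "clinical_significance": "Likely pathogenic",
    "conditions": {
        "identifiers": {
            "human_phenotype_ontology": "HP:0007474",
            "medgen": "C0025202",
            "mesh": "D008545",
            "mondo": "MONDO:0005105",
        },
        "name": "Melanoma",
    },
}

# Assumption: only the omim field is mapped by the annotation at this point.
MAPPED_NAMESPACES = {"omim"}


def mapped_disease_ids(rcv, mapped_namespaces):
    """Return the disease identifiers an annotation could pick up from one RCV entry."""
    identifiers = rcv.get("conditions", {}).get("identifiers", {})
    return {ns: value for ns, value in identifiers.items() if ns in mapped_namespaces}


print(mapped_disease_ids(melanoma_rcv, MAPPED_NAMESPACES))
print(mapped_disease_ids(melanoma_rcv, MAPPED_NAMESPACES | {"mondo"}))
```

With only `omim` mapped, nothing is returned for this entry; adding `mondo` to the set would surface `MONDO:0005105`.
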

andrewsu commented 1 year ago

Thanks @rjawesome for this careful diagnosis. Makes sense! So in addition to the OMIM mapping in the SmartAPI annotation in the x-bte-response-mapping section, can we also add additional mappings for HPO, MESH, and MONDO so the original BRAF V600E -> melanoma example would also be retrieved by BTE?

rjawesome commented 1 year ago

It seems mesh and mondo are not indexed by myvariant so I don't know if that is queryable. Right now I have made a pull request to add HPO.

colleenXu commented 1 year ago

This can be partially addressed by adding more x-bte annotation (+ indexing fields if needed). However, this kind of "multiple prefixes/namespaces" issue is related to #656.

colleenXu commented 11 months ago

Notes on the current situation

Added and deployed orphanet / hp operations. All operations passed manual testing, including clinvar-gene-phenoHP-rev and clinvar-variant-phenoHP-rev (affected by https://github.com/biothings/biothings_explorer/issues/756#issue-1969837588, which I described in the 2nd section of that comment).

However, this didn't address the original issue, because MyVariant's clinvar rcv entries for DBSNP:rs121913377 seem to use an HPO ID for melanoma that is wrong or outdated: HP:0007474. Those entries don't have omim / orphanet fields either. (example in Rohan's previous comment).

more examples of strange IDs (HPO, Orphanet, MedGen)

I noticed multiple kinds of clinvar rcv disease IDs that seemed to be wrong / outdated, but I haven't checked [clinvar](https://www.ncbi.nlm.nih.gov/clinvar/) or OLS to see if the IDs are also wrong there (vs something going on in MyVariant parsing?).

HPO:

* [`HP:0007474` for melanoma](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0007474%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0007474&navFilter=all) brings up `HP:0002861` for melanoma instead)
* [`HP:0200130` for Primary dilated cardiomyopathy (DCM)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0200130%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0200130&navFilter=all) brings up `HP:0001644` for dilated cardiomyopathy instead)
* [`HP:0005503` for Hemolytic anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005503%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0005503&navFilter=all) brings up `HP:0001878` instead)
* [`HP:0005509` for Anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005509%22&fields=clinvar,dbsnp.rsid) ([HPO search](https://hpo.jax.org/app/browse/search?q=HP:0005509&navFilter=all) brings up `HP:0001903` instead)

Orphanet:

* [ORPHANET:8378 for Autosomal recessive polycystic kidney disease (ARPKD)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:8378&fields=clinvar,dbsnp.rsid). Instead it looked like the ID should be [ORPHANET:731](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=731) and `8378` is the GARD ID?
* [ORPHANET:178330 for Heinz body anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:178330&fields=clinvar,dbsnp.rsid). The OLS search shows that [this ID was obsoleted](https://www.ebi.ac.uk/ols4/ontologies/ordo/classes?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_178330).

MedGen:

* [CN517202: name field says "not provided"](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202&fields=clinvar,dbsnp.rsid). [Able to find in MedGen as an outdated term for "not provided"](https://www.ncbi.nlm.nih.gov/medgen/?term=CN517202) (would want to filter out during queries?)
* some IDs seem to be UMLS rather than MedGen (which are numeric):
  * [C0025202 for Melanoma](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0025202&fields=clinvar,dbsnp.rsid): entry in MedGen shows it's a UMLS ID and the MedGen UID is `9944`
  * [C0005283 for beta Thalassemia (BTHAL)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0005283&fields=clinvar,dbsnp.rsid): entry in MedGen shows it's a UMLS ID and the MedGen UID is `2611`

Another issue is that this set of operations (omim, orphanet, hp) only covers 48% of the dataset (1038239 / 2162597)
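For reference, the coverage figure works out as follows. This is just the arithmetic on the two counts quoted above; the variable names are mine.

```python
covered = 1038239   # entries reachable via the omim / orphanet / hp operations
total = 2162597     # entries in the dataset

coverage_pct = round(covered / total * 100, 1)
uncovered = total - covered

print(f"covered: {coverage_pct}% ; not covered: {uncovered} entries")
```

So a little over half of the dataset (1124358 entries) is still out of reach with these three namespaces alone.
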

Possible next steps

colleenXu commented 10 months ago

The mondo/mesh namespaces have now been indexed https://github.com/biothings/myvariant.info/issues/175#issuecomment-1850837835 and I added x-bte operations to cover them https://github.com/NCATS-Tangerine/translator-api-registry/commit/d4228a74dcba283840efe66c8aaefe3de956ac85

Now:

original query and current response

send a POST request to the api-specific endpoint, MyVariant only. Like http://localhost:3000/v1/smartapi/09c8782d9f4027712e65b95424adba79/query.

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["DBSNP:rs121913377"],
          "categories": ["biolink:SequenceVariant"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

Response will have this edge from clinvar connecting BRAF V600E to melanoma (MONDO:0005105).

```
"2db8263141031ac84d9fea9c457ebba6": {
  "predicate": "biolink:related_to",
  "subject": "DBSNP:rs121913377",
  "object": "MONDO:0005105",
  "attributes": [],
  "sources": [
    {
      "resource_id": "infores:clinvar",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:myvariant-info",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:clinvar"]
    },
    {
      "resource_id": "infores:service-provider-trapi",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:myvariant-info"]
    }
  ]
},
```
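When checking a larger response by hand, it can help to pull out only the edges whose primary knowledge source is clinvar. A small sketch, assuming the edges have already been parsed into a dict keyed by edge ID in the shape shown in this comment; `edges_with_primary_source` is a hypothetical helper, not a BTE function.

```python
# The clinvar edge from the response in this comment, as a Python dict.
edges = {
    "2db8263141031ac84d9fea9c457ebba6": {
        "predicate": "biolink:related_to",
        "subject": "DBSNP:rs121913377",
        "object": "MONDO:0005105",
        "attributes": [],
        "sources": [
            {"resource_id": "infores:clinvar",
             "resource_role": "primary_knowledge_source"},
            {"resource_id": "infores:myvariant-info",
             "resource_role": "aggregator_knowledge_source",
             "upstream_resource_ids": ["infores:clinvar"]},
            {"resource_id": "infores:service-provider-trapi",
             "resource_role": "aggregator_knowledge_source",
             "upstream_resource_ids": ["infores:myvariant-info"]},
        ],
    },
}


def edges_with_primary_source(edges, infores_id):
    """Return IDs of edges whose primary_knowledge_source is infores_id."""
    matches = []
    for edge_id, edge in edges.items():
        for source in edge.get("sources", []):
            if (source.get("resource_role") == "primary_knowledge_source"
                    and source.get("resource_id") == infores_id):
                matches.append(edge_id)
                break  # one matching source is enough for this edge
    return matches


clinvar_edges = edges_with_primary_source(edges, "infores:clinvar")
print([(edges[e]["subject"], edges[e]["object"]) for e in clinvar_edges])
```

Aggregator entries are deliberately ignored, so an edge that merely passes through MyVariant is not counted as coming from it.
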


Last thing to do before closing this issue is to investigate the odd IDs (from the previous post)

colleenXu commented 10 months ago

On MyVariant clinvar data's melanoma identifiers:

I think the identifier set is the same between records (variants)

```
{
  "identifiers": {
    "human_phenotype_ontology": "HP:0007474",
    "medgen": "C0025202",
    "mesh": "D008545",
    "mondo": "MONDO:0005105"
  },
  "name": "Melanoma"
}
```

I was concerned with the HP ID HP:0007474 (which SRI Node Norm didn't recognize) and the MedGen ID C0025202, which seems to be a UMLS ID. While I'd need to find the exact line(s) in the clinvar data file to confirm what is happening, I think these "odd" IDs come from the Clinvar data file and aren't necessarily "wrong".

What I found

* I didn't find the melanoma IDs in the [clinvar page for BRAF V600E](https://www.ncbi.nlm.nih.gov/clinvar/variation/376069/#id_second). However, after clicking on the "Conditions" tab (near "Variant details" and "Gene(s)"), I got to the "Variation/condition record" page [RCV000442563.1](https://www.ncbi.nlm.nih.gov/clinvar/RCV000442563.1/), where the melanoma IDs are:

  > MONDO: MONDO:0005105; MeSH: D008545; MedGen: [C0025202](https://www.ncbi.nlm.nih.gov/medgen/C0025202); Human Phenotype Ontology: [HP:0002861](http://www.human-phenotype-ontology.org/hpoweb/showterm?id=HP:0002861)

* That's the same MedGen ID as in MyVariant...so maybe the original clinvar data file is also using this ID. But it's still confusing because that [MedGen ID's page](https://www.ncbi.nlm.nih.gov/medgen/C0025202) says `C0025202` is the UMLS concept ID, vs the `MedGen UID: 9944`
* The HP ID is different from the one in MyVariant: `HP:0002861`. This seems to be the proper HP ID for melanoma.
* However, when I [look up this ID in BioPortal](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0002861), I see the ID in MyVariant, `HP:0007474`, as an alternative ID. But I don't know what an "alternative ID" means (perhaps that it's deprecated and shouldn't be used anymore?).
* So...maybe the original clinvar data file uses this alternative ID? (Is it possible that the clinvar data file did change at some point to use the proper ID and MyVariant didn't recognize/incorporate the change?)


All the other "odd" HP IDs I saw in MyVariant's clinvar data are also alternative IDs

MyVariant is using:

* [`HP:0200130` for Primary dilated cardiomyopathy (DCM)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0200130%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001644` / dilated cardiomyopathy](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001644)
* [`HP:0005503` for Hemolytic anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005503%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001878`](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001878)
* [`HP:0005509` for Anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.human_phenotype_ontology:%22HP:0005509%22&fields=clinvar,dbsnp.rsid) -> this is an alternative ID for [`HP:0001903`](https://purl.bioontology.org/ontology/HP?conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FHP_0001903)

I wonder if MyVariant can map these alternative IDs to their proper/main IDs, and use the proper/main IDs instead...
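As a rough illustration of what that mapping step could look like: the table below holds only the four alternative/primary pairs found in this thread, and a real parser would presumably load the full set from an HPO release rather than hard-code it. `HP_ALTERNATIVE_TO_PRIMARY` and `to_primary_hp_id` are hypothetical names, not existing MyVariant code.

```python
# Alternative HP ID -> primary HP ID, limited to the pairs identified in this thread.
HP_ALTERNATIVE_TO_PRIMARY = {
    "HP:0007474": "HP:0002861",  # melanoma
    "HP:0200130": "HP:0001644",  # dilated cardiomyopathy
    "HP:0005503": "HP:0001878",  # hemolytic anemia
    "HP:0005509": "HP:0001903",  # anemia
}


def to_primary_hp_id(hp_id):
    """Return the primary HP ID for a known alternative ID, else the ID unchanged."""
    return HP_ALTERNATIVE_TO_PRIMARY.get(hp_id, hp_id)


for hp_id in ("HP:0007474", "HP:0002861"):
    print(hp_id, "->", to_primary_hp_id(hp_id))
```

IDs that are already primary (or simply unknown to the table) pass through untouched, so the step is safe to apply to every record.
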

colleenXu commented 10 months ago

Regarding the "odd" orphanet IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. But I wonder if it's possible to keep > 1 ID for a namespace, in cases where the clinvar data may provide multiple (see Example 1 where clinvar probably provides 2 IDs and one is correct).

Examples

**Example 1**: MyVariant is using `orphanet:8378` for Autosomal recessive polycystic kidney disease (ARPKD).

* This [MyVariant record for rs1201981092 and ARPKD](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:8378%20AND%20dbsnp.rsid:rs1201981092&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV002260544.1/), where there's **two** orphanet IDs: 731 (which is [correct](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=731)) and 8378 (which is wrong, and maybe the GARD ID instead).

**Example 2**: MyVariant is using `orphanet:178330` for Heinz body anemia.

* These [MyVariant records for rs41323248 and Heinz body anemia](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.orphanet:178330%20AND%20dbsnp.rsid:rs41323248&fields=clinvar,dbsnp.rsid) match these [two](https://www.ncbi.nlm.nih.gov/clinvar/RCV001420422.2/) [RCV](https://www.ncbi.nlm.nih.gov/clinvar/RCV001420421.2/) records. For both, the orphanet ID is 178330...even though the [orphanet website](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=178330) shows that this ID has been obsoleted, and perhaps ["rare hemolytic anemia"](https://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=EN&Expert=98363) should be used instead.
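A sketch of the "keep more than one ID per namespace" idea, using Example 1. I'm assuming the parser sees cross-references as `(namespace, identifier)` pairs; that input shape and the name `group_ids_by_namespace` are guesses for illustration, not how the MyVariant parser is actually written.

```python
from collections import defaultdict


def group_ids_by_namespace(xrefs):
    """Collect every identifier per namespace instead of keeping only one."""
    grouped = defaultdict(list)
    for namespace, identifier in xrefs:
        if identifier not in grouped[namespace]:  # skip exact duplicates
            grouped[namespace].append(identifier)
    return dict(grouped)


# Example 1: the RCV record lists two orphanet IDs for ARPKD.
arpkd_xrefs = [("orphanet", "731"), ("orphanet", "8378")]
arpkd_ids = group_ids_by_namespace(arpkd_xrefs)
print(arpkd_ids)
```

Keeping both values would at least let a query on the correct ID (731) match, even if the questionable one (8378) is carried along.
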

colleenXu commented 10 months ago

Finally, on the "odd" medgen IDs I saw in MyVariant's clinvar data...they probably come from the original clinvar data. BTE isn't using medgen namespaces because Translator doesn't seem to support it yet (biolink-model, node norm).

But I wonder:

Example 1: medgen CN517202 for "not provided"

MyVariant is using medgen `CN517202` for the condition "not provided" [in >800k records](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202&fields=clinvar,dbsnp.rsid). This MedGen ID seems to be [outdated](https://www.ncbi.nlm.nih.gov/medgen/?term=CN517202), replaced by C3661900.

* This [MyVariant record for CLINVAR:686861 and "not provided"](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202%20AND%20clinvar.variant_id:686861&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV000849437/), which also uses `CN517202`.
* But there's lots of cases where the RCV record is using the new ID C3661900 but MyVariant's info is using the outdated `CN517202`. For example, compare this [MyVariant record](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:CN517202%20AND%20dbsnp.rsid:rs1388158885&fields=clinvar,dbsnp.rsid) to [one of its RCV pages](https://www.ncbi.nlm.nih.gov/clinvar/RCV001693296/).
* MyVariant isn't using the new ID C3661900 ([no matching records](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C3661900&fields=clinvar,dbsnp.rsid))
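If these placeholder conditions should be dropped on the consuming side, a filter along these lines might do. It is only a sketch: `PLACEHOLDER_MEDGEN_IDS` and `drop_placeholder_conditions` are hypothetical names, and the "not provided" sample is a simplified stand-in shaped like the conditions objects quoted earlier in this thread.

```python
# Both IDs are described in this comment as meaning "not provided".
PLACEHOLDER_MEDGEN_IDS = {"CN517202", "C3661900"}


def drop_placeholder_conditions(conditions):
    """Keep only conditions whose medgen ID is not a known placeholder."""
    kept = []
    for condition in conditions:
        medgen_id = condition.get("identifiers", {}).get("medgen")
        if medgen_id not in PLACEHOLDER_MEDGEN_IDS:
            kept.append(condition)
    return kept


sample_conditions = [
    {"identifiers": {"medgen": "CN517202"}, "name": "not provided"},
    {"identifiers": {"medgen": "C0025202", "mondo": "MONDO:0005105"}, "name": "Melanoma"},
]
print([c["name"] for c in drop_placeholder_conditions(sample_conditions)])
```

Conditions without any medgen ID are kept, since their absence says nothing about whether the condition is a placeholder.
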

Example 2: MyVariant is using medgen C0005283 for beta Thalassemia (BTHAL)

Found the same situation as above with the [melanoma medgen ID](https://github.com/biothings/biothings_explorer/issues/548#issuecomment-1884345434). The [MyVariant record for rs1847557333 and beta Thalassemia (BTHAL)](https://myvariant.info/v1/query?q=clinvar.rcv.conditions.identifiers.medgen:C0005283%20AND%20clinvar.rsid:rs1847557333&fields=clinvar,dbsnp.rsid) matches this [RCV record](https://www.ncbi.nlm.nih.gov/clinvar/RCV001078308/), which is using the same MedGen ID. So maybe the original clinvar data file is also using this ID. But it's still confusing because that [MedGen ID's page](https://www.ncbi.nlm.nih.gov/medgen/C0005283) says C0005283 is the UMLS concept ID, vs the MedGen UID: 2611.

colleenXu commented 10 months ago

So to summarize my ideas after the MyVariant clinvar disease ID analyses I did (above posts)...

I suspect the "odd" IDs are coming from the original clinvar ingest data.

But I wonder if there are parser changes that could help:

colleenXu commented 10 months ago

Closing this issue because the original problem has been addressed with mondo/mesh namespace coverage.


As for the "odd MyVariant clinvar disease IDs" (summary in previous post):

(also see lab Slack convo here)