biothings / myvariant.info

MyVariant.info: A BioThings API for human variant annotations
http://myvariant.info

API accepts invalid HGVS and rejects valid HGVS #109

Open reece opened 3 years ago

reece commented 3 years ago

First: Thank you. Making data easily available is a great service to the community!

Issue

MyVariant.info accepts invalid (MyVariant.info-specific) HGVS and does not accept valid HGVS. Since HGVS expressions are the most convenient keys for lookup, it would be very helpful for MyVariant.info to adopt more standard HGVS expressions.

Examples

# Example from http://myvariant.info/v1/api#/variant/get_variant__variantid_ 
snafu$ curl -s -H "accept: application/json" -X GET "https://myvariant.info/v1/variant/chr6%3Ag.152708291G%3EA" 
{"_id": "chr6:g.152708291G>A", "_version": 2, "cad ...

That reply includes several HGVS expressions. Let's use the GRCh37 expression as a query:

snafu$ curl -s -H "accept: application/json" -X GET "https://myvariant.info/v1/variant/NC_000006.11:g.152708291G>A"
{"code": 404, "success": false, "error": "ID 'NC_0 ...

Discussion

Using chr6 as reference sequence in an HGVS expression is specifically disallowed by the specification because it is ambiguous. Although the MyVariant.info API accepts an assembly query parameter, that parameter is not required. (P.S. Also, better to use the official GRCh names.) Therefore, MyVariant.info uses invalid HGVS for lookup, and that invalid HGVS creates ambiguity that does not exist with standards-compliant names.

Furthermore, MyVariant.info does not accept conventional HGVS expressions such as NC_000006.11:g.152708291G>A, NC_000006.12:g.152387156G>A, NG_012855.1:g.255244C>T, or NG_012855.2:g.255244C>T, all of which are returned by MyVariant.info when using the chr6-based identifier. As a user, it is surprising to receive HGVS identifiers that are not also accepted as HGVS search terms.

P.S. I'm willing to help. Host a hackathon and I'll lend a hand.

newgene commented 3 years ago

Hi @reece, thanks for bringing up this issue. What's missing for myvariant.info is a way to compute all equivalent HGVS names for a given (chr, pos, ref, alt) set. If such a tool is fast enough, we can integrate it at runtime; otherwise, we can pre-compute and index the names, so that we can enable queries using all valid HGVS names.

We are aware of this issue, but have not found a suitable tool yet. Do you have any suggestions?

I also know the ClinGen Allele Registry (e.g. this one) has all these HGVS names, but I'm not sure what tool is behind it.

newgene commented 3 years ago

Regarding the variant id MyVariant.info is using, we know it's imperfect. We have always hoped the community would come up with a way to define a canonical variant id. We have now started to evaluate the VRS data model from GA4GH (https://github.com/biothings/myvariant.info/issues/108), which looks like it should be a good solution.

reece commented 3 years ago

@newgene: My first recommendation would be to not use (chr,pos,ref,alt) at all because it's under-specified. Instead, I presume (without looking at the code) that replacing chr with an accession would be extremely easy, AND it would allow you to index alt assembly fragments, transcripts, and perhaps even protein sequence. To be clear, I mean that you should map expressions like (chr,pos,ref,alt) with the explicit assembly (but not the implicit assembly) to (ac,pos,ref,alt). This change alone would be a great first step.
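A minimal sketch of the (chr,pos,ref,alt) to (ac,pos,ref,alt) mapping described above, in Python. The table, the `Variant` type, and both function names are hypothetical and not taken from the MyVariant.info code base. Only the chr6 accessions quoted in this thread are included, and the HGVS formatter handles single-base substitutions only.

```python
from typing import NamedTuple

# Illustrative, partial table: only the chr6 accessions quoted in this thread.
# A real table would cover every sequence in each supported assembly.
ACCESSIONS = {
    ("GRCh37", "chr6"): "NC_000006.11",
    ("GRCh38", "chr6"): "NC_000006.12",
}


class Variant(NamedTuple):
    ac: str   # versioned sequence accession
    pos: int  # 1-based position on that sequence
    ref: str
    alt: str


def to_accession_variant(assembly, chrom, pos, ref, alt):
    """Map (chr, pos, ref, alt) plus an explicit assembly to (ac, pos, ref, alt)."""
    try:
        ac = ACCESSIONS[(assembly, chrom)]
    except KeyError:
        raise ValueError(f"no accession known for {chrom} on {assembly}") from None
    return Variant(ac, pos, ref, alt)


def to_hgvs_g(variant):
    """Format a single-base substitution as a genomic HGVS string."""
    return f"{variant.ac}:g.{variant.pos}{variant.ref}>{variant.alt}"
```

Because the assembly is a required argument, the same chr6 coordinates can never be silently interpreted against the wrong build.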

Then, as a second effort, you should be able to index at least the HGVS expressions that are currently returned. That would ensure that all of the HGVS expressions may be used to retrieve variant info. (Gotcha: some variants, especially protein variants, will resolve to multiple genomic and transcript variants.)
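One way to picture this second step is a reverse index from each returned HGVS expression to the variant ids that list it. The sketch below is in Python and assumes a simplified, hypothetical record shape with a flat `hgvs` list; the real documents spread these expressions across several source-specific fields. Each expression maps to a set of ids so that an expression resolving to several variants is not lost.

```python
from collections import defaultdict


def build_hgvs_index(records):
    """Map every recorded HGVS expression to the set of variant ids that list it."""
    index = defaultdict(set)
    for record in records:
        for expr in record.get("hgvs", []):
            index[expr].add(record["_id"])
    return index


# Hypothetical record shape; the id and expressions are the ones quoted in this thread.
records = [
    {
        "_id": "chr6:g.152708291G>A",
        "hgvs": ["NC_000006.11:g.152708291G>A", "NG_012855.1:g.255244C>T"],
    },
]
index = build_hgvs_index(records)
```

A lookup such as `index.get("NG_012855.1:g.255244C>T", set())` then returns every matching primary key, or an empty set for an unknown expression.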

As an author of VRS, I'd certainly be thrilled for you to adopt VRS ids! I'd be happy to talk that through.

Also, I want to reiterate that myvariant.info is lovely and I appreciate the effort that it took to pull this together.

newgene commented 3 years ago

@reece An id like chr6:g.152708291G>A pretty much serves as an internal primary key for the variant object. We avoid using NC_000006.11:g.152708291G>A because it's not stable enough (.ver could change, like NC_000006.12:g.152708291G>A in the future). Agree that we need a better primary key for variant objects and VRS ids look promising.

For the user-facing query interface, we can certainly do some conversion to map NC_000006.x to chr6, so that a query with a genomic-accession-based hgvs name (e.g. NC_000006.11:g.152708291G>A) will work. However, coding-sequence-accession-based hgvs names (e.g. NM_033071.3:c.8424C>T) will be harder to include unless there is a source from which we can obtain or pre-compute the valid hgvs names.
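A rough sketch of that NC_000006.x to chr6 conversion, in Python. The function name and pattern are hypothetical and not part of MyVariant.info. It relies on the RefSeq convention that NC_000001 through NC_000022 are chromosomes 1 through 22, NC_000023 is X, and NC_000024 is Y. The accession version is returned alongside the id because it identifies the assembly; turning that version into an assembly name is left to the caller.

```python
import re

# NC_0000NN.V:g.<rest>, where NN is the chromosome number and V the version.
_NC_PATTERN = re.compile(r"^NC_0000(\d{2})\.(\d+):(g\..+)$")
_SEX_CHROMOSOMES = {23: "X", 24: "Y"}


def nc_to_chr_id(hgvs):
    """Return (chr-style id, accession version), or None if this is not
    a genomic expression on an NC_000001..NC_000024 accession."""
    match = _NC_PATTERN.match(hgvs)
    if match is None:
        return None
    number = int(match.group(1))
    if not 1 <= number <= 24:
        return None
    chrom = _SEX_CHROMOSOMES.get(number, str(number))
    return f"chr{chrom}:{match.group(3)}", int(match.group(2))
```

Transcript-based names such as NM_033071.3:c.8424C>T fall through to `None`, which matches the point above that they need a separate data source.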

Having said that, we already have some data sources providing a list of equivalent hgvs names (e.g. clinvar.hgvs, civic.hgvs_expressions fields, more at http://myvariant.info/metadata/fields and search for "hgvs"), so you can already query for variants by those recorded hgvs names (not on /v1/variant endpoint yet):

http://myvariant.info/v1/query?q="NC_000006.11:g.152708291G>A"&fields=snpeff,clinvar
http://myvariant.info/v1/query?q="NM_033071.3:c.8424C>T"&fields=snpeff,clinvar

But we currently don't have a good systematic way to obtain all (at least commonly used) hgvs names.
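For scripted use of the /v1/query lookups shown above, the quotes, colon, and `>` in an HGVS name need percent-encoding. The following Python sketch only builds the URL and makes no request; `build_query_url` is a hypothetical helper, and the default field list simply mirrors the examples in this comment.

```python
from urllib.parse import urlencode


def build_query_url(hgvs, fields=("snpeff", "clinvar")):
    """Build a /v1/query URL that searches for an exact, quoted HGVS name."""
    params = {
        "q": f'"{hgvs}"',            # double quotes request an exact phrase match
        "fields": ",".join(fields),  # limit the fields returned
    }
    return "http://myvariant.info/v1/query?" + urlencode(params)


url = build_query_url("NC_000006.11:g.152708291G>A")
```

The resulting string can be passed to any HTTP client, including the `curl` invocations used earlier in this thread.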