genome-nexus / genome-nexus-annotation-pipeline

Library and tool for annotating MAF files using Genome Nexus Webserver API
MIT License

Handling variants with common prefixes between GENIE and public GN #260

Open rmadupuri opened 1 year ago

rmadupuri commented 1 year ago

A few variants annotate successfully when pointed to the public GN (https://www.genomenexus.org/) but fail when pointed to the GENIE GN (https://genie.genomenexus.org/). The variants are passed in region format for GENIE and in hgvsg format for public, and variants with common prefixes are handled differently in each case.

Below are a few examples that pass annotation when pointed to the public site but fail when pointed to the GENIE site: test_failed_variants.txt

leexgh commented 1 year ago

I made a flowchart to show why some variants fail on GENIE Genome Nexus but annotate successfully on public Genome Nexus (https://lucid.app/lucidchart/1424c03a-ec63-4b51-99ef-a3fcab1a600e/edit?viewport_loc=391%2C100%2C2208%2C1159%2C0_0&invitationId=inv_edc74010-8a66-4178-9ac8-0187072a9ebd).

Basically, it's because the genomic coordinates don't match the length of the reference allele. Except for insertions, every variant type should have a start-to-end span (end - start + 1) equal to the length of the reference allele; insertions should have end = start + 1. When we validate an annotation, we compare the given reference allele with the annotated reference allele. The annotated reference allele is derived from the given genomic coordinates, so if those coordinates don't match the length of the given reference allele, the annotated allele comes out longer or shorter than the given one and the comparison fails.
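The rule above can be sketched as a small check. This is only an illustration, not the Genome Nexus code: the `GenomicLocation` type and the function names are hypothetical, and it assumes 1-based inclusive coordinates and that an insertion is written with `-` (or an empty string) as the reference allele.

```python
from typing import NamedTuple


class GenomicLocation(NamedTuple):
    # Hypothetical container; field order mirrors "chr,start,end,ref,alt".
    chromosome: str
    start: int
    end: int
    reference_allele: str
    variant_allele: str


def is_insertion(loc: GenomicLocation) -> bool:
    # Assumption: insertions carry "-" (or nothing) as the reference allele.
    return loc.reference_allele in ("-", "")


def coordinates_match_reference(loc: GenomicLocation) -> bool:
    """Return True if start/end are consistent with the reference allele."""
    if is_insertion(loc):
        # Insertions sit between two adjacent bases.
        return loc.end == loc.start + 1
    # All other variant types: the inclusive span equals the allele length.
    return loc.end - loc.start + 1 == len(loc.reference_allele)
```

With this check, `3,183210442,183210443,C,CT` is flagged as inconsistent, because a one-base reference allele is paired with a two-base span.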

There is a corner case on public Genome Nexus for a variant with a single-base reference allele and a wrong end position, e.g. 3,183210442,183210443,C,CT. When we create the follow-up query to validate the annotation, it is created in hgvs format (3:g.183210442del), which doesn't include the wrong end position, so the variant can pass validation.

The solution I would propose is to harmonize the genomic location (https://github.com/genome-nexus/genome-nexus/pull/701), which could also solve some other problems, such as a missing end position.
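As a rough idea of what harmonizing could mean, here is a minimal sketch that recomputes the end position from the start position and the reference allele. It is not taken from the linked PR, whose actual logic may differ. The function name `harmonize_end` is hypothetical, and the sketch assumes that the start position and reference allele are trusted, that coordinates are 1-based inclusive, and that insertions use `-` as the reference allele.

```python
from typing import Optional


def harmonize_end(start: int, end: Optional[int], reference_allele: str) -> int:
    """Return an end position consistent with start and the reference allele.

    The given end (possibly wrong or missing) is ignored on purpose:
    under the assumptions above it can always be derived from the rest.
    """
    if reference_allele in ("-", ""):
        # Assumed insertion convention: end = start + 1.
        return start + 1
    # Otherwise the inclusive span must cover the whole reference allele.
    return start + len(reference_allele) - 1


# A wrong end position is corrected, and a missing one is filled in.
fixed_end = harmonize_end(183210442, 183210443, "C")
filled_end = harmonize_end(100, None, "ACG")
```

With an approach like this, region-format and hgvsg-format queries would be built from the same coordinates, so both sites should validate the variant the same way.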