Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0

Wrong HGVSP annotation for some RefSeq sequences for GRCh37 #103

Open heseber opened 1 year ago

heseber commented 1 year ago

Example: NC_000003.11:g.49928691T>C (corrected: NC_000017.10:g.43350892G>A)

This is annotated for RefSeq as a missense variant NP_003945.2:p.(Ala545Val) with codons gCc/gTc, which is wrong because the frame is erroneously off by 1. The true change is ggC/ggT, which is a synonymous mutation Gly545Gly.

heseber commented 1 year ago

This affects MAP3K14. As I just learned from a colleague, it was already reported to Illumina in September 2022. It also affects the results generated by LocalApp for the TSO500 and TSO COMP panels.

MichaelStromberg commented 1 year ago

Hi Henrik,

Did you provide the correct HGVS g. notation?

Annotating that variant with Nirvana and Biocommons HGVS reveals overlaps with MST1R.

Looking at the latest version of RefSeq for GRCh37, I would expect to see MAP3K14 on chr17. Here's the GFF line for the canonical transcript (NM_003954.5/NP_003945.2):

NC_000017.10    BestRefSeq      mRNA    43340486        43394386        .       -       .       ID=rna-NM_003954.5;Parent=gene-MAP3K14;Dbxref=GeneID:9020,Genbank:NM_003954.5,HGNC:HGNC:6853,MIM:604655;Name=NM_003954.5;Note=The RefSeq transcript has 1 frameshift compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=MAP3K14;inference=similar to RNA sequence%2C mRNA (same species):RefSeq:NM_003954.5;product=mitogen-activated protein kinase kinase kinase 14;tag=RefSeq Select;transcript_id=NM_003954.5

MichaelStromberg commented 1 year ago

Here's the annotation for NC_000003.11:g.49928691T>C that we currently get with the latest internal version of Nirvana (3.20):

{
   "transcript":"NM_002447.4",
   "source":"RefSeq",
   "bioType":"mRNA",
   "codons":"Ggc/Ggc",
   "aminoAcids":"G",
   "cdnaPos":"3847",
   "cdsPos":"3583",
   "exons":"17/20",
   "proteinPos":"1195",
   "geneId":"4486",
   "hgnc":"MST1R",
   "consequence":[
      "synonymous_variant"
   ],
   "hgvsc":"NM_002447.4:c.3583=",
   "hgvsp":"NP_002438.2:p.(Gly1195=)",
   "isCanonical":true,
   "proteinId":"NP_002438.2"
}

heseber commented 1 year ago

Hi Michael, you are right, I mixed up the HGVSG for two cases that I looked at.

MST1R has a different issue, caused by a difference in the genomic backbone between GRCh37 and GRCh38. At this position GRCh37 carries a base that is a SNP, and GRCh38 replaces it with the common variant. As a result, the hgvsp derived from Nirvana+GRCh37 does not match the NP sequence: the NP sequence has the amino acid according to GRCh38, where this is a synonymous mutation, while translating from GRCh37 results in a missense mutation. It is interesting that your internal version 3.20 gives the synonymous mutation for MST1R, so you must have corrected something here. With 3.18.1, which is the latest public release, this looks different.

The correct HGVSG for MAP3K14 is NC_000017.10:g.43350892G>A, and two other mutations are also affected: NC_000017.10:g.43342141G>C and NC_000017.10:g.43344807G>A. The alignment of the transcript NM_003954 to the genome is one position off. This happened either in the VEP data sources that you use as input or during generation of the cache file. Sorry for the confusion.

MichaelStromberg commented 1 year ago

Thanks for the quick reply, @heseber !

In an old release of the TSO500 software, Nirvana 3.2.3 was used, and it produced the following incorrect annotation:

Nirvana 3.2.3

{
   "transcript":"NM_003954.3",
   "source":"RefSeq",
   "bioType":"protein_coding",
   "codons":"gCc/gTc",
   "aminoAcids":"A/V",
   "cdnaPos":"1743",
   "cdsPos":"1634",
   "exons":"9/16",
   "proteinPos":"545",
   "geneId":"9020",
   "hgnc":"MAP3K14",
   "consequence":[
      "missense_variant"
   ],
   "hgvsc":"NM_003954.3:c.1634C>T",
   "hgvsp":"NP_003945.2:p.(Ala545Val)",
   "isCanonical":true,
   "proteinId":"NP_003945.2"
}

Subsequent versions of TSO500 used Nirvana 3.2.5.1 and Nirvana 3.2.6. Both provide the correct annotation:

Nirvana 3.2.5.1

{
   "transcript":"NM_003954.3",
   "source":"RefSeq",
   "bioType":"protein_coding",
   "codons":"ggC/ggT",
   "aminoAcids":"G",
   "cdnaPos":"1744",
   "cdsPos":"1635",
   "exons":"9/16",
   "proteinPos":"545",
   "geneId":"9020",
   "hgnc":"MAP3K14",
   "consequence":[
      "synonymous_variant"
   ],
   "hgvsc":"NM_003954.3:c.1635C>T",
   "hgvsp":"NM_003954.3:c.1635C>T(p.(Gly545=))",
   "isCanonical":true,
   "proteinId":"NP_003945.2"
}
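
To check which fields actually changed between the two records above, a small comparison helper can be useful. The following is only a sketch: `diff_fields` is a hypothetical helper name, not part of Nirvana, and the two dictionaries are copied from the 3.2.3 and 3.2.5.1 annotations quoted in this comment.

```python
def diff_fields(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Transcript annotation as quoted for Nirvana 3.2.3
nirvana_3_2_3 = {
    "transcript": "NM_003954.3", "source": "RefSeq", "bioType": "protein_coding",
    "codons": "gCc/gTc", "aminoAcids": "A/V", "cdnaPos": "1743", "cdsPos": "1634",
    "exons": "9/16", "proteinPos": "545", "geneId": "9020", "hgnc": "MAP3K14",
    "consequence": ["missense_variant"],
    "hgvsc": "NM_003954.3:c.1634C>T", "hgvsp": "NP_003945.2:p.(Ala545Val)",
    "isCanonical": True, "proteinId": "NP_003945.2",
}

# Transcript annotation as quoted for Nirvana 3.2.5.1
nirvana_3_2_5_1 = {
    "transcript": "NM_003954.3", "source": "RefSeq", "bioType": "protein_coding",
    "codons": "ggC/ggT", "aminoAcids": "G", "cdnaPos": "1744", "cdsPos": "1635",
    "exons": "9/16", "proteinPos": "545", "geneId": "9020", "hgnc": "MAP3K14",
    "consequence": ["synonymous_variant"],
    "hgvsc": "NM_003954.3:c.1635C>T", "hgvsp": "NM_003954.3:c.1635C>T(p.(Gly545=))",
    "isCanonical": True, "proteinId": "NP_003945.2",
}

changed = diff_fields(nirvana_3_2_3, nirvana_3_2_5_1)
for field, (before, after) in changed.items():
    print(f"{field}: {before!r} -> {after!r}")
```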

Here we see the differences in codons, aminoAcids, cdnaPos, cdsPos, consequence, hgvsc, and hgvsp. That transcript for MAP3K14 is interesting in that the transcript sequence has a C insertion between positions 764 and 765 relative to the genomic reference. As a result, if that insertion isn't properly accounted for, all the subsequent annotations on that transcript will be offset by one.
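
The off-by-one effect can be illustrated with a little arithmetic. This is a rough sketch and not Nirvana's implementation. It assumes that positions 764 and 765 are cDNA coordinates, and it takes the offset between cDNA and CDS positions (109) from the quoted 3.2.3 record (cdnaPos 1743, cdsPos 1634). The function names are hypothetical.

```python
from typing import Tuple

CDS_OFFSET = 109       # cdnaPos - cdsPos in the quoted 3.2.3 record (1743 - 1634)
INSERTION_AFTER = 764  # assumption: the C insertion follows cDNA position 764


def correct_cdna_pos(naive_cdna_pos: int, insertion_after: int, insertion_len: int = 1) -> int:
    """Shift a genome-derived cDNA position past a transcript-only insertion."""
    if naive_cdna_pos > insertion_after:
        return naive_cdna_pos + insertion_len
    return naive_cdna_pos


def codon_location(cds_pos: int) -> Tuple[int, int]:
    """Map a 1-based CDS position to (protein position, base within codon), both 1-based."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


naive_cdna = 1743  # position obtained when the insertion is ignored
fixed_cdna = correct_cdna_pos(naive_cdna, INSERTION_AFTER)

naive = codon_location(naive_cdna - CDS_OFFSET)  # CDS 1634
fixed = codon_location(fixed_cdna - CDS_OFFSET)  # CDS 1635

print(naive)  # (545, 2): second base of codon 545, matching gCc/gTc
print(fixed)  # (545, 3): third base of codon 545, matching ggC/ggT
```

Both positions fall in codon 545, but the variant moves from the second base to the third, which is why the call changes from missense to synonymous.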

Nirvana 3.2.6

Nirvana 3.2.6 uses data directly from RefSeq and therefore annotates this transcript accurately:

{
   "transcript":"NM_003954.5",
   "source":"RefSeq",
   "bioType":"mRNA",
   "codons":"ggC/ggT",
   "aminoAcids":"G",
   "cdnaPos":"1716",
   "cdsPos":"1635",
   "exons":"9/16",
   "proteinPos":"545",
   "geneId":"9020",
   "hgnc":"MAP3K14",
   "consequence":[
      "synonymous_variant"
   ],
   "hgvsc":"NM_003954.5:c.1635C>T",
   "hgvsp":"NM_003954.5:c.1635C>T(p.(Gly545=))",
   "isCanonical":true,
   "proteinId":"NP_003945.2"
}

Nirvana 3.16.1 - 3.19.0

I can also confirm that the normal Nirvana releases (3.16.1 & 3.19.0) annotate this incorrectly, mostly because the input data had some artifacts.

Nirvana 3.20

Our latest internal release, Nirvana 3.20.0, grabs all the genes and transcript data directly from RefSeq and Ensembl. Therefore, like Nirvana 3.2.6, it annotates correctly:

{
   "transcript":"NM_003954.5",
   "source":"RefSeq",
   "bioType":"mRNA",
   "codons":"ggC/ggT",
   "aminoAcids":"G",
   "cdnaPos":"1716",
   "cdsPos":"1635",
   "exons":"9/16",
   "proteinPos":"545",
   "geneId":"9020",
   "hgnc":"MAP3K14",
   "consequence":[
      "synonymous_variant"
   ],
   "hgvsc":"NM_003954.5:c.1635C>T",
   "hgvsp":"NP_003945.2:p.(Gly545=)",
   "isCanonical":true,
   "proteinId":"NP_003945.2"
}

heseber commented 1 year ago

Dear Michael, thank you very much for looking into this and for all the explanations. This means that we should encourage the CROs we are working with to update their LocalApp software to the latest release with Nirvana 3.2.6 under the hood. I was using release 3.18.1 (I don't think 3.19.0 has been released yet?) because it also works for GRCh38. Many CROs provide mutations based on GRCh37 (for panels other than TSO), so for those it would be an option to use 3.2.6 instead of 3.18.1. This is obviously not an option when we want to use the up-to-date genome version GRCh38. I heard rumors that your internal version 3.20 and later versions won't be released publicly. Please correct me if that is not true.