Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0

v3.19 Production build annotates wrong transcripts as canonical #104

Closed jhkbg closed 1 year ago

jhkbg commented 1 year ago

I noticed that at least in one instance, the canonical transcript annotation for Ensembl is incorrect in v3.19.

Nirvana marks an Ensembl transcript as canonical that does not correspond to the RefSeq transcript. Ensembl itself lists another transcript as canonical, which does match the RefSeq canonical.

Gene: TP53, NCBI 7157
Canonical RefSeq transcript: NM_000546
Canonical Ensembl: ENST00000269305
Nirvana-reported canonical Ensembl: ENST00000610292

Source: http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000141510;r=17:7661779-7687538

This was not an issue in the past, and it appears to be fixed in 3.20. I need you to investigate the scope of this issue (how many other genes are affected?) so that we can decide whether and how we need to notify customers to re-analyze their samples. Note that this gene is part of TSO500, so several products might be affected. In ICA Cohorts, we're looking at re-analyzing some 10K WGS samples, depending on the extent of this issue.

Env: US Prod Lambda Service.

JSON header: {"header":{"annotator":"Nirvana 3.19.0","creationTime":"2023-05-03 22:54:06","genomeAssembly":"GRCh38","schemaVersion":6,"dataVersion":"91.27.67","dataSources":[{"name":"VEP","version":"91","description":"BothRefSeqAndEnsembl","releaseDate":"2017-12-18"}

ENST00000269305 and NM_000546 are the canonical transcripts according to Ensembl. They were also marked as canonical in previous Nirvana outputs. I am not sure when this issue was introduced in Nirvana. I have some older Nirvana results from last summer and they are not affected: ENST00000269305 is marked as canonical. (These are in a DB without Nirvana header/version info, though.)

Input: DRAGEN re-sequenced 1000 Genomes sample HG00097, GRCh38

gzip -c -d DRAGEN-1KGP-3202-HG00097.hard-filtered.vcf.gz | grep 17 | grep 7676154

chr17 7676154 . G C 50.00 PASS AC=1;AF=0.500;AN=2;DP=32;FS=0.000;MQ=250.00;MQRankSum=4.719;QD=1.56;ReadPosRankSum=2.878;SOR=0.681;FractionInformativeReads=1.000;R2_5P_bias=-2.088 GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:13,19:0.594:32:7,7:6,12:47:85,0,48:5.0000e+01,8.1800e-05,5.0542e+01:0.00,34.77,37.77:6,7,10,9:6,7,12,7
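A side note on reproducing the lookup: the two chained `grep` calls match any line that contains both substrings anywhere, not only the record at chr17:7676154. A minimal Python sketch of an exact match on the CHROM and POS columns could look like the following. The function name `records_at` is hypothetical, and the sketch assumes a standard tab-separated VCF.

```python
import gzip
from typing import Iterable, Iterator


def records_at(lines: Iterable[str], chrom: str, pos: int) -> Iterator[str]:
    """Yield VCF data lines whose CHROM and POS columns match exactly."""
    wanted_pos = str(pos)
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and column header lines
        # Only the first two tab-separated columns are needed for the match.
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) >= 2 and fields[0] == chrom and fields[1] == wanted_pos:
            yield line.rstrip("\n")


def print_records(path: str, chrom: str, pos: int) -> None:
    """Print matching records from a gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        for record in records_at(handle, chrom, pos):
            print(record)
```

For the sample in this report, the call would be `print_records("DRAGEN-1KGP-3202-HG00097.hard-filtered.vcf.gz", "chr17", 7676154)`.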

JSON output (trimmed): {**"transcript":"ENST00000610292.4"**,"source":"Ensembl","bioType":"protein_coding","codons":"cCc/cGc","aminoAcids":"P/R","cdnaPos":"465","cdsPos":"98","exons":"3/10","proteinPos":"33","geneId":"ENSG00000141510","hgnc":"TP53","consequence":["missense_variant"],"hgvsc":"ENST00000610292.4:c.98C>G","hgvsp":"ENSP00000478219.1:p.(Pro33Arg)",**"isCanonical":true**,"polyPhenScore":0.045,"polyPhenPrediction":"benign","proteinId":"ENSP00000478219.1","siftScore":0.26,"siftPrediction":"tolerated"},

{**"transcript":"ENST00000269305.8"**,"source":"Ensembl","bioType":"protein_coding","codons":"cCc/cGc","aminoAcids":"P/R","cdnaPos":"405","cdsPos":"215","exons":"4/11","proteinPos":"72","geneId":"ENSG00000141510","hgnc":"TP53","consequence":["missense_variant"],"hgvsc":"ENST00000269305.8:c.215C>G","hgvsp":"ENSP00000269305.4:p.(Pro72Arg)","polyPhenScore":0.045,"polyPhenPrediction":"benign","proteinId":"ENSP00000269305.4","siftScore":0.57,"siftPrediction":"tolerated"},

{**"transcript":"NM_000546.5"**,"source":"RefSeq","bioType":"protein_coding","codons":"cCc/cGc","aminoAcids":"P/R","cdnaPos":"417","cdsPos":"215","exons":"4/11","proteinPos":"72","geneId":"7157","hgnc":"TP53","consequence":["missense_variant"],"hgvsc":"NM_000546.5:c.215C>G","hgvsp":"NP_000537.3:p.(Pro72Arg)"**,"isCanonical":true**,"polyPhenScore":0.045,"polyPhenPrediction":"benign","proteinId":"NP_000537.3","siftScore":0.57,"siftPrediction":"tolerated"},
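To estimate how many other genes show the same pattern, one option is a rough screen over existing Nirvana output. The sketch below is a heuristic under stated assumptions, not how Nirvana assigns canonical transcripts: it takes transcript records with the fields shown above (`source`, `hgnc`, `cdsPos`, `isCanonical`) and flags genes where the canonical Ensembl and canonical RefSeq transcripts report different `cdsPos` values for the same variant. The function name `discordant_canonical_genes` is hypothetical, and extracting the transcript records from a full Nirvana JSON file is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def discordant_canonical_genes(transcripts: Iterable[dict]) -> List[str]:
    """Return genes whose canonical Ensembl and RefSeq transcripts disagree on cdsPos.

    Heuristic only: pass the transcripts of a single variant at a time.
    A mismatch suggests the two canonical transcripts are not equivalent;
    a match does not prove that they are.
    """
    # gene symbol -> source ("Ensembl" / "RefSeq") -> cdsPos values of canonical transcripts
    cds_by_gene: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for record in transcripts:
        gene = record.get("hgnc")
        source = record.get("source")
        cds_pos = record.get("cdsPos")
        if record.get("isCanonical") and gene and source and cds_pos:
            cds_by_gene[gene][source].add(cds_pos)

    flagged = []
    for gene, by_source in cds_by_gene.items():
        ensembl = by_source.get("Ensembl")
        refseq = by_source.get("RefSeq")
        # Flag only when both sources have a canonical transcript and none agree.
        if ensembl and refseq and ensembl.isdisjoint(refseq):
            flagged.append(gene)
    return sorted(flagged)


# Records trimmed from the JSON output in this report.
example = [
    {"transcript": "ENST00000610292.4", "source": "Ensembl", "hgnc": "TP53",
     "cdsPos": "98", "isCanonical": True},
    {"transcript": "ENST00000269305.8", "source": "Ensembl", "hgnc": "TP53",
     "cdsPos": "215"},
    {"transcript": "NM_000546.5", "source": "RefSeq", "hgnc": "TP53",
     "cdsPos": "215", "isCanonical": True},
]
print(discordant_canonical_genes(example))  # ['TP53']
```

Genes flagged this way would still need a manual check against Ensembl, since canonical transcripts from the two sources can legitimately differ.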