Open jonn-smith opened 6 years ago
A temporary fix is in that will only annotate variants as in a CDS if they are contained within that CDS.
This is related to #4804
Can oncotator handle this?
Funcotator and Oncotator both produce the same results for chr5:71622537-71622538 CA->C
. They both seem to ignore the transcript that runs off the end of the boundary.
GencodeFuncotationFactory::getContainingGtfSubfeature
contains a "fix" to get around this issue.
This is related to #3749
For the following variants:
chr1 819955 . TTCACGAATT T . PASS .
chr1 819979 . CATGAGCAT C . PASS .
VEP seems to produce incorrect data:
##fileformat=VCFv4.1
##VEP="v94" time="2018-10-30 19:13:50" cache="/nfs/public/release/ensweb-data/latest/tools/grch37/e94/vep/cache/homo_sapiens/94_GRCh37" db="homo_sapiens_core_94_37@hh-mysql-ens-grch37-web" 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|APPRIS|ENSP|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 819955 . TTCACGAATT T . PASS CSQ=-|splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|AL645608.2|ENSG00000269308|Transcript|ENST00000594233|protein_coding|3/3|2/2|||?-38|?-38|?-13|||||1||Clone_based_ensembl_gene||||ENSP00000470877|||||||||||||||||||||||||||
chr1 819979 . CATGAGCAT C . PASS CSQ=-|coding_sequence_variant&3_prime_UTR_variant|MODIFIER|AL645608.2|ENSG00000269308|Transcript|ENST00000594233|protein_coding|3/3||||54-?|54-?|18-?|||||1||Clone_based_ensembl_gene||||ENSP00000470877|||||||||||||||||||||||||||
There is no cDNA string / amino acid change and the positions have ?
characters in them (not 100% sure the question marks are wrong - there is no real spec for these fields).
Oncotator seems to produce incorrect data:
##fileformat=VCFv4.1
##fileDate=Thu Sept 6 15:53:20 UTC 2018
##reference=/cromwell_root/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
##oncotator_version=Oncotator_v1.10.0.0dev_|_Flat_File_Reference_hg19_|_GENCODE_v19_CANONICAL_
##INFO=<ID=annotation_transcript,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_status,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_protein_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=genome_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=ccds_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=codon_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_coding_DNA_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_position,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=other_transcripts,Number=.,Type=String,Description="Unknown">
##INFO=<ID=HGVS_genomic_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_tags,Number=.,Type=String,Description="Unknown">
##INFO=<ID=variant_classification,Number=.,Type=String,Description="Unknown">
##INFO=<ID=secondary_variant_classification,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_name,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_exon,Number=.,Type=String,Description="Unknown">
##INFO=<ID=havana_transcript,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gencode_transcript_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=protein_change,Number=.,Type=String,Description="Unknown">
##INFO=<ID=ref_context,Number=.,Type=String,Description="Unknown">
##INFO=<ID=variant_type,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_id,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gc_content,Number=.,Type=String,Description="Unknown">
##INFO=<ID=gene,Number=.,Type=String,Description="Unknown">
##INFO=<ID=transcript_strand,Number=.,Type=String,Description="Unknown">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
#CHROM POS ID REF ALT QUAL FILTER INFO
1 819955 . TTCACGAATT T . PASS annotation_transcript=ENST00000594233.1;gencode_transcript_status=NOVEL;genome_change=g.chr1:819956_819964delTCACGAATT;codon_change=c.(34-39)tcacga>a;HGVS_coding_DNA_change=ENST00000594233.1:c.35-1TCACGAATT>-;transcript_position=34_38;transcript_change=c.34_38delTCACGAATT;HGVS_genomic_change=chr1.hg19:g.819956_819964delTCACGAATT;gencode_transcript_tags=basic|appris_principal;variant_classification=Splice_Site;secondary_variant_classification=In_Frame_Del;gencode_transcript_name=AL645608.2-201;transcript_exon=3;gene_type=protein_coding;gencode_transcript_type=protein_coding;protein_change=p.SR12del;ref_context=atggctattttcacgaattaattcttccg;variant_type=DEL;transcript_id=ENST00000594233.1;gc_content=0.349;gene=AL645608.2;transcript_strand=+
1 819979 . CATGAGCAT C . PASS annotation_transcript=ENST00000594233.1;gencode_transcript_status=NOVEL;HGVS_protein_change=Exception_encountered;genome_change=g.chr1:819980_819987delATGAGCAT;HGVS_coding_DNA_change=ENST00000594233.1:c.54_*4delATGAGCAT;transcript_position=54_57;HGVS_genomic_change=chr1.hg19:g.819980_819987delATGAGCAT;gencode_transcript_tags=basic|appris_principal;variant_classification=Stop_Codon_Del;gencode_transcript_name=AL645608.2-201;transcript_exon=0;gene_type=protein_coding;gencode_transcript_type=protein_coding;ref_context=ttccgtatccatgagcatggaatgcttc;variant_type=DEL;transcript_id=ENST00000594233.1;gc_content=0.370;gene=AL645608.2;transcript_strand=+
The protein change / cdna string / coding string fields are all incorrect. The positions in the transcript may also be incorrect.
Note that this ticket also applies to variants that span between the UTR and flanking regions.
The way start positions are calculated causes issues when alleles span the boundaries of exons and the coding sequence itself.
For exon boundaries:
The error stems from
FuncotatorUtils::getStartPositionInCodingSequence
and how the results of that method call are used in theGencodeFuncotationFactory
.In addition: Funcotator must be able to handle indels that span exon start boundaries.
For example, in hg19 the following variant is not handled properly:
chr5:71622537-71622538 CA->C
The current workaround is to throw a FuncotatorUtils.TranscriptCodingSequenceException for the transcript causing this problem in Gencode.
For Coding sequence boundaries, the following variant in hg38 causes a problem: