Ensembl / VEP_plugins

Plugins for the Ensembl Variant Effect Predictor (VEP)
Apache License 2.0

Should the Downstream plugin predict stop_lost downstream sequences #286

Closed susannasiebert closed 2 years ago

susannasiebert commented 4 years ago

Right now it appears that the Downstream plugin doesn't output a predicted downstream sequence for stop_lost variants (or it is just a sequence of multiple Xs). Is this the expected behavior? Would it be possible to add this feature?

helensch commented 4 years ago

Hi

I am looking into the downstream sequence for frameshift,stop_lost variants.

Please could you let me know which version of VEP you are using and the vep command being used with the Downstream plugin.

Thanks

susannasiebert commented 4 years ago

It looks like the particular VCF I'm looking at is a bit older and was annotated with VEP 84. Here is the VEP header: ##VEP=v84 cache=/var/lib/cwl/stgcfd43713-2826-47bb-88e5-98975e7395ce/cache/homo_sapiens/84_GRCh38 db=. dbSNP=146 genebuild=2014-07 COSMIC=75 polyphen=2.2.2 regbuild=13.0 sift=sift5.2.2 ClinVar=201601 gencode=GENCODE 22 HGMD-PUBLIC=20154 ESP=20141103 assembly=GRCh38.p5

helensch commented 4 years ago

Hi

Are you running VEP in offline mode (using the flag --offline)?

If VEP is run in offline mode using the flag --offline, a FASTA file is required to get the sequences for the 3' UTR.

The sequence may be incomplete without a FASTA file or a database connection.

I have updated the documentation for the plugin for future releases.

Thank you for flagging this issue.

Helen

huimingx commented 4 years ago

Hi Helen,

I've been testing the Downstream plugin on stop_lost variants with Susanna, and with the following command options I still fail to obtain a downstream sequence:

 --vcf -term SO -transcript_version --offline --cache --symbol --dir ./VEP_cache --check_existing --flag_pick --fasta all_sequences.fa --plugin Downstream --plugin Wildtype --everything --assembly GRCh38 --cache_version 95 --species homo_sapiens

The example variant I'm looking at is: chr1:212360768 ref:TA var:T

Thanks.

helensch commented 4 years ago

Hi

Are you getting any information returned on the change in length relative to the reference protein?

The Downstream plugin returns 2 fields:

- DownstreamProtein: Predicted downstream translation for frameshift mutations
- ProteinLengthChange: Predicted change in protein product length

When I ran VEP with the Downstream plugin for your example variant the following was returned:

Location chr1:212360769 Allele - Consequence frameshift_variant,stop_lost Amino_acids */X Codons tAa/ta DownstreamProtein
ProteinLengthChange 1

Are you getting a value returned for ProteinLengthChange?

There is no downstream protein because the sequence immediately downstream of the deletion starts with 'A', so the shifted codon is again a stop codon.
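This can be checked by hand. A minimal Python sketch, assuming only what the output above shows (the codon change tAa/ta and a next downstream base of 'A'); `codon_after_deletion` is a hypothetical helper for illustration and not the plugin's code:

```python
# Stop codons in the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_after_deletion(ref_codon: str, downstream: str) -> str:
    """Rebuild the first codon after a deletion.

    ref_codon follows the convention used in the Codons column:
    affected bases are upper case, unaffected bases are lower case
    (e.g. "tAa"). downstream is the sequence right after the codon.
    """
    # Keep only the bases that survive the deletion.
    kept = "".join(base for base in ref_codon if base.islower())
    # Refill the codon with as many downstream bases as were deleted.
    missing = 3 - len(kept)
    return (kept + downstream[:missing]).upper()


new_codon = codon_after_deletion("tAa", "A")
print(new_codon, new_codon in STOP_CODONS)
```

With these inputs the rebuilt codon is TAA, which is a stop codon, so there is nothing to translate downstream.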

To test that the Downstream plugin is returning a sequence, you can use the following example variant:

CHROM POS ID REF ALT QUAL FILTER INFO

19 643600 test_2 CCT C . . .

Location 19:643601-643602 Allele - Consequence frameshift_variant,stop_lost Amino_acids S*/SX Codons tcCTga/tcga DownstreamProtein SRP ProteinLengthChange 3

Regards Helen

susannasiebert commented 4 years ago

Hi Helen,

Thank you for looking into this. We confirmed that the example variant results in the expected DownstreamProtein sequence. We also identified a few similar variants in our VCFs so we think we have our VEP commands working correctly now.

A more general question for the VEP Consequence annotation would be whether variants that result in basically "replacing" the stop codon should have a Consequence of stop_retained_variant instead of stop_lost.

aparton commented 4 years ago

Hi @susannasiebert, @huimingx

I'm glad to hear that you've got your VEP commands working correctly.

Regarding your more general question of when we assign stop_retained_variant and stop_lost, we take our consequence terms and descriptions from the Sequence Ontology database, and we use the following definition for stop_lost:

stop_lost: http://www.sequenceontology.org/browser/current_release/term/SO:0001578 - "A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript."

So the consequence we assign depends on where a theoretical new-stop-codon is positioned.
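As a rough illustration of that distinction, the sketch below classifies a substituted stop codon from the codon strings alone. It follows the Sequence Ontology wording (for stop_retained_variant, the terminator codon is changed but the terminator remains) and is a simplification under that assumption, not VEP's actual consequence logic; `classify_stop_change` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_stop_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution that hits a stop codon."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if ref not in STOP_CODONS:
        raise ValueError("reference codon is not a stop codon")
    if len(alt) != 3:
        # Indels shift the frame and need the downstream sequence as well.
        raise ValueError("only in-frame substitutions are handled here")
    if alt == ref:
        return "no_change"
    # The terminator remains if the new codon is still a stop.
    return "stop_retained_variant" if alt in STOP_CODONS else "stop_lost"


print(classify_stop_change("taG", "taT"))
print(classify_stop_change("taG", "taA"))
```

The first call changes TAG to TAT (tyrosine), so the stop is lost; the second changes TAG to TAA, so the terminator remains.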

With the release of Ensembl 100 (officially released this afternoon), we have introduced the option --shift_3prime into VEP, where insertions and deletions within repeated regions will be shifted as far as possible in the 3' direction before consequence calculation. In the example provided by @huimingx above, this will now correctly provide a downstream consequence for your variant - see: http://rest.ensembl.org/vep/human/region/1:212360768-212360769/T?shift_3prime=1&content-type=application/json&minimal=1
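For anyone scripting this check, here is a small Python sketch that builds the same REST URL as the link above. The server, path, and query parameters are copied from that link; the function names are hypothetical, and `fetch_vep` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

REST_SERVER = "http://rest.ensembl.org"


def vep_region_url(species: str, region: str, allele: str,
                   shift_3prime: bool = True) -> str:
    """Build a VEP region URL with the shift_3prime option set."""
    params = {
        "shift_3prime": int(shift_3prime),
        "content-type": "application/json",
        "minimal": 1,
    }
    # safe="/" keeps "application/json" readable in the query string.
    query = urlencode(params, safe="/")
    return f"{REST_SERVER}/vep/{species}/region/{region}/{allele}?{query}"


def fetch_vep(url: str):
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


url = vep_region_url("human", "1:212360768-212360769", "T")
print(url)
```

Calling `fetch_vep(url)` should return the decoded JSON for the variant, assuming the endpoint still accepts these parameters.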

If you have any other issues or if there's anything else we can do to help, please feel free to get in touch.

Kind Regards, Andrew

susannasiebert commented 4 years ago

Hi @aparton,

Sorry to bother you about this again. We thought we had it fixed but we're still seeing some odd behavior. We are not seeing any XXXs in the DownstreamProtein field anymore, but we are also not seeing any downstream sequence predictions if the variant is stop_lost only (it works for frameshift_variant&stop_lost). For example:

1   158095120   1_158095120_G/T G   T   .   .   CSQ=T|stop_lost|HIGH|KIRREL1|ENSG00000183853|Transcript|ENST00000359209.10|protein_coding|15/15||ENST00000359209.10:c.2274G>T|ENSP00000352138.6:p.Ter758TyrextTer85|2341|2274|758|*/Y|taG/taT|||1||1|SNV|HGNC|HGNC:15734|YES|1|P1|CCDS1172.2|ENSP00000352138|Q96J84||UPI0000443FBD|||||||||||||||||||||||||||||||||||||MLSLLVWILTLSDTFSQGTQTRFSQEPADQTVVAGQRAVLPCVLLNYSGIVQWTKDGLALGMGQGLKAWPRYRVVGSADAGQYNLEITDAELSDDASYECQATEAALRSRRAKLTVLIPPEDTRIDGGPVILLQAGTPHNLTCRAFNAKPAATIIWFRDGTQQEGAVASTELLKDGKRETTVSQLLINPTDLDIGRVFTCRSMNEAIPSGKETSIELDVHHPPTVTLSIEPQTVQEGERVVFTCQATANPEILGYRWAKGGFLIEDAHESRYETNVDYSFFTEPVSCEVHNKVGSTNVSTLVNVHFAPRIVVDPKPTTTDIGSDVTLTCVWVGNPPLTLTWTKKDSNMVLSNSNQLLLKSVTQADAGTYTCRAIVPRIGVAEREVPLYVNGPPIISSEAVQYAVRGDGGKVECFIGSTPPPDRIAWAWKENFLEVGTLERYTVERTNSGSGVLSTLTINNVMEADFQTHYNCTAWNSFGPGTAIIQLEEREVLPVGIIAGATIGASILLIFFFIALVFFLYRRRKGSRKDVTLRKLDIKVETVNREPLTMHSDREDDTASVSTATRVMKAIYSSFKDDVDLKQDLRCDTIDTREEYEMKDPTNGYYNVRAHEDRPSSRAVLYADYRAPGPARFDGRPSSRLSHSSGYAQLNTYSRGPASDYGPEPTPPGPAAPAGTDTTSQLSYENYEKFNSHPFPGAAGYPTYRLGYPQAPPSGLERTPYEAYDPIGKYATATRFSYTSQHSDYGQRFQQRMQTHV|||||||||||||||||||,T|stop_lost|HIGH|KIRREL1|ENSG00000183853|Transcript|ENST00000360089.8|protein_coding|13/13||ENST00000360089.8:c.1782G>T|ENSP00000353202.4:p.Ter594TyrextTer85|2373|1782|594|*/Y|taG/taT|||1|||SNV|HGNC|HGNC:15734||1|||ENSP00000353202||Q5W0F9|UPI00001AA15B|||||||||||||||||||||||||||||||||||||MGQGLKAWPRYRVVGSADAGQYNLEITDAELSDDASYECQATEAALRSRRAKLTVLNPPTVTLSIEPQTVQEGERVVFTCQATANPEILGYRWAKGGFLIEDAHESRYETNVDYSFFTEPVSCEVHNKVGSTNVSTLVNVHFAPRIVVDPKPTTTDIGSDVTLTCVWVGNPPLTLTWTKKDSNMVLSNSNQLLLKSVTQADAGTYTCRAIVPRIGVAEREVPLYVNGPPIISSEAVQYAVRGDGGKVECFIGSTPPPDRIAWAWKENFLEVGTLERYTVERTNSGSGVLSTLTINNVMEADFQTHYNCTAWNSFGPGTAIIQLEEREVLPVGIIAGATIGASILLIFFFIALVFFLYRRRKGSRKDVTLRKLDIKVETVNREPLTMHSDREDDTASVSTATRVMKAIYSSFKDDVDLKQDLRCDTIDTREEYEMKDPTNGYYNVRAHEDRPSSRAVLYADYRAPGPARFDGRPSSRLSHSSGYAQLNTYSRGPASDYGPEPTPPGPAAPAGTDTTSQLSYENYEKFNSHPFPGAAGYPTYRLGYPQAPPSGLERTPYEAYDPIGKYATATRFSYT
SQHSDYGQRFQQRMQTHV|||||||||||||||||||,T|stop_lost|HIGH|KIRREL1|ENSG00000183853|Transcript|ENST00000368172.1|protein_coding|11/11||ENST00000368172.1:c.1716G>T|ENSP00000357154.1:p.Ter572TyrextTer85|1728|1716|572|*/Y|taG/taT|||1|||SNV|HGNC|HGNC:15734||2|||ENSP00000357154||Q5W0G0|UPI0000047A8F|||||||||||||||||||||||||||||||||||||MNEAIPSGKETSIELDVHHPPTVTLSIEPQTVQEGERVVFTCQATANPEILGYRWAKGGFLIEDAHESRYETNVDYSFFTEPVSCEVHNKVGSTNVSTLVNVHFAPRIVVDPKPTTTDIGSDVTLTCVWVGNPPLTLTWTKKDSNMGPRPPGSPPEAALSAQVLSNSNQLLLKSVTQADAGTYTCRAIVPRIGVAEREVPLYVNGPPIISSEAVQYAVRGDGGKVECFIGSTPPPDRIAWAWKENFLEVGTLERYTVERTNSGSGVLSTLTINNVMEADFQTHYNCTAWNSFGPGTAIIQLEEREVLPVGIIAGATIGASILLIFFFIALVFFLYRRRKGSRKDVTLRKLDIKVETVNREPLTMHSDREDDTASVSTATRVMKAIYSSFKDDVDLKQDLRCDTIDTREEYEMKDPTNGYYNVRAHEDRPSSRAVLYADYRAPGPARFDGRPSSRLSHSSGYAQLNTYSRGPASDYGPEPTPPGPAAPAGTDTTSQLSYENYEKFNSHPFPGAAGYPTYRLGYPQAPPSGLERTPYEAYDPIGKYATATRFSYTSQHSDYGQRFQQRMQTHV|||||||||||||||||||,T|stop_lost|HIGH|KIRREL1|ENSG00000183853|Transcript|ENST00000368173.7|protein_coding|13/13||ENST00000368173.7:c.1974G>T|ENSP00000357155.4:p.Ter658TyrextTer85|2378|1974|658|*/Y|taG/taT|||1|||SNV|HGNC|HGNC:15734||2||CCDS72952.1|ENSP00000357155||B4DN67|UPI00017A76F9|||||||||||||||||||||||||||||||||||||MLSLLVWILTLSDTFSQVPPEDTRIDGGPVILLQAGTPHNLTCRAFNAKPAATIIWFRDGTQQEGAVASTELLKDGKRETTVSQLLINPTDLDIGRVFTCRSMNEAIPSGKETSIELDVHHPPTVTLSIEPQTVQEGERVVFTCQATANPEILGYRWAKGGFLIEDAHESRYETNVDYSFFTEPVSCEVHNKVGSTNVSTLVNVHFAPRIVVDPKPTTTDIGSDVTLTCVWVGNPPLTLTWTKKDSNMVLSNSNQLLLKSVTQADAGTYTCRAIVPRIGVAEREVPLYVNGPPIISSEAVQYAVRGDGGKVECFIGSTPPPDRIAWAWKENFLEVGTLERYTVERTNSGSGVLSTLTINNVMEADFQTHYNCTAWNSFGPGTAIIQLEEREVLPVGIIAGATIGASILLIFFFIALVFFLYRRRKGSRKDVTLRKLDIKVETVNREPLTMHSDREDDTASVSTATRVMKAIYSSFKDDVDLKQDLRCDTIDTREEYEMKDPTNGYYNVRAHEDRPSSRAVLYADYRAPGPARFDGRPSSRLSHSSGYAQLNTYSRGPASDYGPEPTPPGPAAPAGTDTTSQLSYENYEKFNSHPFPGAAGYPTYRLGYPQAPPSGLERTPYEAYDPIGKYATATRFSYTSQHSDYGQRFQQRMQTHV|||||||||||||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000254728|CTCF_binding_site|||||||||||||||
SNV||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

This is a TAG mutated to a TAT which codes for tyrosine and not a new stop codon so I would expect a DownstreamProtein prediction. We also aren't seeing any values for the ProteinLengthChange for the stop_lost only variants.
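To make the expectation concrete, here is a minimal Python sketch of the read-through I have in mind: translate the altered stop codon and continue into the 3' UTR until the next in-frame stop. This is only an illustration and not how the Downstream plugin works; `read_through` is a hypothetical helper, the 3' UTR string in the example is made up, and the sequence is assumed to contain only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def read_through(alt_stop_codon: str, utr3: str):
    """Return (extension_peptide, found_stop) for a lost stop codon."""
    seq = (alt_stop_codon + utr3).upper()
    peptide = []
    # Walk codon by codon, ignoring any incomplete trailing codon.
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            return "".join(peptide), True
        peptide.append(aa)
    # Ran out of sequence before reaching an in-frame stop.
    return "".join(peptide), False


# TAG -> TAT as in the variant above; the UTR below is hypothetical.
extension, found_stop = read_through("TAT", "GCTGAATGA")
print(extension, found_stop)
```

For the real variant, the HGVSp value in the record above (p.Ter758TyrextTer85) already describes such an extension, so the information appears to be available to VEP.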

The CSQ header is

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SOURCE|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|DownstreamProtein|ProteinLengthChange|WildtypeProtein|gnomADe|gnomADe_AF|gnomADe_AF_AFR|gnomADe_AF_AMR|gnomADe_AF_ASJ|gnomADe_AF_EAS|gnomADe_AF_FIN|gnomADe_AF_NFE|gnomADe_AF_OTH|gnomADe_AF_SAS|clinvar|clinvar_CLINSIGN|clinvar_PHENOTYPE|clinvar_SCORE|clinvar_RCVACC|clinvar_TESTEDINGTR|clinvar_PHENOTYPELIST|clinvar_NUMSUBMIT|clinvar_GUIDELINES">
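For reference, this is roughly how we pull the two plugin fields out of the CSQ annotation. It is a minimal sketch: the header and record below are shortened, hypothetical stand-ins for the real ones above, the function names are our own, and splitting entries on commas assumes that field values contain no unescaped commas.

```python
def csq_fields(header_line: str) -> list:
    """Extract the '|'-separated field names from the CSQ header line."""
    fmt = header_line.split("Format: ", 1)[1]
    # Drop the closing '">' of the INFO header.
    fmt = fmt.rstrip().rstrip(">").rstrip('"')
    return fmt.split("|")


def parse_csq(info_value: str, fields: list) -> list:
    """Turn the text after 'CSQ=' into one dict per annotated feature."""
    return [
        dict(zip(fields, entry.split("|")))
        for entry in info_value.split(",")
    ]


# Shortened, hypothetical header and CSQ value for illustration only.
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    "annotations from Ensembl VEP. Format: Allele|Consequence|Feature|"
    'DownstreamProtein|ProteinLengthChange">'
)
csq = (
    "T|stop_lost|ENST00000359209.10||,"
    "-|frameshift_variant&stop_lost|ENST_EXAMPLE|SRP|3"
)

rows = parse_csq(csq, csq_fields(header))
for row in rows:
    print(row["Consequence"], repr(row["DownstreamProtein"]),
          repr(row["ProteinLengthChange"]))
```

In this made-up example the stop_lost-only entry has both fields empty, which mirrors what we see in our real output.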

and the VEP command we ran is

/usr/bin/perl -I /opt/lib/perl/VEP/Plugins /usr/bin/variant_effect_predictor.pl --vcf -term SO -transcript_version --offline --cache --symbol -o TCGA_300_stop_lost_vcf_input_format_v95_with_test.vcf -I TCGA_300_stop_lost_vcf_input_format.bed --synonyms /gscmnt/gc2560/core/model_data/2887491634/build50f99e75d14340ffb5b7d21b03887637/chromAlias.ensembl.txt --dir /gscmnt/gc2560/core/cwl/inputs/VEP_cache --check_existing --custom /gscmnt/gc2560/core/model_data/genome-db-ensembl-gnomad/2dd4b53431674786b760adad60a29273/fixed_b38_exome.vcf.gz,gnomADe,vcf,exact,1,AF,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_SAS --custom /gscmnt/gc2560/core/custom_clinvar_vcf/v20181028/custom.vcf.gz,clinvar,vcf,exact,1,CLINSIGN,PHENOTYPE,SCORE,RCVACC,TESTEDINGTR,PHENOTYPELIST,NUMSUBMIT,GUIDELINES --flag_pick --fasta /gscmnt/gc2560/core/model_data/2887491634/build21f22873ebe0486c8e6f69c15435aa96/all_sequences.fa --plugin Downstream --plugin Wildtype --everything --assembly GRCh38 --cache_version 95 --species homo_sapiens

helensch commented 4 years ago

Hi @susannasiebert

The Downstream plugin predicts the downstream effects of a frameshift variant on the protein sequence of a transcript. It does not predict for 'stop_lost'.

https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#downstream

Regards Helen

susannasiebert commented 4 years ago

Is there a supported plugin for this use case?

helensch commented 4 years ago

Hi @susannasiebert

There is not a supported plugin for this use case.

I will discuss with the team whether this functionality can be included in a plugin. However, it may only become available in the longer term.

Regards Helen

helensch commented 2 years ago

Hi @susannasiebert

Your request for including stop_lost was added to our work list for investigation.

I will close off this ticket, but we will contact you if we do make this change.

Please feel free to reopen the ticket or open a new one if you have further questions.

Regards Helen