Closed jasonwalker80 closed 7 years ago
I suspect this will add to the run time significantly. The VEP docs claim that 50-80% of the annotation time is spent creating HGVS notation: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_hgvs
Probably still worth it... Would be good to have the option at least. Can we try it and see what the performance is like and how good of a job it does?
bsub -q long -o $SOMATIC_HOME/logs/exome_vep.out -e $SOMATIC_HOME/logs/exome_vep.err -M 16000000 -R 'select[mem>16000] rusage[mem=16000]' /usr/bin/perl $SOMATIC_HOME/software/ensembl-vep/vep.pl -i $SOMATIC_HOME/exome.merged.fpfilter.pass.vcf.gz --offline --af_exac --coding_only --hgvs --cache --dir $VEP_CACHE --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --terms SO --flag_pick -o $SOMATIC_HOME/exome.merged.fpfilter.pass.annotated.vcf.gz
Job <1293598> is submitted to queue
Adding HGVS notation increased the exome VEP annotation run time ~10x: from ~5 minutes to ~50 minutes.
The VCF contains ~1200 variants.
@susannasiebert now that bam-readcount output is in the workflow, can we add this option to VEP? I think that's step 1 of a 2-part process. The second step is adding your parsing logic to the VCF to Table converter (probably post) to output the c. and p. notation as columns in the TSV file.
Absolutely. Do we want to always run VEP with this option or should it be optional in the workflow?
Given the significant amount of resources required, I think this should be optional. It's a CAP requirement, so for non-CLE projects it could save a significant amount of time when toggled off.
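A minimal sketch of what an optional toggle could look like, assuming the workflow assembles the VEP argument list in Python. The function name `build_vep_args` and its parameters are hypothetical; the flags are a subset of the ones in the bsub command above.

```python
def build_vep_args(input_vcf, output_vcf, cache_dir, hgvs=False):
    """Assemble a VEP argument list; --hgvs is only added when requested.

    Hypothetical helper: only flags already used in this thread appear here,
    and plugins and other options are left out for brevity.
    """
    args = [
        "vep.pl",
        "-i", input_vcf,
        "--offline",
        "--cache",
        "--dir", cache_dir,
        "--format", "vcf",
        "--vcf",
        "--symbol",
        "--terms", "SO",
        "--flag_pick",
        "-o", output_vcf,
    ]
    if hgvs:
        # HGVS notation is the expensive part, so it stays off by default.
        args.append("--hgvs")
    return args
```

With `hgvs=False` (the default) a non-CLE run would skip the HGVS step entirely.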
I'm working on the parser for the VEP CSQ field. The CSQ field will contain multiple entries for the various transcripts affected by the variant as well as the different alts that might be present. How do we want to represent this in the report tsv? Multiple alts are currently reported on the same line, i.e., one line per location. For the CSQ data, however, we would need to encode both the multiple alt alleles and the multiple transcripts per alt.
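To make the shape of the problem concrete, here is a rough sketch of splitting a CSQ value into one record per transcript, grouped by allele. It assumes entries are comma-separated and fields pipe-separated, with field names taken from the CSQ header line. `parse_csq` is a hypothetical name, not the converter's actual code.

```python
from collections import defaultdict


def parse_csq(csq_value, fields):
    """Group CSQ entries by their Allele field.

    csq_value: the text after "CSQ=" in the INFO column.
    fields: field names from the CSQ header, in order.
    Values beyond the named fields are ignored (zip stops at the shorter list).
    """
    by_allele = defaultdict(list)
    for entry in csq_value.split(","):
        values = entry.split("|")
        record = dict(zip(fields, values))
        by_allele[record.get("Allele", "")].append(record)
    return dict(by_allele)


# Shortened, made-up field list and values, for illustration only.
fields = ["Allele", "Feature", "HGVSc", "HGVSp", "PICK"]
csq = "A|ENST1|c.1A>T|p.M1L|1,A|ENST2|||,G|ENST3|c.1A>G||"
annotations = parse_csq(csq, fields)
```

Here `annotations["A"]` holds two transcript records and `annotations["G"]` holds one, which is the alt-by-transcript structure the report would have to flatten.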
With the VEP option we use, isn't one of the transcripts flagged as the canonical or primary feature?
Yes, and it picks one for the whole variant, regardless of alt alleles. So just use that one in the report?
For now, yes. Let's use the one VEP picks for the final text report.
Not all variants will have a canonical transcript and I believe that there can also be cases with multiple canonical transcripts. How should we handle these edge cases?
Summary of offline conversation:
@susannasiebert will look into variants without the 'PICK' tag. There SHOULD only be one annotation with that tag. The canonical status of a transcript is part of the pick order priority but is not the only criterion used to pick one consensus annotation per variant.
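A small sketch of how the report step might select the picked annotation, assuming each CSQ entry has already been turned into a dict keyed by the CSQ header names and that a picked entry has PICK set to "1". Returning None when nothing is flagged and raising on more than one flag are assumptions about how the edge cases could be surfaced, not agreed behaviour.

```python
def select_picked(records):
    """Return the single record flagged PICK=1, or None if there is none.

    records: list of dicts, one per CSQ entry for a variant.
    Raises ValueError if more than one entry carries the flag, since
    that would contradict the expectation of one pick per variant.
    """
    picked = [r for r in records if r.get("PICK") == "1"]
    if len(picked) > 1:
        raise ValueError("expected at most one PICK entry, found %d" % len(picked))
    return picked[0] if picked else None
```

Variants that come back as None are the ones to look into.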
The first example is a 37bp DEL.
The VEP CSQ header:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|SYMBOL_SOURCE|HGNC_ID|DownstreamProtein|ProteinLengthChange|WildtypeProtein
The full VCF entry:
chr1 155086199 . GTGCTGGGTGAGTCTGCGCAGCGCCCTCTGGTGGCCAC G . PASS AC=1;AF=0.250;AN=4;DP=119;END=155086236;GPV=1;HOMLEN=6;HOMSEQ=TGCTGG;IC=0;IHP=3;MQ=59.97;MQ0=0;NT=ref;QSI=230;QSI_NT=3070;RC=1;RU=TGCTGGGTGAGTCTGCGCAGCGCCCTCTGGTGGCCAC;SGT=ref->het;SOMATIC;SPV=5.4884e-06;SS=2;SSC=52;SVLEN=-37;SVTYPE=DEL;TQSI=1;TQSI_NT=1;set=filtered;CSQ=deletion|coding_sequence_variant&intron_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000368408|protein_coding|4/5|4/4|||651-?|581-?|194-?|||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000470294|processed_transcript||2/2||||||||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|non_coding_transcript_exon_variant&intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000498667|processed_transcript|3/4|3/3|||260-?|||||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|coding_sequence_variant&intron_variant&feature_truncation|MODIFIER|RP11-540D14.8|ENSG00000251246|Transcript|ENST00000505139|protein_coding|4/5|4/4|||642-?|566-?|189-?|||||1|||Clone_based_vega_gene|||||||||||||||| GT:AD:DP:DP4:FREQ:RD ./. ./. 0/0:0:65:45,20,0,0:
The other weirdness with the above CSQ annotation is that the allele portion is not the VEP-version of the actual alternate allele but the string "deletion". Another "allele" I've encountered is "RPL". I'm not sure how VEP makes that determination but it's not something that my converter is currently able to handle.
Not all indels are affected by this problem.
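For context, a sketch of the kind of VCF-alt to VEP-allele mapping a converter might do for ordinary indels. The convention assumed here (drop the first base when it matches the reference, and write an empty result as "-") is an assumption on my part and should be checked against the VEP docs; it does not account for symbolic strings in the allele column. `vep_style_alleles` is a hypothetical name.

```python
def vep_style_alleles(ref, alts):
    """Map each VCF ALT string to the allele string expected in CSQ.

    Assumed convention: for indels, drop the first base when it matches
    the first base of REF; an empty result is written as "-".
    Substitutions are returned unchanged.
    """
    is_indel = any(len(alt) != len(ref) for alt in alts)
    mapping = {}
    for alt in alts:
        if is_indel and alt[:1] == ref[:1]:
            trimmed = alt[1:]
            mapping[alt] = trimmed if trimmed else "-"
        else:
            mapping[alt] = alt
    return mapping
```

A lookup that finds none of the mapped alleles among the CSQ entries would then flag the variant for manual review instead of silently dropping its annotation.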
I sent an email to the Ensembl help desk (http://www.ensembl.org/Multi/Help/Contact). Their helpdesk ticketing system is private but I will report back with their reply.
For historical reference, https://jira.gsc.wustl.edu/browse/CI-49
Ensure the variants TSV file also contains c. and p. syntax for the variants called. Investigate VEP features that would provide this and how we then translate it to the TSV file. If it's a single INFO tag, it's easy, but I suspect it's embedded in the VEP annotation.
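One possible way to turn the HGVSc and HGVSp fields of a parsed CSQ record into the two TSV columns, sketched under the assumption that VEP writes them as "<feature ID>:<notation>". `hgvs_columns` is a hypothetical helper and the IDs in the example are made up.

```python
def hgvs_columns(record):
    """Return (c_notation, p_notation) for one CSQ record.

    record: dict keyed by CSQ header names.
    The transcript/protein ID prefix before the first ":" is dropped;
    missing or empty fields become empty strings.
    """
    def strip_prefix(value):
        return value.split(":", 1)[1] if ":" in value else value

    return (
        strip_prefix(record.get("HGVSc", "")),
        strip_prefix(record.get("HGVSp", "")),
    )


# Made-up identifiers, for illustration only.
example = {
    "HGVSc": "ENST00000000001.1:c.100A>G",
    "HGVSp": "ENSP00000000001.1:p.Thr34Ala",
}
c_col, p_col = hgvs_columns(example)
```

Records without HGVS values, like the transcripts in the 37bp DEL example, would simply yield two empty columns.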