genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

Variant Annotation Syntax #147

Closed jasonwalker80 closed 7 years ago

jasonwalker80 commented 7 years ago

Ensure the variants TSV file also contains a c. and p. syntax for the variants called. Investigate VEP features that would provide this and how we then translate it to the TSV file. If it's a single INFO tag, it's easy, but I suspect it's embedded in the VEP annotation.

jasonwalker80 commented 7 years ago

I suspect this will add significantly to the run time. The VEP docs claim that 50-80% of the annotation time is spent creating HGVS notation: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_hgvs

malachig commented 7 years ago

Probably still worth it... Would be good to have the option at least. Can we try it and see what the performance is like and how good of a job it does?

jasonwalker80 commented 7 years ago

bsub -q long -o $SOMATIC_HOME/logs/exome_vep.out -e $SOMATIC_HOME/logs/exome_vep.err -M 16000000 -R 'select[mem>16000] rusage[mem=16000]' /usr/bin/perl $SOMATIC_HOME/software/ensembl-vep/vep.pl -i $SOMATIC_HOME/exome.merged.fpfilter.pass.vcf.gz --offline --af_exac --coding_only --hgvs --cache --dir $VEP_CACHE --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --terms SO --flag_pick -o $SOMATIC_HOME/exome.merged.fpfilter.pass.annotated.vcf.gz

Job <1293598> is submitted to queue .

jasonwalker80 commented 7 years ago

Adding HGVS notation increased the exome VEP annotation run time roughly 10x, from ~5 minutes to ~50 minutes.

The VCF contains ~1200 variants.

jasonwalker80 commented 7 years ago

@susannasiebert now that bam-readcount output is in the workflow, can we add this option to VEP? I think that's step 1 of a two-part process. The second step is adding your parsing logic to the VCF to Table converter (probably post) to output the c. and p. notation as columns in the TSV file.

susannasiebert commented 7 years ago

Absolutely. Do we want to always run VEP with this option or should it be optional in the workflow?

jasonwalker80 commented 7 years ago

Given the significant resource cost, I think this should be optional. It's a CAP requirement, so for non-CLE projects toggling it off could save a significant amount of time.
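
One way to picture the toggle is a boolean switch that appends --hgvs only when requested. This is a minimal sketch in Python that reuses a subset of the flags from the bsub command earlier in this thread; the function name build_vep_args and its parameter names are hypothetical, and the actual workflow may wire the option differently.

```python
def build_vep_args(input_vcf, output_vcf, cache_dir, hgvs=False):
    """Return a VEP argument list; --hgvs is appended only when requested.

    Hypothetical helper for illustration; not part of the workflow itself.
    """
    args = [
        "-i", input_vcf,
        "--offline", "--cache", "--dir", cache_dir,
        "--format", "vcf", "--vcf",
        "--symbol", "--terms", "SO", "--flag_pick",
        "-o", output_vcf,
    ]
    if hgvs:
        # HGVS generation dominated the run time in the test above,
        # so it is opt-in rather than always on.
        args.append("--hgvs")
    return args
```

Defaulting hgvs to False keeps non-CLE runs fast; CLE runs would pass hgvs=True.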

susannasiebert commented 7 years ago

I'm working on the parser for the VEP CSQ field. The CSQ field will contain multiple entries for the various transcripts affected by the variant as well as the different alts that might be present. How do we want to represent this in the report TSV? Multiple alts are currently reported on the same line, i.e., one line per location. So we would need to encode the multiple alt alleles as well as the multiple transcripts per alt.
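
To make the nesting concrete, here is a rough Python sketch of splitting a CSQ value into one record per transcript, grouped by allele. It assumes the field names come from the pipe-delimited format string in the VCF header and that entries are comma-separated; parse_csq is a hypothetical name, and this is not necessarily how the actual converter is written.

```python
from collections import defaultdict


def parse_csq(csq_value, csq_format):
    """Group CSQ entries by their Allele field.

    csq_value:  the text after 'CSQ=' in the INFO column
    csq_format: pipe-delimited field names from the VCF header
    """
    fields = csq_format.split("|")
    by_allele = defaultdict(list)
    for entry in csq_value.split(","):
        # zip() stops at the shorter sequence, so an entry with more
        # values than the header has names does not raise an error.
        record = dict(zip(fields, entry.split("|")))
        by_allele[record["Allele"]].append(record)
    return dict(by_allele)
```

The result maps each alt allele to a list of per-transcript dictionaries, which is the two-level structure the report would have to flatten into a single line per location.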

jasonwalker80 commented 7 years ago

With the VEP option we use, isn't one of the transcripts flagged as the canonical or primary feature?

susannasiebert commented 7 years ago

Yes, and it picks one for the whole variant, regardless of alt alleles. So just use that one in the report?

jasonwalker80 commented 7 years ago

For now, yes. Let's use the one VEP picks for the final text report.

susannasiebert commented 7 years ago

Not all variants will have a canonical transcript and I believe that there can also be cases with multiple canonical transcripts. How should we handle these edge cases?

jasonwalker80 commented 7 years ago

Summary of offline conversation:

@susannasiebert will look into variants without the 'PICK' tag. There SHOULD only be one annotation with that tag. The canonical status of a transcript is part of the pick order priority, but it is not the only criterion used to pick one consensus annotation per variant.
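
The selection rule could be sketched in Python as follows, assuming each annotation has already been parsed into a dictionary keyed by the CSQ field names. The function names and the "-" placeholder are hypothetical, and returning nothing when zero or several annotations carry the tag is just one conservative choice, not something settled in this conversation.

```python
def picked_annotation(records):
    """Return the single record flagged PICK=1, or None otherwise.

    None covers both edge cases: no record has the tag, or more than
    one record has it.
    """
    picked = [r for r in records if r.get("PICK") == "1"]
    if len(picked) == 1:
        return picked[0]
    return None


def hgvs_columns(records, missing="-"):
    """Return (HGVSc, HGVSp) for the picked record, or placeholders."""
    record = picked_annotation(records)
    if record is None:
        return (missing, missing)
    # Empty strings (no HGVS computed for this transcript) also fall
    # back to the placeholder.
    return (record.get("HGVSc") or missing, record.get("HGVSp") or missing)
```

Keeping the "exactly one" check separate from the column formatting makes it easy to log or count the variants that fall into the edge cases.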

The first example is a 37bp DEL.

The VEP CSQ header:

Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|PICK|SYMBOL_SOURCE|HGNC_ID|DownstreamProtein|ProteinLengthChange|WildtypeProtein

The full VCF entry:

chr1    155086199    .    GTGCTGGGTGAGTCTGCGCAGCGCCCTCTGGTGGCCAC    G    .    PASS    AC=1;AF=0.250;AN=4;DP=119;END=155086236;GPV=1;HOMLEN=6;HOMSEQ=TGCTGG;IC=0;IHP=3;MQ=59.97;MQ0=0;NT=ref;QSI=230;QSI_NT=3070;RC=1;RU=TGCTGGGTGAGTCTGCGCAGCGCCCTCTGGTGGCCAC;SGT=ref->het;SOMATIC;SPV=5.4884e-06;SS=2;SSC=52;SVLEN=-37;SVTYPE=DEL;TQSI=1;TQSI_NT=1;set=filtered;CSQ=deletion|coding_sequence_variant&intron_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000368408|protein_coding|4/5|4/4|||651-?|581-?|194-?|||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000470294|processed_transcript||2/2||||||||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|non_coding_transcript_exon_variant&intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER|EFNA3|ENSG00000143590|Transcript|ENST00000498667|processed_transcript|3/4|3/3|||260-?|||||||1|||HGNC|HGNC:3223|||||||||||||||,deletion|coding_sequence_variant&intron_variant&feature_truncation|MODIFIER|RP11-540D14.8|ENSG00000251246|Transcript|ENST00000505139|protein_coding|4/5|4/4|||642-?|566-?|189-?|||||1|||Clone_based_vega_gene||||||||||||||||    GT:AD:DP:DP4:FREQ:RD    ./.    ./.    0/0:0:65:45,20,0,0:

susannasiebert commented 7 years ago

The other oddity with the above CSQ annotation is that the allele portion is not the VEP version of the actual alternate allele but the string "deletion". Another "allele" I've encountered is "RPL". I'm not sure how VEP makes that determination, but it's not something that my converter is currently able to handle.
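
Until that is understood, a converter could at least detect such entries and set them aside. The Python sketch below assumes that a regular VEP allele is either "-" (a pure deletion) or a string of uppercase bases; that assumption and the name is_sequence_allele are mine, not something confirmed by the VEP documentation in this thread.

```python
import re

# Assumption: a sequence allele is "-" or one or more uppercase bases.
_SEQUENCE_ALLELE = re.compile(r"-|[ACGTN]+")


def is_sequence_allele(allele):
    """Return True if the CSQ Allele value looks like an actual sequence."""
    return _SEQUENCE_ALLELE.fullmatch(allele) is not None
```

With this check, "deletion" and "RPL" are both rejected, so those annotations can be reported separately instead of breaking the allele matching.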

susannasiebert commented 7 years ago

Not all indels are affected by this problem.

susannasiebert commented 7 years ago

I sent an email to the Ensembl help desk (http://www.ensembl.org/Multi/Help/Contact). Their helpdesk ticketing system is private but I will report back with their reply.

jasonwalker80 commented 7 years ago

For historical reference, https://jira.gsc.wustl.edu/browse/CI-49