Closed naumenko-sa closed 7 years ago
Hi Sergey, what version of gemini are you using? Can you show some output of your query? When running this on test.query.db from the gemini test-suite, I see, e.g.:
chr1|69270|69270|A|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|S60|305
chr1|69428|69428|T|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|F113C|305
chr1|69511|69511|A|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|T141A|305
chr1|69761|69761|A|T|OR4F5|OR4F5|ENST00000335137|ENST00000335137|D224V|305
chr1|69871|69871|G|A|OR4F5|OR4F5|ENST00000335137|ENST00000335137|D261N|305
chr1|69897|69897|T|C|OR4F5|OR4F5|ENST00000335137|ENST00000335137|S269|305
where all transcripts match, likely because there is only 1. at that location, but later, I see:
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000420190|L177|178
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000342066|L177|681
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000478729||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000464948||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000466827||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000474461||
where nearly all transcripts are different.
Hi Brent!
I'm using gemini 0.19.1 installed by bcbio.
gemini --version
gemini 0.19.1
I see just 1 impact per gene (all impacts, coding and non-coding):
select count(*) from variant_impacts;
144942
chr1 860521 860521 C A SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 861630 861630 G A SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 866319 866319 G A SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 866511 866511 C CCCCT SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 871334 871334 G T SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 874950 874950 T TCCCTGGAGGACC SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
chr1 876499 876499 A G SAMD11 SAMD11 ENST00000342066 ENST00000342066 -/681
It is when I'm using VEP annotation with HGVSc,HGVSp. When using SNPeff I see many impacts for a gene:
select count(*) from variant_impacts;
737809
chr1 866319 866319 G A SAMD11 SAMD11 ENST00000437963 ENST00000437963 -/109
chr1 866319 866319 G A SAMD11 SAMD11 ENST00000437963 ENST00000342066 -/681
chr1 866319 866319 G A SAMD11 SAMD11 ENST00000437963 ENST00000341065 -/589
chr1 866319 866319 G A SAMD11 SAMD11 ENST00000437963 ENST00000420190 -/179
I need both VEP and impacts for all transcripts. Thanks, Sergey
I'm having trouble understanding the problem. Can you share a small VCF that I can use to recreate?
Sure, https://drive.google.com/open?id=0B_bLL10GwDnsem9BS2tIY2tvbW8
As far as I see, the problem is that VEP annotated many impacts of a variant (for each transcript), but only one effect per variant was loaded into variant_impacts table.
Look at the first variant in the vcf file.
gunzip -c NA12878-1-ensemble-decompose.head1000.vcf.gz | grep -v "^#" | head -n1
1 14653 . C T 226.95 PASS AB=0.473684;ABP=3.35316;AC=1;AF=0.5;AN=2;ANN=T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000515242|transcribed_unprocessed_pseudogene||n.*241C>T|||||241|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000518655|transcribed_unprocessed_pseudogene||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000450305|transcribed_unprocessed_pseudogene||n.*983C>T|||||983|,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000488147|unprocessed_pseudogene|10/10|n.1254-152G>A||||||,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000538476|unprocessed_pseudogene|12/12|n.1492-151G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000438504|unprocessed_pseudogene|12/12|n.1493G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000541675|unprocessed_pseudogene|9/9|n.1126G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000423562|unprocessed_pseudogene|10/10|n.1379G>A||||||;AO=27;CALLERS=freebayes,samtools;CIGAR=1X;DP=57;DPB=57;DPRA=0;EPP=16.6021;EPPR=3.08518;GTI=0;LEN=1;MEANALT=2;MQM=23.5556;MQMR=22.4828;NS=1;NUMALT=1;ODDS=52.2581;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=886;QR=899;RO=29;RPL=1;RPP=53.2759;RPPR=6.67934;RPR=26;RUN=1;SAF=21;SAP=21.1059;SAR=6;SRF=14;SRP=3.08518;SRR=15;TYPE=snp;technology.illumina=1 GT:GQ:DP:RO:QR:AO:QA:GL 0/1:145:57:29:899:27:886:-37.2171,0,-35.1948
In the database:
select * from variant_impacts where variant_id=1
...> ;
1|1|WASH7P|ENST00000438504|0|0|0|12/12||||unprocessed_pseudogene|non_coding_transcript_exon_variant|non_coding_transcript_exon_variant|LOW|||||YES|||||ENST00000438504.2:n.1493G>A|
Loading command was:
[2016-11-28T23:09Z] export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH && /home/naumenko/work/tools/bcbio/anaconda/bin/gemini load --passonly --skip-gerp-bp -v /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/NA12878-1-ensemble-decompose-vepeffects.vcf.gz -t VEP --cores 5 --tempdir /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/tx/tmpVnnn6i /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/tx/tmpVnnn6i/NA12878-1-ensemble.db
Thanks, Sergey
that VCF was annotated with snpEff, not VEP. I suspect you'll have a better result if you specify -t snpEff
I did this:
gemini load -t snpEff -p issue798.ped -v $vcf issue798.db
and had variant_impacts filled with difference transcripts for your example vcf.
Thanks Brent. You are right - this vcf was annotated with snpEFF, it had ANN field rather than CSQ. Sorry, it was my fault. I was not able to share real samples with you, so I tried NA12878, which was processed with snpEFF.
Anyway, my VEP annotated vcf's contain just 1 effect per variant. But this is not because of GEMINI, I discovered that the pipeline uses --pick option of VEP http://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html, which forces VEP to write exactly one effect per variant, and then GEMINI loads this into the database.
Thank you for you time spent on this issue, finally you helped me to resolve it.
Sergey
Good morning!
Thanks for the excellent framework.
I wanted to extract all of the impacts for the mutations, including all of the transcripts affected.
It tried the query:
And it seems that variant_impacts.transcript has the same value for all of the transcripts of a particular gene - it is the value of variants.transcript. However aa_change and aa_length are different for different transcript, as it should be.
How to fix the transcript names, or what am I doing wrong?
Sergey