arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

Problems with variant_impacts.transcript field #798

Closed naumenko-sa closed 7 years ago

naumenko-sa commented 7 years ago

Good morning!

Thanks for the excellent framework.

I wanted to extract all of the impacts for the mutations, including all of the transcripts affected.

It tried the query:

select v.chrom, v.start+1,v.end, v.ref, v.alt, v.gene,vi.gene, v.transcript, vi.transcript,
                             vi.aa_change,vi.aa_length 
                             from variant_impacts vi, variants v 
                             where vi.variant_id=v.variant_id and  v.is_coding = 1

And it seems that variant_impacts.transcript has the same value for all of the transcripts of a particular gene - it is the value of variants.transcript. However aa_change and aa_length are different for different transcript, as it should be.

How to fix the transcript names, or what am I doing wrong?

Sergey

brentp commented 7 years ago

Hi Sergey, what version of gemini are you using? Can you show some output of your query? When running this on test.query.db from the gemini test-suite, I see, e.g.:

chr1|69270|69270|A|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|S60|305
chr1|69428|69428|T|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|F113C|305
chr1|69511|69511|A|G|OR4F5|OR4F5|ENST00000335137|ENST00000335137|T141A|305
chr1|69761|69761|A|T|OR4F5|OR4F5|ENST00000335137|ENST00000335137|D224V|305
chr1|69871|69871|G|A|OR4F5|OR4F5|ENST00000335137|ENST00000335137|D261N|305
chr1|69897|69897|T|C|OR4F5|OR4F5|ENST00000335137|ENST00000335137|S269|305

where all transcripts match, likely because there is only 1. at that location, but later, I see:

chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000420190|L177|178
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000342066|L177|681
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000478729||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000464948||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000466827||
chr1|874665|874665|G|A|SAMD11|SAMD11|ENST00000420190|ENST00000474461||

where nearly all transcripts are different.

naumenko-sa commented 7 years ago

Hi Brent!

I'm using gemini 0.19.1 installed by bcbio.

gemini --version
gemini 0.19.1

I see just 1 impact per gene (all impacts, coding and non-coding):

select count(*) from variant_impacts;
144942
chr1    860521  860521  C   A   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    861630  861630  G   A   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    866319  866319  G   A   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    866511  866511  C   CCCCT   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    871334  871334  G   T   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    874950  874950  T   TCCCTGGAGGACC   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681
chr1    876499  876499  A   G   SAMD11  SAMD11  ENST00000342066 ENST00000342066 -/681

It is when I'm using VEP annotation with HGVSc,HGVSp. When using SNPeff I see many impacts for a gene:

select count(*) from variant_impacts;
737809
chr1    866319  866319  G   A   SAMD11  SAMD11  ENST00000437963 ENST00000437963 -/109
chr1    866319  866319  G   A   SAMD11  SAMD11  ENST00000437963 ENST00000342066 -/681
chr1    866319  866319  G   A   SAMD11  SAMD11  ENST00000437963 ENST00000341065 -/589
chr1    866319  866319  G   A   SAMD11  SAMD11  ENST00000437963 ENST00000420190 -/179

I need both VEP and impacts for all transcripts. Thanks, Sergey

brentp commented 7 years ago

I'm having trouble understanding the problem. Can you share a small VCF that I can use to recreate?

naumenko-sa commented 7 years ago

Sure, https://drive.google.com/open?id=0B_bLL10GwDnsem9BS2tIY2tvbW8

As far as I see, the problem is that VEP annotated many impacts of a variant (for each transcript), but only one effect per variant was loaded into variant_impacts table.

Look at the first variant in the vcf file.

gunzip -c NA12878-1-ensemble-decompose.head1000.vcf.gz | grep -v "^#" | head -n1
1   14653   .   C   T   226.95  PASS    AB=0.473684;ABP=3.35316;AC=1;AF=0.5;AN=2;ANN=T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000456328|processed_transcript||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000515242|transcribed_unprocessed_pseudogene||n.*241C>T|||||241|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000518655|transcribed_unprocessed_pseudogene||n.*244C>T|||||244|,T|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|transcript|ENST00000450305|transcribed_unprocessed_pseudogene||n.*983C>T|||||983|,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000488147|unprocessed_pseudogene|10/10|n.1254-152G>A||||||,T|intron_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000538476|unprocessed_pseudogene|12/12|n.1492-151G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000438504|unprocessed_pseudogene|12/12|n.1493G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000541675|unprocessed_pseudogene|9/9|n.1126G>A||||||,T|non_coding_transcript_exon_variant|MODIFIER|WASH7P|ENSG00000227232|transcript|ENST00000423562|unprocessed_pseudogene|10/10|n.1379G>A||||||;AO=27;CALLERS=freebayes,samtools;CIGAR=1X;DP=57;DPB=57;DPRA=0;EPP=16.6021;EPPR=3.08518;GTI=0;LEN=1;MEANALT=2;MQM=23.5556;MQMR=22.4828;NS=1;NUMALT=1;ODDS=52.2581;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=886;QR=899;RO=29;RPL=1;RPP=53.2759;RPPR=6.67934;RPR=26;RUN=1;SAF=21;SAP=21.1059;SAR=6;SRF=14;SRP=3.08518;SRR=15;TYPE=snp;technology.illumina=1 GT:GQ:DP:RO:QR:AO:QA:GL 0/1:145:57:29:899:27:886:-37.2171,0,-35.1948

In the database:

select * from variant_impacts where variant_id=1
   ...> ;
1|1|WASH7P|ENST00000438504|0|0|0|12/12||||unprocessed_pseudogene|non_coding_transcript_exon_variant|non_coding_transcript_exon_variant|LOW|||||YES|||||ENST00000438504.2:n.1493G>A|

Loading command was:

[2016-11-28T23:09Z] export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH &&  /home/naumenko/work/tools/bcbio/anaconda/bin/gemini  load  --passonly --skip-gerp-bp  -v /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/NA12878-1-ensemble-decompose-vepeffects.vcf.gz -t VEP --cores 5 --tempdir /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/tx/tmpVnnn6i /hpf/largeprojects/ccmbio/naumenko/validation/NA12878-exome-eval/work/gemini/tx/tmpVnnn6i/NA12878-1-ensemble.db

Thanks, Sergey

brentp commented 7 years ago

that VCF was annotated with snpEff, not VEP. I suspect you'll have a better result if you specify -t snpEff

I did this:

gemini load -t snpEff -p issue798.ped -v $vcf issue798.db

and had variant_impacts filled with difference transcripts for your example vcf.

naumenko-sa commented 7 years ago

Thanks Brent. You are right - this vcf was annotated with snpEFF, it had ANN field rather than CSQ. Sorry, it was my fault. I was not able to share real samples with you, so I tried NA12878, which was processed with snpEFF.

Anyway, my VEP annotated vcf's contain just 1 effect per variant. But this is not because of GEMINI, I discovered that the pipeline uses --pick option of VEP http://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html, which forces VEP to write exactly one effect per variant, and then GEMINI loads this into the database.

Thank you for you time spent on this issue, finally you helped me to resolve it.

Sergey