arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

silent errors filling sqlite tables #479

Closed mjsduncan closed 9 years ago

mjsduncan commented 9 years ago

greetings, gemini wizards! i've encountered a strange problem that could be a sqlite bug, a gemini (v 0.15.1) bug, or a cryptically malformed vcf, or some combo of these. the only explicit error message is that the indexing process gets "killed". the tables are missing values for many variables and one is clearly incorrect.

i started with a large but simple vcf consisting only of SNPs; the standard normalization in vt didn't change anything and VEP annotation (v80) had no errors. the VEP annotation is large, with 51 columns in the info/csq tag.

vt peek simpleSNP_VEP.vcf.gz

peek v0.5
options:     input VCF file            simpleSNP_VEP.vcf.gz

stats: no. of samples                     :        783
   no. of chromosomes                 :         24

   ========== Micro variants ==========

   no. of SNP                         :     846469
       2 alleles                      :          846469 (4.26) [685604/160865]

   ++++++ Other useful categories +++++

   ============== VNTR ===============

   no. of VNTRs                       :          0

   ======= Structural variants ========

   no. of structural variants         :          0

   ========= General summary ==========

   no. of reference records                  :          0
   no. of classified variants                :     846469
   no. of unclassified variants              :          0

   no. of VCF records                        :     846469

the loading process completed successfully, except with the output "killed" after a statement indicating the start of the indexing process:

gemini load --cores 6 -t VEP -v simpleSNP_VEP.vcf --tempdir . simpleSNP_VEP.db

... Done loading 846469 variants in 6 chunks. 2015-06-16 11:16:17 merging 6 chunks. 2015-06-16 11:16:51 merging 3 chunks. 2015-06-16 11:17:01 indexing final database. indexing all columns execpt 'gts'; to index that column, run gemini bcolz_index simpleSNP_VEP.db --cols gts loading 846469 variants for 783 samples into bcolz Killed

i ran the indexing command and watched the system monitor graphical display that is default on my currently updated ubuntu 14.04 installation:

gemini bcolz_index simpleSNP_VEP.db

indexing all columns execpt 'gts'; to index that column, run gemini bcolz_index simpleSNP_VEP.db --cols gts loading 846469 variants for 783 samples into bcolz Killed

the process filled up 16G of ram and then the entire 16G on the swap drive before the "killed" line appeared and then the process ended.

here are examples of abberent queries, written to give a sense of what's missing from the tables and what's not:

gemini query --header -q "select variant_id, chrom, start, ref, alt, gene, transcript, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, is_lof, biotype from variants" simpleSNP_VEP.db > variantsTable cat variantsTable | grep LINC01128

nothing is output.

head variantsTable

variant_id chrom start ref alt gene transcript num_hom_ref num_het num_hom_alt num_unknown aaf is_lof biotype 1 chr1 752565 G A None None 526 238 19 0 0.176245210728 None None 2 chr1 752720 A G None None 521 236 20 6 0.177606177606 None None 3 chr1 753404 C A None None 576 192 0 15 0.125 None None 4 chr1 754181 A G None None 570 194 17 2 0.145966709347 None None 5 chr1 760911 C T None None 527 237 19 0 0.175606641124 None None 6 chr1 768447 G A None None 605 167 11 0 0.120689655172 None None 7 chr1 776545 A G None None 426 305 50 2 0.259282970551 None None 8 chr1 779321 A G None None 582 182 19 0 0.140485312899 None None 9 chr1 785988 T C None None 575 188 20 0 0.145593869732 None None

the gene and transcript columns are "None" for the entire table, as well as _islof and biotype. this was confirmed by looking at output of gemini query --header -q "select * from variants"

in the variant_impact table, the gene and transcipt columns are filled, but the impact is a base(!?) and the _impactseverity is missing.

gemini query --header -q "select variant_id, gene, transcript, is_lof, exon, biotype, impact, impact_severity from variant_impacts" simpleSNP_VEP.db > variant_impactsTable cat variant_impactsTable | grep LINC01128

5 LINC01128 ENST00000445118 0 None lincRNA T None 6 LINC01128 ENST00000445118 0 None lincRNA A None 7 LINC01128 ENST00000445118 0 None lincRNA G None 8 LINC01128 ENST00000445118 0 None lincRNA G None 9 LINC01128 ENST00000445118 0 None lincRNA C None

head variant_impactsTable

gene transcript is_lof exon biotype impact impact_severity RP11-206L10.10 ENST00000435300 0 None processed_transcript A None RP11-206L10.10 ENST00000435300 0 None processed_transcript G None FAM87B ENST00000326734 0 1/2 lincRNA A None FAM87B ENST00000326734 0 2/2 lincRNA G None LINC01128 ENST00000445118 0 None lincRNA T None LINC01128 ENST00000445118 0 None lincRNA A None LINC01128 ENST00000445118 0 None lincRNA G None LINC01128 ENST00000445118 0 None lincRNA G None LINC01128 ENST00000445118 0 None lincRNA C None

a query with a join works superficially but there is a mismatch in the column headers and the columns: duplicate headers are kept but not duplicate columns.

gemini query --header -q "select v.variant_id, v.chrom, v.gene, v.transcript, v.aaf, v.is_lof, v.biotype, i.gene, i.transcript, i.is_lof, i.exon, i.biotype, i.impact, i.impact_severity, i.exon from variants v, variant_impacts i where v.variant_id=i.variant_id" simpleSNP_VEP.db > simpleSNPtest cat simpleSNPtest | grep LINC01128

5 chr1 LINC01128 ENST00000445118 0.175606641124 0 lincRNA None T None 6 chr1 LINC01128 ENST00000445118 0.120689655172 0 lincRNA None A None 7 chr1 LINC01128 ENST00000445118 0.259282970551 0 lincRNA None G None 8 chr1 LINC01128 ENST00000445118 0.140485312899 0 lincRNA None G None 9 chr1 LINC01128 ENST00000445118 0.145593869732 0 lincRNA None C None

head simpleSNPtest

variant_id chrom gene transcript aaf is_lof biotype gene transcript is_lof exon biotype impact impact_severity exon 1 chr1 RP11-206L10.10 ENST00000435300 0.176245210728 0 processed_transcript None A None 2 chr1 RP11-206L10.10 ENST00000435300 0.177606177606 0 processed_transcript None G None 3 chr1 FAM87B ENST00000326734 0.125 0 lincRNA 1/2 A None 4 chr1 FAM87B ENST00000326734 0.145966709347 0 lincRNA 2/2 G None 5 chr1 LINC01128 ENST00000445118 0.175606641124 0 lincRNA None T None 6 chr1 LINC01128 ENST00000445118 0.120689655172 0 lincRNA None A None 7 chr1 LINC01128 ENST00000445118 0.259282970551 0 lincRNA None G None 8 chr1 LINC01128 ENST00000445118 0.140485312899 0 lincRNA None G None 9 chr1 LINC01128 ENST00000445118 0.145593869732 0 lincRNA None C None

unfortunately the data is proprietary so i can't send the vcf file but i've attached the header plus the first few lines with the sample columns truncated.

thanks for all your work on this project! let me know what else i can do to help you understand what is going on.

mike d

testheader | uploaded via ZenHub

brentp commented 9 years ago

I'm looking into this.

brentp commented 9 years ago

@mjsduncan I'm working on the part about missing columns (that's unrelated to your loading).

But the loading is fixed in the commit above. You could test it with:

git clone -b inheritance-revamp https://github.com/brentp/gemini/

and we'll also be making a release soon once I address the other problem you noted here.

mjsduncan commented 9 years ago

thanks for the quick response! i installed your branch and then ran gemini bcolz_index and got the same response (swap filled up and then "killed" returned. but then i did gemini update and got:

Gemini data files updated From https://github.com/arq5x/gemini

  • branch master -> FETCH_HEAD Already up-to-date. HEAD is now at 22f8e6c Merge branch 'master' of github.com:arq5x/gemini

even though git status in the gemini repo folder gives

On branch inheritance-revamp Your branch is up-to-date with 'origin/inheritance-revamp'.

so am i still using the old version? i deleted the old repo folder but maybe this is a python thing i don't understand. or do i need to reload the vcf.gz file again for the fix to work?

thanks again for your help and have a good weekend ;)

brentp commented 9 years ago

yeah, this is probably a python issue. We'll make a release today or tomorrow (we hope); it'd be great if you could check it. thanks again for reporting.

mjsduncan commented 9 years ago

ok, sorry for the delay, i've been traveling. i re-loaded the vcf into gemini 0.16.0 without a problem, memory management was appropriate. there still seem to be errors in filling some of the columns:

again, this was loaded into version 0.16.0, i'll try again when you push 0.16.2 but i wanted to report the successful memory management during loading.

cheers! mike d

arq5x commented 9 years ago

Could you share the command you used for VEP?

mjsduncan commented 9 years ago

gemini load --cores 8 -t VEP -v wellVEP.vcf.gz --tempdir . simpleSNP_VEP.db

arq5x commented 9 years ago

I meant the command you used when running VEP itself to annotate your VCF.

mjsduncan commented 9 years ago

oops, sorry!

perl variant_effect_predictor.pl -v -i snp.vcf.gz -o snpVEP.vcf --vcf --port 3337 --buffer_size 10000 --sift b --polyphen b --humdiv --symbol --numbers --biotype --total_length --canonical --ccds --cache --regulatory --pick --pubmed --plugin CSN --fasta Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --plugin Carol --plugin LoFtool --plugin LoF,human_ancestor_fa:human_ancestor.fa --plugin CADD,ExAC.r0.2.tsv.gz --plugin ExAC,ExAC.r0.3.sites.vep.vcf.gz --fork 8 (file paths removed)

the header and a couple variant lines are attached in the original post showing VEP info/CSQ tag examples.

arq5x commented 9 years ago

Looks like the issue is that VEP needs to be run using the prescribed options in the GEMINI docs: http://gemini.readthedocs.org/en/latest/content/functional_annotation.html#running-vep