Closed: mjsduncan closed this issue 9 years ago
I'm looking into this.
@mjsduncan I'm working on the part about missing columns (that's unrelated to your loading).
But the loading is fixed in the commit above. You could test it with:
git clone -b inheritance-revamp https://github.com/brentp/gemini/
and we'll also be making a release soon once I address the other problem you noted here.
thanks for the quick response! i installed your branch and then ran gemini bcolz_index
and got the same response (swap filled up and then "killed" was returned). but then i did gemini update
and got:
Gemini data files updated
From https://github.com/arq5x/gemini
- branch master -> FETCH_HEAD
Already up-to-date.
HEAD is now at 22f8e6c Merge branch 'master' of github.com:arq5x/gemini
even though git status
in the gemini repo folder gives
On branch inheritance-revamp
Your branch is up-to-date with 'origin/inheritance-revamp'.
so am i still using the old version? i deleted the old repo folder but maybe this is a python thing i don't understand. or do i need to reload the vcf.gz file again for the fix to work?
thanks again for your help and have a good weekend ;)
yeah, this is probably a python issue. We'll make a release today or tomorrow (we hope); it'd be great if you could check it. thanks again for reporting.
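One quick way to see which copy of a package Python actually imports is to ask the import system where it would load it from. A minimal sketch, assuming the package imports under the name `gemini`; the helper name `module_origin` is made up for illustration:

```python
import importlib.util


def module_origin(name):
    """Return the file Python would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # For a regular package this is the path to its __init__.py.
    return spec.origin
```

Calling `module_origin("gemini")` with the same Python interpreter that runs the `gemini` command should show whether the import resolves to the freshly cloned branch or to an older installed copy.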
ok, sorry for the delay, i've been traveling. i re-loaded the vcf into gemini 0.16.0 without a problem, memory management was appropriate. there still seem to be errors in filling some of the columns:
again, this was loaded into version 0.16.0, i'll try again when you push 0.16.2 but i wanted to report the successful memory management during loading.
cheers! mike d
Could you share the command you used for VEP?
gemini load --cores 8 -t VEP -v wellVEP.vcf.gz --tempdir . simpleSNP_VEP.db
I meant the command you used when running VEP itself to annotate your VCF.
oops, sorry!
perl variant_effect_predictor.pl -v -i snp.vcf.gz -o snpVEP.vcf --vcf --port 3337 --buffer_size 10000 --sift b --polyphen b --humdiv --symbol --numbers --biotype --total_length --canonical --ccds --cache --regulatory --pick --pubmed --plugin CSN --fasta Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --plugin Carol --plugin LoFtool --plugin LoF,human_ancestor_fa:human_ancestor.fa --plugin CADD,ExAC.r0.2.tsv.gz --plugin ExAC,ExAC.r0.3.sites.vep.vcf.gz --fork 8
(file paths removed)
the header and a couple variant lines are attached in the original post showing VEP info/CSQ tag examples.
Looks like the issue is that VEP needs to be run using the prescribed options in the GEMINI docs: http://gemini.readthedocs.org/en/latest/content/functional_annotation.html#running-vep
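To compare a file's annotation layout with the options the docs prescribe, it can help to list the field order declared in the CSQ header line. A rough sketch, assuming the header description carries the usual `Format: A|B|C` list that VEP writes with `--vcf`; the function name and the sample line in the usage note are illustrative and not taken from the attached header:

```python
import re


def csq_fields(header_line):
    """Extract the '|'-separated field names from a VEP CSQ ##INFO header line."""
    # The field list sits between 'Format: ' and the closing double quote.
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        return []
    return match.group(1).strip().split("|")
```

For example, `csq_fields('##INFO=<ID=CSQ,Number=.,Type=String,Description="Example. Format: Allele|Gene|Feature">')` yields the three names in declared order, which can then be checked by eye against the field order given in the GEMINI documentation.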
greetings, gemini wizards! i've encountered a strange problem that could be a sqlite bug, a gemini (v 0.15.1) bug, or a cryptically malformed vcf, or some combo of these. the only explicit error message is that the indexing process gets "killed". the tables are missing values for many variables and one is clearly incorrect.
i started with a large but simple vcf consisting only of SNPs; the standard normalization in vt didn't change anything and VEP annotation (v80) had no errors. the VEP annotation is large, with 51 columns in the info/csq tag.
vt peek simpleSNP_VEP.vcf.gz
the loading process completed successfully, except with the output "killed" after a statement indicating the start of the indexing process:
gemini load --cores 6 -t VEP -v simpleSNP_VEP.vcf --tempdir . simpleSNP_VEP.db
i ran the indexing command and watched the system monitor graphical display that is default on my currently updated ubuntu 14.04 installation:
gemini bcolz_index simpleSNP_VEP.db
the process filled up 16G of ram and then the entire 16G on the swap drive before the "killed" line appeared and then the process ended.
here are examples of aberrant queries, written to give a sense of what's missing from the tables and what's not:
gemini query --header -q "select variant_id, chrom, start, ref, alt, gene, transcript, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, is_lof, biotype from variants" simpleSNP_VEP.db > variantsTable
cat variantsTable | grep LINC01128
nothing is output.
head variantsTable
the gene and transcript columns are "None" for the entire table, as well as is_lof and biotype. this was confirmed by looking at output of
gemini query --header -q "select * from variants"
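Rather than scanning the output by eye, the columns that are "None" on every row can be listed with a short script. A minimal sketch, assuming the saved query output is tab-separated with a header row and that missing values are printed as the literal string `None`; the function name is hypothetical:

```python
import csv


def all_none_columns(lines, missing="None"):
    """Return header names whose value equals `missing` on every data row."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return []
    # Start by assuming every column is empty, then drop any that hold a real value.
    empty = set(header)
    for row in reader:
        for name, value in zip(header, row):
            if value != missing:
                empty.discard(name)
    return [name for name in header if name in empty]
```

It accepts any iterable of lines, so an open file handle on a saved table such as `variantsTable` can be passed directly.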
in the variant_impacts table, the gene and transcript columns are filled, but the impact is a base(!?) and the impact_severity is missing.
gemini query --header -q "select variant_id, gene, transcript, is_lof, exon, biotype, impact, impact_severity from variant_impacts" simpleSNP_VEP.db > variant_impactsTable
cat variant_impactsTable | grep LINC01128
head variant_impactsTable
a query with a join works superficially but there is a mismatch in the column headers and the columns: duplicate headers are kept but not duplicate columns.
gemini query --header -q "select v.variant_id, v.chrom, v.gene, v.transcript, v.aaf, v.is_lof, v.biotype, i.gene, i.transcript, i.is_lof, i.exon, i.biotype, i.impact, i.impact_severity, i.exon from variants v, variant_impacts i where v.variant_id=i.variant_id" simpleSNP_VEP.db > simpleSNPtest
cat simpleSNPtest | grep LINC01128
head simpleSNPtest
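One way a header/column mismatch like this can arise is when result rows are keyed by column name, so that two columns both called `gene` collapse into one value while the header keeps both. Whether GEMINI does this internally is an assumption here; the sketch below uses plain `sqlite3` on a made-up two-table database, not `gemini query`, and shows that giving each column an alias avoids the collision:

```python
import sqlite3


def run_join(sql):
    """Run `sql` on a tiny in-memory stand-in database; return (header, rows as dicts)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE variants (variant_id INTEGER, gene TEXT);
        CREATE TABLE variant_impacts (variant_id INTEGER, gene TEXT);
        INSERT INTO variants VALUES (1, NULL);
        INSERT INTO variant_impacts VALUES (1, 'LINC01128');
    """)
    cur = con.execute(sql)
    header = [d[0] for d in cur.description]
    # Keying each row by column name silently merges same-named columns.
    rows = [dict(zip(header, r)) for r in cur.fetchall()]
    con.close()
    return header, rows


PLAIN = ("select v.variant_id, v.gene, i.gene from variants v, variant_impacts i "
         "where v.variant_id = i.variant_id")
ALIASED = ("select v.variant_id, v.gene as v_gene, i.gene as i_gene "
           "from variants v, variant_impacts i where v.variant_id = i.variant_id")
```

With `PLAIN` the header has three entries but each row dict has only two keys; with `ALIASED` both have three. If the query tool accepts `AS` aliases, the same renaming may be worth trying on the join above.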
unfortunately the data is proprietary so i can't send the vcf file but i've attached the header plus the first few lines with the sample columns truncated.
thanks for all your work on this project! let me know what else i can do to help you understand what is going on.
mike d
testheader | uploaded via ZenHub