arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

GEMINI silently discards valid VCF records when loading in multi-core mode #369

Closed lbeltrame closed 9 years ago

lbeltrame commented 9 years ago

gemini load sometimes does not load valid VCF records.

For example (annotated with snpEff: sensitive data removed):

chr17   41223076        .       T       A       0.0     PASS  

will not be loaded without any error. In particular this is a multi-sample line where only one sample is called, the rest has no calls (resulted from a join with the GATK's CombineVariants).

There are no traces of it in the GEMINI database, created with:

gemini load -t snpEff  --passonly --skip-cadd --skip-gene-tables -v variants.vcf.gz --skip-gerp-bp ../databases/variants.db

Querying:

gemini query -q 'select variant_id from variants where ref="T" and alt="A" and chrom="chr17" and start=41223075' ../databases/variants.db --cores=7

does not yield any result. In fact, there is no record with that REF and ALT in database, related to that gene and sample.

When GEMINI is run in single-core mode, the record is added instead.

arq5x commented 9 years ago

Thanks for reporting this. Just so I understand, it this a case where the input has a mixed set of records in the sense that some lines have genotypes for all samples while other lines have genotypes for only a subset of the records? In other words, the lost records are those with genotypes < total number of samples in the header?

lbeltrame commented 9 years ago

In data venerdì 23 gennaio 2015 07:48:59, Aaron Quinlan ha scritto:

subset of the records? In other words, the lost records are those with genotypes < total number of samples in the header?

The fun thing is that while this is like you say, it is not always like this. Most of the records have incomplete genotypes and they went through fine.

I suspect there's some race or strange behavior when there are multiple processes at bay (or perhaps during merging?), because nothing of the sort occurs if I run it in a single process.

arq5x commented 9 years ago

Are you still seeing this behavior?

lbeltrame commented 9 years ago

Are you still seeing this behavior?

Unfortunately I had a deadline so I just switched everything to single-core to get the results out. Hopefully I'll try to reproduce next week.

lbeltrame commented 9 years ago

Upon some investigation, I noticed that my grabix version is not up to date (0.1.2), which may be the cuase of these issues.

arq5x commented 9 years ago

Any updates on this - were you able to attribute it to grabix? I suspect that was the cause, as there was a bug fix for this in the last release of grabix

lbeltrame commented 9 years ago

Any updates on this - were you able to attribute it to grabix? I suspect

My guess is that's the case. Sorry for the noise!

brentp commented 9 years ago

This seems to be related to having a vcf.gz and a stale vcf.gz.gbi . I can't see checks in the code for when .gz is newer than .gbi . For now, in all cases I've found, it is resolved by removing the .gbi file. Though if I understand the issue correctly, we should have a fix shortly.

arq5x commented 9 years ago

Closing.