bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Pipeline mistakenly saves GEMINI VCF files as .db #139

Closed lbeltrame closed 11 years ago

lbeltrame commented 11 years ago

I've noticed in the last run that the pipeline saved VCF files in the output directory with the .db extension, while I expected them to be the GEMINI databases produced after the run.

It turned out that, due to an install error, GEMINI was installed but couldn't run (some of its dependencies had to be reinstalled). This meant that it errored out, but the pipeline went ahead anyway.

In other words, GEMINI databases weren't created and at the same time the VCFs were mistakenly saved with the .db extension.

chapmanb commented 11 years ago

Luca; Thanks for the report. Do you know how exactly gemini was broken? If so, I can break a working instance and hopefully reproduce to try and track down the problem.

lbeltrame commented 11 years ago

> Thanks for the report. Do you know how exactly gemini was broken? If so, I

The local installation of bx-python was broken and couldn't find BigWigFile inside the package, so gemini started and quit with an exception immediately.

I guess you could just try to generate an exception on startup (even a SyntaxError would suffice, IMO). Notice that there are no errors printed, and the gemini command isn't even displayed in the command log.

I'm now rerunning the whole pipeline (after spending 3 days chasing VarScan bugs, *groan*) and will see whether this happens again or not.

lbeltrame commented 11 years ago

Nope, it's not that. I reran the pipeline and again gemini was not run and VCF files ended up as .db.

EDIT: It's also worth mentioning that the run failed midway and was restarted (the second time with no errors). I'll open a separate issue for the failure in case it occurs again.

lbeltrame commented 11 years ago

Well, it looks like my original assumption was totally wrong, and there is in fact a bug in the code.

At line 46 in variation/population.py:

gemini_vcf = os.path.join(out_dir, "%s-%s.db" % (name, caller))

I assume it is meant to be ".vcf" and not .db (this will be used for a GATK call), because then in prep_gemini_db we have (line 32):

if use_gemini and not utils.file_exists(gemini_db):  

but the file does exist due to the wrong filename, and so GEMINI isn't ever run.
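To make the collision concrete, here is a minimal sketch of what I think is happening. It is not the actual bcbio code: `file_exists` is a stand-in I wrote for `utils.file_exists` (assumed to mean "path exists and is non-empty"), `gemini_would_run` is a hypothetical helper, and I'm assuming the database path is built from the same `name` and `caller` with a `.db` suffix.

```python
import os
import tempfile


def file_exists(fname):
    """Stand-in for utils.file_exists (assumption: exists and is non-empty)."""
    return os.path.exists(fname) and os.path.getsize(fname) > 0


def gemini_would_run(out_dir, name, caller, vcf_ext):
    """Hypothetical reduction of the check: would GEMINI be invoked?"""
    # Path of the combined VCF; vcf_ext is ".db" in the buggy version.
    gemini_vcf = os.path.join(out_dir, "%s-%s%s" % (name, caller, vcf_ext))
    # Assumed path of the GEMINI database.
    gemini_db = os.path.join(out_dir, "%s-%s.db" % (name, caller))
    # The combined VCF is written first...
    with open(gemini_vcf, "w") as handle:
        handle.write("##fileformat=VCFv4.1\n")
    # ...and GEMINI is only loaded when the database is missing.
    return not file_exists(gemini_db)


with tempfile.TemporaryDirectory() as buggy_dir:
    buggy = gemini_would_run(buggy_dir, "batch1", "gatk", ".db")
with tempfile.TemporaryDirectory() as fixed_dir:
    fixed = gemini_would_run(fixed_dir, "batch1", "gatk", ".vcf")

print(buggy, fixed)  # False True
```

With the `.db` suffix, both paths are identical, so the freshly written VCF satisfies the existence check and the GEMINI step is skipped. With `.vcf` the two paths differ and the check behaves as intended.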

chapmanb commented 11 years ago

Luca; Thanks for finding this one. Apologies, I made this mistake while re-working the Ensemble variant calling that reuses part of the Gemini functionality. I'm working on testing this with a large population this week to try and iron out any other issues but please let me know if you find anything else. Apologies again and thank you.