lbeltrame closed this issue 10 years ago
Luca; I'm open to options here if you have time to benchmark. I've done some benchmarking on the VCF manipulation, and the most problematic area in my experience was merging the multiple VCFs into a single final VCF. I addressed this with a custom VCF concatenate function:
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vcfutils.py#L182
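For readers who don't want to follow the link, the general idea of concatenating per-block VCFs can be sketched roughly as below. This is not the bcbio-nextgen code; `concat_vcfs` is a hypothetical helper, and it assumes the inputs are uncompressed, cover non-overlapping regions, are passed in genome order, and share the same header and samples:

```python
def concat_vcfs(in_files, out_file):
    """Concatenate uncompressed VCFs covering disjoint, ordered regions.

    Hypothetical illustration only: no sorting, no header merging and
    no validation is done on the inputs.
    """
    with open(out_file, "w") as out_handle:
        for i, in_file in enumerate(in_files):
            with open(in_file) as in_handle:
                for line in in_handle:
                    # Guard against a final line without a newline
                    if not line.endswith("\n"):
                        line += "\n"
                    if line.startswith("#"):
                        # Keep header lines from the first file only
                        if i == 0:
                            out_handle.write(line)
                    else:
                        out_handle.write(line)
    return out_file
```

Because nothing is parsed or sorted, this kind of approach is cheap, but it only works when the blocks are guaranteed not to overlap.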
Merging overlapping files is trickier and didn't seem to be a big performance hog, which is why I stuck with GATK for correctness. Happy to try other things if you end up finding an approach that is faster or more correct.
Luca; We've been evaluating options and are currently planning to investigate the new samtools bcftools, which has support for merging and evaluating VCF files:
http://samtools.github.io/bcftools/
It's a bit tricky from the build side since the location of executables changes (tabix is now in bcftools, bcftools is no longer part of samtools, bgzip is now in samtools), but I have some basic formulas in place. We'll work to handle things internally as gzipped/tabix-indexed files combined using bcftools and will report here as we make progress.
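As a rough illustration of the gzipped/tabix-indexed workflow, the commands involved could be assembled as below. This is a sketch and not what bcbio-nextgen does internally; the helper names are hypothetical, and it assumes `bcftools` and `tabix` are on the PATH and that the installed bcftools version supports `concat` with `-a` (allow overlaps, which requires indexed inputs):

```python
def bcftools_concat_cmd(in_files, out_file, allow_overlaps=False):
    """Build a `bcftools concat` command writing bgzipped VCF output.

    Hypothetical helper: it only builds the argument list and does not
    run anything.
    """
    cmd = ["bcftools", "concat", "-O", "z", "-o", out_file]
    if allow_overlaps:
        # -a lets inputs overlap, e.g. SNP and indel calls for the same
        # region; the inputs need to be bgzipped and indexed
        cmd.append("-a")
    return cmd + list(in_files)


def tabix_index_cmd(vcf_gz):
    """Build the `tabix` command that indexes a bgzipped VCF."""
    return ["tabix", "-p", "vcf", vcf_gz]
```

Each list can be passed to `subprocess.check_call`, running the concat step first and the indexing step on its output.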
Finalized in 0.7.8. See #294
While the GATK is quite useful for analysis and recalibration, on at least some occasions in the pipeline (for example VarScan 2 paired calling) it is called repeatedly just to join the SNPs and indels found by the variant caller.
This introduces quite a bit of overhead, as it's called once for each parallel analysis block, resulting in hundreds of calls in total.
In light of this post, this part (variant file joining) can probably be optimized as well.
If faster alternatives are available (I think vcflib provides some, but I didn't test their accuracy), they should be used instead to speed up what is basically a rather menial task in the pipeline.