bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
986 stars 354 forks source link

varscan variant calling failed at mpileup #382

Closed biocyberman closed 10 years ago

biocyberman commented 10 years ago

This is the command I extracted from error messages. When reruning the command manually with samtools 0.1.19, it gave "floating point exception" error.

  /usr/local/bin/samtools mpileup\
 -f /usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37wd/seq/GRCh37wd.fa\
 -d 1000\
 -L 1000\
 -m 3\
 -F 0.0002\
 -l /work_test_dup/varscan/12/EXOME_1-1-1-12_53646573_55248455-prep-variants-BATCH_1-raw-regions.bed\
  /work_test_dup/bamprep/EXOME_1/12/EXOME_1-1-1-12_53646573_55248455-prep.bam\
  /work_test_dup/bamprep/EXOME_2/12/EXOME_2-2-1-12_53646573_55248455-prep.bam\
  /work_test_dup/bamprep/EXOME_3/12/EXOME_3-3-1-12_53646573_55248455-prep.bam\
  /work_test_dup/bamprep/EXOME_4/12/EXOME_4-4-1-12_53646573_55248455-prep.bam 
 | grep
 -v
 -P '   0       $' > /work_test_dup/varscan/12/tx/tmpV1g4dF/tx/tmpTJQxKT/EXOME_1-1-1-12_53646573_55248455-prep-variants-BATCH_1-raw.mpileup
chapmanb commented 10 years ago

Thanks for the report. This looks like a bug in samtools, rather than bcbio. I don't see anything obviously wrong with the command line. You might want to try reducing down to a single small test case and then testing with the latest samtools release candidate to see if it still fails.

Generally I don't use samtools for calling since FreeBayes and GATK are as good with SNPs and much better with indels. I was going to re-evaluate this following the samtools 0.20 release but my current suggestion would be to use a different caller and avoid the problem altogether.

biocyberman commented 10 years ago

Thanks for your comment. It's true that samtools is not very well known as a variant caller. I am just curious to see how it does in connection with ensemble variant calling method you mentioned in your bcbio blog. I was testing it out for myself. I will not use samtools for variant calling anyway. However, here rises a problem in tesing ensemble: I don't see any combined final call set. I ran the run_tests.sh ensemble command. I then enabled caller key under ensemble of section of algorithm. But I got various errors for each caller. I guess you are aware of this, so look forward to the fixes. I the meantime, I will just combine the variants from different callers manually. Thanks

chapmanb commented 10 years ago

For ensemble calling the best approach is to use GATK UnifiedGenotyper, HaplotypeCaller and FreeBayes as the three inputs. samtools is worth revisiting when they release 0.20 but my general feeling is that you're going to go through a lot of effort trying to track down 0.19 bugs. Currently samtools 0.19 causes the ensemble approach to perform worse since the indel calls are so poor.

The caller input to ensemble calling is work in progress for a new simplified approach to ensemble calling. The required command line program is also available: https://github.com/chapmanb/bcbio.variation.recall This is still work in progress and not fully evaluated so I wouldn't recommend it on any production data. When finished it will definitely be documented and automatically available in bcbio-nextgen.

Thanks again for all the feedback.