bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

tumor/normal calling with vardict #756

Closed: rronen closed this issue 9 years ago

rronen commented 9 years ago

Hi Brad/Rory,

Running medium-to-high coverage (~150x) tumor/normal exome calling with multiple callers (varscan, freebayes, plus trying vardict for the first time) -- it looks like vardict is moving much more slowly than varscan or freebayes. I am also running this both with minimal bam-prep and with full bam-prep (for benchmarking), and it seems that with full bam-prep vardict shows (slow) progress, while with minimal bam-prep (dedup only) it's hardly moving at all (stuck on chr1 for >10 hours now).

Just wanted to check whether this is in line with what you've seen, and specifically whether I should avoid vardict with minimal bam-prep. The vardict documentation is pretty sparse...

*This is on an AWS m3.2xlarge (-n 8), with CPU at 60-99% and RAM full but not choking -- so I'd say reasonable resource usage.
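For context, here is a sketch of the relevant parts of my sample YAML (paths and names below are placeholders rather than my exact config; the full bam-prep runs switch realign/recalibrate on):

details:
  - description: patient1-tumor
    analysis: variant2
    metadata:
      batch: patient1
      phenotype: tumor   # matched normal gets phenotype: normal in the same batch
    algorithm:
      variantcaller: [vardict, varscan, freebayes]
      mark_duplicates: true
      realign: false       # minimal bam-prep: dedup only
      recalibrate: false   # the full bam-prep runs set these to gatk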

mjafin commented 9 years ago

Hi @rronen, VarDict can be slow depending on a few things, such as long (or repeat-rich) regions. It's best used with minimal prep (dedup only) as it does indel realignment internally. You could try increasing nomap_split_targets and see if it speeds things up?

As an aside, we're in the process of having it rewritten in Java (not my choice of language!) and have seen speed-ups of 2-5x. I don't have a timeline for the rollout, but it should be sooner rather than later.

rronen commented 9 years ago

Hi @mjafin,

Thanks! Great to hear it's being reimplemented in a more performance-oriented language ;) 5x speed-ups are something to look forward to!

Regarding minimal bam prep -- got it. Although it makes me wonder what's driving the runtime difference I am seeing, as the only difference between the runs is minimal vs. full bam prep (I can only think of external conditions related to the AWS instances... although they are the same type).

For tuning performance, any tips for setting nomap_split_targets would be most welcome (I'm working with 16-core instances on AWS). Some experience probably goes a long way here. The DREAM3 yaml has:

align_split_size: 5000000
nomap_split_targets: 100

(It looks like the default for nomap_split_targets is 200, so that value is probably in effect in my 'slow' runs. Do you have experience with much higher values, e.g. 500 or 1000?)
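For anyone landing here later, my understanding is that both keys sit under the algorithm section of the sample YAML; a minimal sketch (values illustrative only):

algorithm:
  align_split_size: 5000000
  nomap_split_targets: 200   # larger values give more, smaller segments per caller job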

Either way, thanks for the input. Closing the ticket as it seems this isn't much of an issue.

chapmanb commented 9 years ago

Roy; Increasing nomap_split_targets splits the segmentation into smaller pieces, so each one should take less time, but the overall processing time should be similar. However, it could also help narrow in on regions that are especially slow so we can dig into them more.

I'm confused as to why minimal/full would have a big difference in run time, as they should be pretty similar in terms of inputs. Are you using standard EBS for the disk, or provisioned IOPS? The former can have variable run times, especially on 2xlarges where you share the system with other users. That's the only other external factor I can think of.

Definitely happy to hear if you come up with any areas we can improve. Thanks for looking at this.

rronen commented 9 years ago

Brad, Miika,

Thanks. Yes, actually -- I am using standard EBS on 2xlarge machines. That gives me something to think about. I'll look into provisioned IOPS and 4xlarges.

About the overall compute time -- Miika seemed to suggest that VarDict might behave better with shorter regions, perhaps due to memory usage? Although, if it's streaming from samtools and internally realigning/evaluating haplotypes, then it probably has its own set 'window size' that it considers at a given time... which I guess wouldn't be affected by the size of the overall input region it's calling on.

So, until proven otherwise I will assume it's EBS/2xlarge misbehaving, especially since I am seeing much nicer behavior from VarDict on the DREAM3 dataset, which I ran on a 4xlarge (even with standard EBS).

Thanks! R.

rronen commented 9 years ago

Hi @mjafin,

A couple things to mention, and one question:

1) On benchmark datasets, vardict's performance (sensitivity & specificity) for both SNPs and INDELs consistently comes out on top among freely available (i.e. non-MuTect) tumor/normal callers. Perhaps excluding an ensemble approach, though it's actually not far from that either.

2) At the same time, its running time is consistently much slower than the other callers (varscan/freebayes), even when setting nomap_split_targets. This is especially true on real-life exome datasets with occasional high-coverage regions, etc.

All this to say, it'd be really great to have a "fast vardict" :)

Now the question: is the intent for the new (Java?) implementation to also be freely available like the current one (e.g. under an MIT or similar license)?

Thanks! -Roy

EDIT: also, any idea (even a very rough one) of when to expect it?

zhongwulai commented 9 years ago

Hi Roy,

Thanks for trying out VarDict. Great to hear that VarDict is coming out on top on your benchmark datasets.

Can you give me the time it took you to run an exome (# of cores and total time) and how you ran it? I just tested it today using our server and the same version of VarDict as on GitHub. In single sample mode starting from BAM files, using 8 cores against 35.8Mb of hg19 coding sequence at ~100x exome coverage, it took ~2.5 hrs/sample. In paired mode on 100x exomes, it took ~6 hrs/sample. And our servers are at least 5 years old, so if you use all 16 cores, the time should be about half of that. I'm surprised that it ran that slowly for you. I designed VarDict to run against a BED file, with memory linear in segment size and runtime linear in depth. In fact, one of the advantages of VarDict is that it can handle coverage up to millions of reads deep without downsampling and without significant performance issues.

For BAM files, it can take them straight after alignment; no de-dup or local realignment is needed, as it can do both on the fly. There is no need to split BAM files either. If you want to de-dup for hybrid capture based sequencing, supply the "-t" option and VarDict will de-dup on the fly, using an algorithm similar to Picard's. VarDict is not only good at calling indels, it also has the most accurate estimation of allele frequencies.

If you already have BAM files and have SNPEff installed, you can try the runVarDict_BAM.sh script, which uses 8 cores. You might need to modify it to reflect your SNPEff installation path, as well as the VCF files from dbSNP, ClinVar, and COSMIC. After that, the vcf2txt.pl script can convert the VCF files to a MAF-like format, with the additional capability of cohort-based filtering and annotation.

Again, thanks for using VarDict and please continue to let us know your suggestions.

Zhongwu

mjafin commented 9 years ago

Hi @rronen and @zhongwulai, I guess a distinction must be made between VarDict standalone and VarDict run in bcbio. VarDict is integrated into bcbio much in the same way Zhongwu describes the standalone usage above. When running in AWS, several factors (type of storage used, etc.) can affect VarDict's speed, but if you're simultaneously running FreeBayes and VarScan, the relative speed shouldn't change.

The Java version of VarDict is still in a private repository, but I'll let you know when we're ready to incorporate it into bcbio (I need to double-check that's still the plan).

rronen commented 9 years ago

Hi @zhongwulai and @mjafin,

Thanks for your replies.

Right, I'm using vardict in bcbio rather than standalone, so I have less fine-grained control over how it is run. But if it's integrated the way @mjafin describes, that's probably not the source of the performance issues.

I am on 16-core AWS instances (c3.4xlarge) doing paired tumor/normal calling at ~150x. I am not sure exactly how long it is taking (I'll dig around bcbio-nextgen.log to get exact timings), but it is on the order of days (~48 hours/sample), which is very different from what you observe.

I have a feeling this may somehow be related to the BED file. I am currently using bcbio's coverage_interval: exome option rather than explicitly passing in a file with the intervals, under the assumption that it is (at most) a super-set that includes all known exons. Perhaps this is misguided and is causing the performance issues? (@mjafin or @chapmanb: any input welcome!)

Thanks again, Roy

P.S. As per @mjafin's advice, I am setting nomap_split_targets (to 100 or 200).

mjafin commented 9 years ago

Hi @rronen, my understanding is that the coverage_interval option doesn't currently limit calling to any region but is used for certain GATK-related options (@chapmanb can probably chime in here). I thus believe that (without an explicit BED file) you're actually calling variants anywhere in the genome with some coverage, which slows down variant calling.

Do you have a BED file for the exome kit at hand? Try using one -- I'm pretty sure you'll get to the end quicker.

chapmanb commented 9 years ago

Roy; Miika is right on. It would help to set variant_regions to the BED file of your target regions; otherwise bcbio will call anywhere with coverage. Off-target regions with coverage are enriched for repeats, so you can end up with slower and less reliable calls. coverage_interval doesn't do any magic setting of regions, so using the BED file is definitely recommended. Sorry for the confusion: perhaps we should drop coverage_interval and instead infer it from variant_regions and actual coverage calculations?
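Concretely, something like this in the algorithm section of your sample YAML (the BED path is a placeholder; use whatever your capture kit provides):

algorithm:
  variantcaller: [vardict, varscan, freebayes]
  variant_regions: /path/to/exome_kit_targets.bed   # target regions from the kit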

Hope adding in the BED files helps sort this out.

rronen commented 9 years ago

Hi @chapmanb and @mjafin,

Thanks. I'm not even sure why I was under the impression that coverage_interval: exome had some magical behavior of limiting calling to all known exons. I will get the correct BED file for the exome kit and give it a spin.

Thanks again for clearing this up.

Roy