Closed amizeranschi closed 5 years ago
Hi @amizeranschi!
I think joint genotyping of many samples and ensemble calling are two different stories. You either do joint genotyping for > 100 samples with gatk or freebayes, or combine callers for a few samples maximum. With many variant callers you can tune precision/sensitivity when you are working with individual samples or trios. Joint calling is for very large cohorts - here you are using just one tool to call, and then combine calls with joint genotyping.
Sergey
Hi Sergey,
Thanks for your answer, but I'm not sure I understand 100%. Why can't we have both joint genotyping and creating an ensemble VCF file as well, at the end?
Would it not make sense to run joint genotyping with two callers (e.g. gatk and freebayes), produce a joint VCF file from each caller, and then create a consensus (ensemble) VCF file from those single-caller VCF files? Wouldn't this be similar to what was done in the past with ensemble calling on pooled samples, before joint genotyping was introduced for large cohorts (>100 samples)?
I forgot to add, this setup (joint VC with ensemble VCF creation at the end) is already working in bcbio when not using CWL. The errors reported here only seem to happen when using CWL.
The only apparent problem with joint genotyping + ensemble calling (when not using CWL) is that the VCF flag `CALLERS` also lists the joint callers along with the variant callers (e.g. `strelka2-joint` along with `strelka2`), as I mentioned in https://github.com/bcbio/bcbio-nextgen/issues/2688#issue-412394726. It would be great if that could be fixed as well, so that the joint callers don't get listed.
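Until that is fixed, the joint-caller entries could be filtered out in post-processing. Below is a minimal sketch, assuming `CALLERS` holds a comma-separated list of names and that every joint caller name ends in `-joint` (as `strelka2-joint` does). The helper `strip_joint_callers` is hypothetical and not part of bcbio, and `gatk-haplotype-joint` in the usage line is assumed by analogy.

```python
def strip_joint_callers(callers_value):
    """Drop joint-caller names from a comma-separated CALLERS value.

    Assumption: joint callers are exactly the names ending in '-joint'
    (e.g. 'strelka2-joint'); this is a hypothetical helper, not bcbio code.
    """
    names = [name.strip() for name in callers_value.split(",") if name.strip()]
    kept = [name for name in names if not name.endswith("-joint")]
    return ",".join(kept)


cleaned = strip_joint_callers("strelka2,strelka2-joint,gatk-haplotype,gatk-haplotype-joint")
print(cleaned)
```

The original order of the remaining callers is preserved, so only the `-joint` entries disappear from the value.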
Thanks for this really great discussion. I agree with Sergey's assessment that joint calling + ensemble is not a current focus of bcbio. Practically, I'd recommend sticking with a single joint calling germline approach with GATK4 for a few reasons:
Is ensemble joint calling providing variants that a single GATK4 run is missing? If so, it would be nice to formalize this and figure out how best to support and improve it. Thanks again for the thoughts and suggestions.
Thanks for the reply. In past experiments I have found variants that seemed to be true positives and were missed by individual tools (incl. HaplotypeCaller) but found by others. An ensemble approach requiring e.g. 3 out of 5 tools produced a better set of high-confidence variants than any individual variant caller did. I'm afraid I don't have a reference or any more details about this at the moment, but this is the reason I was interested in the multi-VC, ensemble approach.
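To make the "3 out of 5 tools" idea concrete, here is a minimal sketch of a voting rule, assuming each caller's output has already been reduced to a collection of `(chrom, pos, ref, alt)` tuples. The function `ensemble_consensus` and the caller names `callerA` to `callerE` are made up for illustration. This is not how bcbio implements ensemble calling; it only shows the thresholding idea behind a setting like `numpass`.

```python
from collections import Counter


def ensemble_consensus(calls_by_caller, numpass):
    """Return the variants reported by at least `numpass` callers.

    calls_by_caller maps a caller name to an iterable of hashable variant
    keys, here (chrom, pos, ref, alt) tuples. Hypothetical helper for
    illustration only.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # set() ensures each caller contributes at most one vote per variant
        support.update(set(variants))
    return {variant for variant, votes in support.items() if votes >= numpass}


calls = {
    "callerA": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")],
    "callerB": [("chr1", 100, "A", "G")],
    "callerC": [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")],
    "callerD": [("chr1", 200, "C", "T")],
    "callerE": [("chr2", 50, "G", "A")],
}
high_confidence = ensemble_consensus(calls, numpass=3)
```

In this toy input only the `chr1:100 A>G` variant has three supporting callers, so it is the only one retained; the other two variants have two votes each and are dropped.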
Would the ensemble approach need additional validation when using joint calling instead of the old population calling? What exactly would this require?
I see your point about the runtime and scalability issues when calling variants on a large number of samples with multiple tools. However, I was hoping to leave this worry up to the researchers themselves. For those with enough time and resources on their hands, an ensemble approach might just provide a set of higher-confidence variants when identified with multiple tools, compared to putting all your "trust" into a single tool.
As a side question, what would be the limit in terms of number of samples, for using population calling instead of joint calling? The bcbio documentation states: "Joint calling is only needed for larger input sample sizes (>100 samples), otherwise use standard pooled Population calling."
Would bcbio using CWL be able to parallelize population calling for a large number of samples, using the two settings `nomap_split_size` and `nomap_split_targets`? What could be a good strategy for doing this? And would ensemble, multi-tool calling still work with standard population calling instead of joint calling?
Thanks for all this helpful discussion. I definitely agree that there is potential benefit to ensemble methods. So far we haven't had a good dataset with truth calls to demonstrate that it helps enough to overcome the investment in both computational time running multiple callers and development time in tuning and tweaking ensemble calling on large runs. Ensemble output also has the practical downside of having non-harmonized VCFs since the calls come from multiple callers, which requires additional work. We just haven't had the time yet to validate and confirm larger population ensemble calling with CWL; hope this helps explain the current state.
Pooled calling might work better practically for ensemble if you don't have very large sample sizes. It does suffer from the same scientific issue of not being tuned and optimized, since we've mostly focused ensemble testing on smaller pools where you don't have the benefit of informing from a larger sample population as part of the calling algorithm.
Sorry to not have this finished and fully validated, and thanks again for the discussion.
Thanks everyone! Looks like the suggestion to stick with joint calling with GATK is the way to go here. Please reopen if this isn't going to work for you all.
Hello,
I'm noticing some errors when running joint calling with multiple variant callers and ensemble mode. The input files are the same as here: https://github.com/bcbio/bcbio-nextgen/issues/2688#issue-412394726
The difference is in the variant caller setup and in using CWL with a local bcbio_nextgen instead of running things directly in bcbio_nextgen. I'm really interested in the ensemble multi-VC scenario with joint genotyping.
CWL ran fine with one VC (so far tested with HaplotypeCaller and Strelka2). When using both callers and enabling ensemble mode (numpass: 2), I get the following error:
When trying to run the same analysis with 4 callers ([gatk-haplotype, strelka2, freebayes, samtools]) and the corresponding jointcallers and ensemble mode (numpass: 3), the outcome is a different error: