bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Validation of large deletions calling in bcbio with HuRef benchmark compared to ERDS and CNVNator #2313

Closed naumenko-sa closed 4 years ago

naumenko-sa commented 6 years ago

Hello, Brad!

Thanks for the great pipeline!

Currently, among read-depth CNV callers only CNVkit is supported in bcbio. In our experience ERDS and CNVNator perform better than CNVkit. What would be your thoughts on including them? We could contribute some development effort from our side.

Thanks! Sergey

chapmanb commented 6 years ago

Sergey; Thanks for starting this discussion. We currently support TitanCNA as an alternative tumor/normal caller and are also planning to work on PureCN for tumor only calling. We've been considering incorporating the new GATK4 germline and somatic CNV callers.

As a more general discussion, what issues are you finding with the current CNVkit caller that ERDS and CNVNator improve on? What type of sample inputs are you dealing with (WGS, exome, tumor...). Do you have any test cases we could incorporate to help motivate both improvements to CNVkit as well as incorporation of new methods?

We're always open to including new tools but try to do this in a way where we cover gaps and understand the strengths and weaknesses of approaches. Thanks again for the discussion.

naumenko-sa commented 6 years ago

Hi Brad! We are mostly focused on the germline CNV calling using WGS. @bjtrost will post more details soon. Sergey

bjtrost commented 6 years ago

Hi Brad,

Although I have not tested or used it myself, my understanding from reading the CNVkit documentation and paper abstract is that it is primarily designed for performing tumour-normal comparisons using pooled references from capture data. It would also be useful to include CNV-detection algorithms that are designed for germline WGS data and that work on single samples (the "single sample" aspect is desirable because, in our experience, the use of pooled references can result in deletion calls where there are common duplications and vice versa). In a recent paper, we describe a workflow for germline CNV detection from WGS data. Part of this paper is an evaluation of six read depth-based CNV callers, and we identified ERDS and CNVnator as the best ones. Thus, I think it would be valuable to add ERDS and CNVnator to bcbio for the purpose of performing germline CNV calling from WGS data. Our paper is here:

https://www.sciencedirect.com/science/article/pii/S0002929717304962?via%3Dihub

Cheers,

Brett

etal commented 6 years ago

It looks like the main improvement ERDS and CNVnator offer over CNVkit is the segmentation algorithm. CNVkit uses CBS by default, which is reliable for panel and exome data, whereas those WGS-oriented callers rely on different segmentation strategies (CNVnator, for example, uses mean-shift).

Other recently developed callers like CLAMMS and GATK4 were not evaluated -- both of those use HMMs for segmentation, so I'd be interested to see how they stack up.

CNVkit also recently rolled out an HMM-based segmentation method, but it hasn't been evaluated thoroughly yet. Other users report better results with CNVkit's "flasso" (fused lasso) method over the default CBS, so if there is a certain WGS benchmark dataset that is relevant to a lot of users, I'd be happy to work with someone to tune CNVkit's segmentation to work better on that dataset.

naumenko-sa commented 6 years ago

My recent NA12878 WGS vs GIAB germline SV validation shows, unfortunately, that CNVkit is not very useful in this case: [plot: sv df-del]. I think that if CNVkit with "flasso" shows good results against the GIAB WGS benchmark at 30-50x, that would be convincing.

bjtrost commented 6 years ago

Hi,

Just wanted to note that CLAMMS is only for exome data ("Please note that CLAMMS is not intended to be used with whole-genome sequencing data"), and the CNV-detection functionality in GATK4 seems to be just for somatic, although they are testing functionality for germline (https://gatkforums.broadinstitute.org/gatk/discussion/11344/current-status-of-gatk4-germlinecnvcaller-tools-and-best-practices).

Cheers,

Brett

chapmanb commented 6 years ago

Brett, Sergey and Eric; Thanks for all this discussion and the pointers to the paper. This is great work; I'd read it but hadn't put together all the connections.

Is it possible to extract a couple of the test cases from that paper that we could use as a starting point for evaluating CNV calls on single-sample germline data? I'd really like to have this included in the bcbio validations so we can evaluate consistently. Sergey, is that validation done against the NA12878 calls? That's not a great test case for CNVs unfortunately, since it's all based on split/paired style callers (hence their better performance). I suspect other read depth-based callers would also not look great on small events or on specificity. I've been hoping to develop a better evaluation metric for this and would love to have it tie in with Brett's paper so we can compare/contrast to the work that is already done.

I'd be keen to explore the GATK4 in-progress HMM approaches or adjusting CNVkit as first passes once this validation is in place. Practically, ERDS has an academic-only license and is only for WGS, and CNVnator does not seem to be under active development. As you mention, CLAMMS is exome only. Since it's a ton of work to integrate and support methods for the long term, we'd ideally pick ones where we can work with the authors, as we currently do with Eric and CNVkit.

If adjusting CNVkit or integrating GATK4 would get us equivalent calls on some of your benchmarks that would be a best case outcome. What do y'all think about this as a plan? I'd be happy to hear your feedback and thoughts.

naumenko-sa commented 6 years ago

Brad, thanks for the clarification!

Yes, I've validated against the GIAB NA12878 calls - the standard SV validation in bcbio - so we need a benchmark that is not biased towards split/paired methods.

So the potential roadmap is to use one of the benchmarks referenced in Brett's article (HuRef, NA12878, AK1) to validate Eric's CNVkit in flasso mode and show that it is comparable to ERDS and CNVnator in Table 1 of Brett's article.

I'd be happy to run this.

However, @etal and @bjtrost, do you think it is worth using read-depth CNV calling methods for single-sample WGS, given that the best methods (ERDS and CNVnator) discover 51% and 44% of events in the best (filtered) case, with many false discoveries (59% and 32%; Table 1 in Brett's article), compared to split/paired methods with 80-90% of events discovered and 20-40% false positives (see my picture above)? Also, split/paired methods call events of 100-1000 bp, which is crucial.

Maybe the strength of coverage-based methods lies in tumour/normal and multi-sample analysis rather than in single-sample WGS, and here we should just use manta/lumpy/wham? Or do you think that coverage-based methods discover an orthogonal set of events, and that on an unbiased benchmark the coverage-based and split/paired methods would complement each other?

Sergey

chapmanb commented 6 years ago

Sergey -- thanks so much for the help with this, those plans sound perfect.

My take on the calling methods is that depth-based and split/paired methods are both useful and capture different events, hence are complementary. The NA12878 example above uses a lot of short-read split/paired methods as inputs, so it will miss some depth-only events and isn't meant to be fully comprehensive.

In general, short reads can only capture a subset of all the events. Genome in a Bottle has been working on a new truth set for NA24385 using more long-read technologies, and this is a comparison of methods like manta that tries to quantify this:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv

Having a germline CNV specific validation based on Brett's work would be an incredible help to compare and improve methods in bcbio. I've been working on a tumor/normal CNV one based on HCC2218 (https://github.com/Illumina/canvas#demo-tumor-normal-enrichment-workflow) and generally working on making common sets of validations available (https://github.com/bcbio/bcbio_validation_workflows).

Thanks again for this great discussion.

bjtrost commented 6 years ago

A few more comments:

(1) Re: "do you think it is worth using read-depth CNV calling methods for single-sample WGS, given that the best methods (ERDS and CNVnator) discover 51% and 44% of events in the best (filtered) case, with many false discoveries (59% and 32%; Table 1 in Brett's article), compared to split/paired methods with 80-90% of events discovered and 20-40% false positives (see my picture above)" - I don't think it is very meaningful to compare these numbers. The benchmarks used in the CNV paper were mostly orthogonal to the methods being tested, whereas the GIAB set (as you mention) is biased toward SR/PEM methods.

(2) I agree that the read depth callers and the SR/PEM callers are complementary, so it would be good to have both represented in bcbio. In particular, the read depth methods are better at calling larger variants, which are more likely to be clinically relevant.

(3) ERDS, like CNVnator, is no longer maintained.

(4) Another good source of benchmarks may be a recent paper by Charles Lee's group, currently on BioRxiv:

https://www.biorxiv.org/content/early/2017/09/23/193144

The authors kindly provided me with PacBio benchmarks for the children of the three trios described in this paper.

Cheers,

Brett

naumenko-sa commented 6 years ago

Hello Brad, Brett, and Eric!

I’ve validated deletions using the HuRef (Venter) dataset from Brett’s 2018 article (the updated version including PacBio calls): 7,952 deletions after bedtools merge, with sizes in the range [50 – 171,021,482] bp. For comparison, GIAB’s NA12878 truth_DEL.merged.bed has 2,668 deletions in the range [50 – 139,620] bp.

The _evaluate_one function in structural/validate.py uses pybedtools' intersect to validate intervals. By default an overlap is reported if there is at least 1 bp of overlap (the equivalent of bedtools intersect -u -a test.bed -b truth.bed). For deletions, using a 50% overlap threshold would be more accurate, because event lengths are much greater than 1 bp, i.e. bedtools intersect -u -f 0.5 -F 0.5 -a test.bed -b truth.bed

Surprisingly, in my test it does not make a big difference - just 15 deletions out of 2,668 in GIAB - and it seems that pybedtools does not support the -f and -F parameters, so we continue with the default intersect method.
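For reference, a minimal sketch of the two comparisons on the merged BED files (file names match the commands further below; the difference between the two settings is the count of confirmed test calls):

# Default: any 1 bp of overlap with the truth set counts as a hit
bedtools intersect -u -a tool.PASS.DEL.merged.bed -b truth_DEL.merged.bed | wc -l
# Stricter: require 50% overlap of both the test and truth intervals
bedtools intersect -u -f 0.5 -F 0.5 -a tool.PASS.DEL.merged.bed -b truth_DEL.merged.bed | wc -l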

I ran bcbio SV detection on the HuRef (Venter) WGS data. For each caller I selected deletions which passed filters (except for CNVkit) and merged the overlapping ones with:

bcftools query -i 'FILTER="PASS" && SVTYPE="DEL"' -f '%CHROM\t%POS\t%END\n' tool.vcf.gz | grep -v GL  > tool.PASS.DEL.bed
bedtools merge -i tool.PASS.DEL.bed | awk '{print $0"\t""DEL_tool"}' > tool.PASS.DEL.merged.bed

I validated deletions with https://github.com/naumenko-sa/cre/blob/master/cre.sv.validate_bed.py and plotted the picture with https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/structural/validate.py. I changed the size intervals to match Brett’s tests.

[plot: sv_validation_venter_brett_bins]

We see that on the more accurate HuRef benchmark, which is not biased towards split-read SV calling methods, the best results are from MetaSV, with less than 60% of events detected in the best case. CNVkit, as a read-depth method, performs better for events longer than 10,000 bp and calls 50% of them.

So we have a lot of room for improvement in the ‘one genome’ use case.

@etal, do you think there is a way to push up the performance of CNVkit for this use case (not for tumor-normal)? If so, please provide instructions on how to run it.

@bjtrost, could you please comment on the comparison of Table 1 from your 2018 article with these results? (Because you have a >1 kb filter, a low-complexity filter, and 1+ / 2+ categories, it is a bit tricky to compare.) Do you think that in its league (>10 kb) CNVkit performs similarly (50%) to ERDS and CNVnator, better, or worse?

@chapmanb, do you have any ideas which tools I can test with the benchmark to potentially include in bcbio for 'one genome' use case?

Sergey

chapmanb commented 6 years ago

Sergey; Thanks for sharing this work, this is so helpful. We've also been working on a single genome test case with the NA24385 Genome in a Bottle truth set mentioned above, and have some numbers as part of incorporating smoove, the new lumpy-based workflow:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv#smoove-validation

This swaps over to use SURVIVOR for merging and validation to avoid needing custom bedtools intersections:

https://github.com/fritzsedlazeck/SURVIVOR/wiki/Methods-and-Parameter#4-merge-or-consensus-calling-from-multiple-sv-vcf-files
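For context, a SURVIVOR merge invocation looks roughly like the sketch below; the parameters (1,000 bp breakpoint distance, support from at least 2 callers, respect SV type and strand, 30 bp minimum size) follow the example on the SURVIVOR wiki and are not necessarily the exact settings bcbio uses:

ls manta.vcf lumpy.vcf wham.vcf > sample_vcfs.txt
SURVIVOR merge sample_vcfs.txt 1000 2 1 1 0 30 merged.vcf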

I'd like to add back in stratifications as well since that is really useful.

Are you able to share the HuRef benchmark and truth set? I'd love to incorporate this as an additional automated validation so we can continue to iterate and improve on this.

I'd definitely welcome ideas from Eric to improve CNVkit for this case and we can continue to improve sensitivity and specificity for the cases it should be able to tackle.

Thank you again for this great discussion and work.

etal commented 6 years ago

Thanks for sharing, @naumenko-sa, I'm happy to help tune CNVkit to perform better on single-sample germline WGS cases like this.

It's especially helpful if you can point to other read-depth-based tools that perform well on this data -- CNVkit breaks down copy number analysis into discrete steps that can each be swapped with better methods as they're discovered. In the case of CNVnator, the main difference is the segmentation algorithm, mean-shift versus CNVkit's default circular binary segmentation (CBS). The normalization step is not much different since there's no matched control or pool of control samples. ERDS uses SNP b-allele frequencies to improve its calls, whereas CNVkit's support for that is incomplete so far (etal/cnvkit#34).

In the latest version of CNVkit there are a couple of segmentation methods worth trying, in particular 'flasso' (fused lasso) and the new 'hmm-germline' method.
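As a minimal sketch (sample.cnr is a placeholder for the copy-ratio file CNVkit/bcbio produces), the method is swapped via the -m option of the segment command:

cnvkit.py segment sample.cnr -m flasso -o sample.flasso.cns
cnvkit.py segment sample.cnr -m hmm-germline -o sample.hmm-germline.cns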

In a couple months I might be able to point to some new functionality in CNVkit to perform specific tests for hemizygous and homozygous losses in germline samples. I'll be sure to let this group know when that lands.

naumenko-sa commented 6 years ago

Brad, I agree that having the HuRef SV benchmark in bcbio would be very useful. I'm not the owner of the dataset, and I've made a request to open it. Hopefully, it will be possible to release the raw input WGS data; I'll keep you posted. The truth set is already available as supplementary files here: http://www.cell.com/action/showImagesData?pii=S0002-9297%2817%2930496-2 File S1: HuRef CNV benchmark (GRCh37/hg19 coordinates). File S2: HuRef CNV benchmark (GRCh38/hg38 coordinates).

In the meantime, I have added ERDS/CNVnator results from @bjtrost's article to the validation picture. Same benchmark, same validation script, so we can compare now. [plot: sv_validation_venter_erds_cnvnator]

Read-depth methods (ERDS, CNVnator) start performing well at 1,000 bp, CNVkit at 10,000 bp.

Overall (regardless of deletion size), 93% of ERDS true-positive deletions and 77% of CNVnator's are captured by metasv/filtered, and 95% and 82% by metasv/unfiltered. In the [1,000, 10,000] bp bin it is 95% (ERDS) and 83% (CNVnator).

So I'm satisfied with MetaSV, and I think that using ERDS and CNVnator would not boost the sensitivity to 100%. However, you can see that the overall precision/sensitivity on the HuRef benchmark is much lower than on the GIAB benchmark, so there is room for improvement.

@etal, I think it would be crucial to improve CNVkit's performance in the [1,000, 10,000] bp bin, where ERDS and CNVnator perform much better. I'm trying to use the segmentation algorithms you suggested. My approach is very primitive - I've dumped the bcbio commands into a script: https://github.com/naumenko-sa/crg/blob/master/crg.cnvkit.sh, checked that the default segmentation works and produces the same result as in bcbio, and then altered the segmentation with -m flasso or -m hmm-germline. The default segmentation was very fast; I did not record the running time, 10 min or so. With flasso, segmentation ran over the 10-hour walltime and was killed. I submitted it again, but even if it finishes in 15-20 hours, it would be too slow to use by default. With -m hmm-germline I got past the segmentation step, but I'm stuck at segmetrics - it also takes over 10 hours. Please advise me on how to improve the running time.

I'm using cnvkit from bcbio/conda: cnvkit.py version 0.9.2.dev0

Sergey

naumenko-sa commented 6 years ago

@etal sorry, in my hands -m hmm-germline shows the same results as the default segmentation method. Flasso exceeded a running time of 2 days with 7 threads; unfortunately, it is too slow for practical use on WGS. [plot: small-del]

etal commented 6 years ago

@naumenko-sa Thanks for the details. It's suspicious that the 'hmm-germline' results were identical to the default 'cbs' method; do you think CNVkit may have run 'cbs' in both cases due to some bug?

For the 'flasso' and 'segmetrics' runtimes, it sounds like there's a bottleneck in the code that is worst on WGS data, specifically when a lot of segments are produced. WGS and exomes seem to require tighter significance thresholds (-t) to avoid over-segmenting. So a more stringent cutoff (e.g. -t 1e-6 for 'flasso', -t 100 for 'hmm-germline') might help avoid the bottleneck. You can watch the per-chromosome log messages for a couple of minutes to see if you're getting hundreds or thousands of segments, or very slow segmentation, and kill the process early if that's the case. (The 'flasso' method is not parallelized, since doing so triggers an obscure error related to memory-mapped files used by the underlying R package.)
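A hedged sketch of the re-runs with the stricter cutoffs suggested above (file names are placeholders):

# more stringent significance thresholds to avoid over-segmenting WGS data
cnvkit.py segment sample.cnr -m flasso -t 1e-6 -o sample.flasso.cns
cnvkit.py segment sample.cnr -m hmm-germline -t 100 -o sample.hmm-germline.cns
# confidence intervals on the resulting segments
cnvkit.py segmetrics sample.cnr -s sample.flasso.cns --ci -o sample.flasso.segmetrics.cns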

I'll profile the code to see if there's a hotspot shared by segment and segmetrics that I can speed up.

etal commented 6 years ago

Quarterly update: The segment and segmetrics commands are now significantly faster in the development version of CNVkit. For the flasso method, most of the time seems to be spent within R, and I don't think I can speed it up further, though -t 1e-6 should still help.

I'll see if I can run these benchmarks myself and troubleshoot the hmm-germline situation.

naumenko-sa commented 6 years ago

Hi @etal! Thanks for the update! I will try to run the test with the latest version. Unfortunately, you will not be able to run this yourself, as you don't have the input WGS data. The owners have not released it yet; I hope it will happen at some point, but for now I have to run it myself. I do have permission to share the results :)

naumenko-sa commented 5 years ago

Hi everyone!

Sorry for the long silence. Good news - the HuRef WGS fastq files have just been published: https://www.ncbi.nlm.nih.gov/sra/SRX5395595[accn], making those validations reproducible.

Some thoughts:

Sergey

chapmanb commented 5 years ago

Sergey; Thanks for continuing to look at this and for the great news about HuRef accessibility. We haven't done any specific work with MetaSV and the smoove/duphold update, so I'm not sure about the interaction. The outputs should be fairly similar (although hopefully more precise and sensitive), but I don't typically use MetaSV due to its speed, and instead focus on the prioritization approach when combining multiple callers. Thanks again for keeping these comparisons going.

naumenko-sa commented 5 years ago

Hi Brad!

I've traced (using bcbio-nextgen-commands.log) what happened when running the below config file (a fragment):

variant_regions: /path/regions.bed
sv_regions: /path/regions.bed
svcaller:
- lumpy
- manta
- cnvkit
- metasv
- wham
variantcaller: false

Roughly:

1. regions.bed cleaning
2. Coverage calculation for regions.bed with mosdepth and hts_nim_tools
3. cnvkit fed by the coverage information
4. smoove-lumpy-duphold
5. manta-duphold
6. wham
7. metasv combining calls from the 4 tools

Smoove helps lumpy make less noisy calls, and duphold adds coverage information to the VCFs called by lumpy and manta (breakpoint algorithms).
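For readers unfamiliar with duphold, a standalone invocation (outside bcbio, with placeholder file names) looks roughly like this:

# annotate SV calls with read-depth change evidence from the alignment
duphold -t 4 -v manta.vcf.gz -b sample.bam -f GRCh37.fa -o manta.duphold.bcf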

It looks like regions.bed affects only cnvkit, i.e. the other three tools call genome-wide.

Would it make sense to restrict calling in manta/lumpy/wham to the sv_regions too? Use cases: quickly calling SVs in a region without waiting for genome-wide calling; it might also serve as a test for this module.

If so, what would be the best place to add this?

Sergey

chapmanb commented 5 years ago

Sergey; Thanks for looking at this. The general idea behind not restricting calling tightly to the variant or SV regions is that you'd want to capture long-range events that span, or only partially overlap, the regions of interest. Lots of people are interested in fusions or other long-range events that only span part of the SV regions, so we don't want to lose these by being overly exclusive. In general it's really hard to call SVs effectively in capture regions, so we're trying to give ourselves the best chance to pull something out if it's there. Hope this helps explain the current logic.

naumenko-sa commented 5 years ago

Hello!

I revisited the SV validations after a year, adding Delly. See the results here: https://github.com/naumenko-sa/bcbio_validations/tree/master/huref_sv There is an improvement, but overall progress in SV calling is very slow.

I also created a PR: https://github.com/bcbio/bcbio_validations/pull/2

Sergey

chapmanb commented 5 years ago

Sergey; Thanks so much for keeping this up to date and for writing it up. This is really helpful. I agree with your conclusions that we're near the point of diminishing returns for short-read variant detection. We also haven't focused much on this in bcbio beyond the major update of incorporating smoove to do the lumpy calling. I appreciate you continuing to look at this and ensure it's tuned and validated.