brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0

smoove germline validations with Genome in a Bottle NA24385 #16

Open chapmanb opened 6 years ago

chapmanb commented 6 years ago

Brent; Thanks again for all the help getting smoove working. I've got a branch that includes it and have been running validations comparing to the lumpyexpress based approach using the Genome in a Bottle NA24385 germline callset. Here are my numbers so far:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv#smoove-validation

smoove does similarly to the custom lumpyexpress plus svtyper approach we had in bcbio earlier, so it looks good from that perspective. I also compared against manta on this dataset and found that we're not as sensitive, especially on insertions. Relaxing the comparisons to not require matching genotypes recovers a lot of insertion sensitivity, but adds a ton of false positives.
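
As a quick sanity check on the call sets, I've been tallying calls by type with something along these lines (the VCF name here is just a placeholder):

# tally calls by SVTYPE from a smoove output VCF (the filename is a placeholder)
bcftools query -f '%INFO/SVTYPE\n' NA24385-smoove.genotyped.vcf.gz | sort | uniq -c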

I wanted to sanity check whether these results match your expectations and also brainstorm about whether we could improve sensitivity with svtyper. I'm psyched to have the improved specificity, as it will make lumpy outputs much more useful for folks, and want to keep iterating and improving on these now that we've got some automated tests in place. I'd also love to incorporate additional SV validation data, especially somatic calls, so please let me know if there are other truth sets you recommend and use.

brentp commented 6 years ago

Brad, this is great. Let's make smoove better.

IIUC, I can use this vcf (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/svanalyzer_union_171212_v0.5.0_annotated.vcf.gz)

but which bams are you using? is it these (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data_indexes/AshkenazimTrio/alignment.index.AJtrio_10XGenomics_bwamem_GRCh37_08142015) ?
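
If those are the right files, I can pull them down directly with something like:

# fetch the union SV truth set and the 10X bwa-mem alignment index (the index is a text file listing the BAM URLs)
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/svanalyzer_union_171212_v0.5.0_annotated.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data_indexes/AshkenazimTrio/alignment.index.AJtrio_10XGenomics_bwamem_GRCh37_08142015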

I have validated on the CHM1 sets and it looks very good, so I'd be interested to see what you find there.

brentp commented 6 years ago

That truth-set appears to have a lot of illumina-only calls. I bet those are actually noisy, and your evaluation is rewarding callers that make calls matching the noisy calls derived from illumina data. Once I get the BAMs, I will verify this. Or, if you could share your chr20 bam, I could start from there.

chapmanb commented 6 years ago

Brent; Awesome, thanks for helping to look at this. Feedback on the calls, GiaB truth set or bcbio parameters would all be very welcome. I forgot to link to the original workflow earlier but this is part of trying to build up standard validation pipelines we can hopefully start running regularly:

https://github.com/bcbio/bcbio_validation_workflows#germline-structural-variant-calling

I uploaded and added links to the BAM as well as the smoove and manta calls and validation outputs from merging/comparing with SURVIVOR:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv#smoove-validation
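
For reference, the SURVIVOR comparison step is roughly along these lines; the file names and thresholds here are illustrative rather than the exact values in the workflow (check SURVIVOR's usage message for the precise parameter meanings):

# merge a call set with the truth set; entries with SUPP=2 are supported by both inputs
# args: file list, max breakpoint distance, min supporting callers, require same type, require same strand, distance estimate, min SV size, output
ls NA24385-smoove.vcf giab-truth.vcf > sample_files.txt
SURVIVOR merge sample_files.txt 1000 1 1 0 0 20 merged.vcf
grep -c 'SUPP=2;' merged.vcf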

Thanks again for poking at these.

brentp commented 6 years ago

Hi Brad, I will have a look. One thing to note is that none of the software (lumpy/svtyper/etc) is tuned for 10X data, which has a different coverage profile than regular illumina. I am running on this now and will update this issue as I understand what is going on. Thanks for doing the evaluations and sharing the data.

brentp commented 6 years ago

Also, I see your bam was aligned with minimap2. Have you seen how SV calling varies between minimap2 and bwa mem?

chapmanb commented 6 years ago

Brent; Sorry, I know I'm a pain. This does try different tweaks from the standard pipeline, with 10x inputs and minimap2. So far I've been going more for relative performance compared to manta than absolute sensitivity/specificity. I'm trying to build out a somewhat diverse input set, but I'm happy to move back if you see anything key that causes issues. Practically, I'd love to know how to tune smoove to do a better job with these inputs, and also to have a more standard current bwa/NovaSeq run validation.

brentp commented 6 years ago

Not a pain at all. I'm glad to have you looking at it. My evaluations with the CHM1-CHM13 pseudo-diploid on hg38 show that smoove finds 1583 ostensibly "true" deletions at 0.9 reciprocal overlap and lumpy finds 1588. Then smoove finds 1236 deletions that are not in the truth set and lumpy finds 1319. Those numbers are actually checking the genotype, but I haven't compared to manta. I'll do some more digging.
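
Genotype checking aside, the 0.9 reciprocal overlap counts are the kind of thing bedtools intersect gives you, e.g. (a sketch; the BED names are stand-ins for the actual call and truth sets):

# deletions matching a truth entry at 0.9 reciprocal overlap (file names are placeholders)
bedtools intersect -a smoove-dels.bed -b truth-dels.bed -f 0.9 -r -u | wc -l
# deletions with no truth match at that stringency
bedtools intersect -a smoove-dels.bed -b truth-dels.bed -f 0.9 -r -v | wc -l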

brentp commented 6 years ago

I found quite a few where there is a drop in coverage (down to zero) but no real SV signal other than that; I am skeptical of some of them. I did find one that smoove should find, so I'll start there: missed-chr20-7699050

Looking in IGV, it has both split-read and paired-end support, but there are no splits.

It is:

20      7699050 DEL007268SUR    N       <DEL>   .       PASS    SUPP=1;SUPP_VEC=1;AVGLEN=3830;SVTYPE=DEL;SVMETHOD=SURVIVORv2;CHR2=20;END=7702880;CIPOS=0,0;CIEND=0,0;STRANDS=+- GT:PSV:LN:DR:ST:TY:CO   ./.:NaN:0:0,0:--:NaN:NaN        0/1:NA:3830:0,0:+-:DEL:20_7699050-20_7702880
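
To check the support around that call I'm doing something like this (the BAM name is a placeholder; flag 1294 masks out proper pairs, unmapped or mate-unmapped, secondary and duplicate reads):

# count discordant pairs in the call region (the BAM name is a placeholder)
samtools view -c -F 1294 NA24385.bam 20:7699050-7702880
# count split reads (alignments carrying an SA tag) in the same window
samtools view NA24385.bam 20:7699050-7702880 | grep -c 'SA:Z:'
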
chapmanb commented 6 years ago

Brent; Thanks much for digging into these. The GiaB truth set incorporates both long and short read signal so there are definitely events we won't be able to get with short reads. We've tried to categorize that somewhat with manta:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv#illumina-detection-ability

and expect to detect ~30% of insertions, ~35% of small deletions and ~80% of larger deletions. So I hoped that the events manta finds and smoove does not might be a good starting point for the missing sensitivity. If you find events that look incorrect, that would be useful feedback for the truth set as well, since that's still a work in progress. Thanks again for looking at these.

brentp commented 6 years ago

Hi Brad, I think there are a lot of variants in the truth set that are not real. That said, I did find a few bugs in lumpy-filter and made some changes in smoove that might help. One thing you can try without any changes is to apply bcftools view -i "SU > 4" $vcf to get a stricter set of calls from smoove. I'm not sure what level of support you were requiring in your custom lumpyexpress setup, but this will reduce false positives at the cost of some sensitivity.
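
Concretely, that would look something like this (the file names are placeholders):

# keep only calls with more than 4 supporting reads (SU), then re-index; file names are placeholders
bcftools view -i 'SU > 4' NA24385-smoove.genotyped.vcf.gz -O z -o NA24385-smoove.su4.vcf.gz
tabix -p vcf NA24385-smoove.su4.vcf.gz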

I am attaching a gzipped binary with the code changes in case you are able to test.

I will work on recovering some of the deletions that manta finds. The 10X data does seem to generate a lot more spurious split and discordant reads, but this does not explain the low sensitivity relative to manta. Is there a docker container or some simple way I can go from zero to the evaluation that you have done in a short time? I have been looking at the variants > 200 bp by eye in IGV, but it would be nice to be able to make changes and see how they update your metrics.

smoove.gz

chapmanb commented 6 years ago

Brent; Thanks much for all this digging. We already apply the SU filter in bcbio, so we should have that extra level of specificity:

https://github.com/bcbio/bcbio-nextgen/blob/9056e960696775c229245d8bcc6b01c3cfd69df6/bcbio/structural/lumpy.py#L54

If you think it's removing too much sensitivity we could relax it, but we've found it useful for removing the noisiest parts of the output.

I will work on re-running with your updated smoove changes and report back. If you want to run it yourself, it is the NA24385-sv CWL workflow in this set of bcbio validations. Generally: install bcbio-vm, then run download_data.sh, then run_bunny.sh:

https://github.com/bcbio/bcbio_validation_workflows#general-instructions
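
In shorthand that is roughly the following; the NA24385-sv directory name is an assumption based on the workflow name, so check the repo layout:

# rough outline only; the workflow directory name is assumed
git clone https://github.com/bcbio/bcbio_validation_workflows
cd bcbio_validation_workflows/NA24385-sv
bash download_data.sh   # fetch the input data and references
bash run_bunny.sh       # run the CWL workflow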

That uses the bcbio-vc docker container with all the tools, so you'd have to update it with the smoove version you're testing, but otherwise everything you need (data, code and tools) should be there. Let me know if you run into any issues with that and I'm happy to make it easier/better documented.

brentp commented 6 years ago

Hi Brad, is there a way to do this without docker? I am working on a cluster. I tried:

bcbio_vm.py --datadir ~/Data/bcbio-vm install --genomes GRCh37 --tools --data

but it is trying to docker pull

chapmanb commented 6 years ago

Brent -- it can run without Docker but requires installing just the bcbio-vm wrapper (for running CWL tools) and then bcbio-nextgen separately. So first, just the bcbio-vm wrappers:

https://github.com/bcbio/bcbio_validation_workflows#install-bcbio-vm

Then install bcbio: no Docker is used; it will grab all the tools into a directory via bioconda. You don't need the genome data (since that is included with the CWL downloads), so you can pass --nodata. I added docs on how to do this:

https://github.com/bcbio/bcbio_validation_workflows#optionally-install-bcbio
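
The short version is something like this (paths are placeholders; the docs above have the exact steps):

# install bcbio-nextgen and its tools locally via bioconda, skipping genome data with --nodata; paths are placeholders
wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /path/to/bcbio --tooldir=/path/to/bcbio/tools --nodata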

I also swapped over the run_bunny.sh script to use an externally installed bcbio by default. Sorry about the confusion and let me know if you run into any problems at all.

brentp commented 6 years ago

Hi Brad, I have made a new release that should improve your metrics some. I haven't been able to get your setup going, but I have a nice setup for the CHM1-CHM13 pseudo-diploid and I am able to run the NA24385 sample and inspect results in IGV.

I'll continue to look into improving the specificity. There are more filters we can apply, it just takes a lot of time and testing to find them.

chapmanb commented 6 years ago

Brent -- thanks so much for this work. smoove 0.1.5 provides a great improvement in sensitivity, especially for insertions. I added it to the validations here:

https://github.com/bcbio/bcbio_validations/tree/master/NA24385_sv#smoove-validation

This is a clear improvement over what we had in bcbio with lumpyexpress and our own genotyping, and I'm keen to keep in step with the work you're doing, so I will swap over to use this. As you said, improving sensitivity and specificity is a continuous goal, so I'm happy to keep testing and improving as you have time. Additionally, if you want me to incorporate the CHM truth set into the validations we're running, I'd be keen to try to automate that as well.

Thanks again for all the help and work on this.

brentp commented 6 years ago

I think part of the reason manta does so well here is that it was used to help create the "truth-set" while lumpy was not. I dug into how the truth-set was created and found this: https://docs.google.com/presentation/d/1Ex1fvfD8tsWd9WQoaVYLxNd7EYsgJ1Yi8y2pnhh8sgs/edit#slide=id.p7 Do you have more info on that set?

Of course I can still use this set to improve our filtering and calling, but the relative performance of manta over lumpy--especially to this degree--is not something we've seen before.

brentp commented 6 years ago

Also, thanks for updating the analysis. I'll continue to improve the specificity.

brentp commented 6 years ago

FYI. New release should improve specificity: https://github.com/brentp/smoove/releases/tag/v0.1.6
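
The release should just be a static linux binary, so if you want to try it before a bioconda update, something like this ought to work (the asset name is assumed):

# fetch the release binary directly (the asset name is an assumption)
wget -O smoove https://github.com/brentp/smoove/releases/download/v0.1.6/smoove
chmod +x smoove
./smoove   # prints the available subcommands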

chapmanb commented 6 years ago

Brent; This is brilliant, thanks for all the additional work. I'll update the bioconda package and test this; I really appreciate all the work.

Regarding the truth set, your feedback on manta-specific calls would be very welcome. The Genome in a Bottle group is putting this together and has been having bi-weekly calls:

https://groups.google.com/forum/#!forum/giab-analysis-team

We'd be incredibly keen to help improve and generalize the truth set so it's better and less biased. It's all still a work in progress, so my goal in having this automated is to help identify and expose these kinds of issues. If you have the CHM truth set you've been using, I can also include that in the bcbio validation runs so we have a baseline for manta versus lumpy and know how best to improve the NA24385 callset.

Thanks again for all this work.