Scalpel InDel calling support

mjafin commented 10 years ago

Looks like VCF support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/

Opening this ticket while I look into testing Scalpel and integrating it within bcbio; bear with me.

chapmanb commented 10 years ago

Miika; Thanks for the LCR update. It's great that removing them helped avoid issues. My takeaway from looking at them is that they should either be removed or annotated, since they're going to be a large cause of false positives, both from calling and comparison. Longer term, incorporating lobSTR (http://lobstr.teamerlich.org/) or further evaluating Scalpel or other realigners in these regions might let us do better, but this is a bigger research project.

I also fixed the VarScan scaling issue you reported to hopefully help with future evaluations at scale. Thanks again.

chapmanb commented 10 years ago

Miika; There is a new pre-print from the Cold Spring Harbor folks about reducing false positive indel calls, primarily using Scalpel. Beyond avoiding LCRs, especially A/T repeats, and protocol suggestions they have some filters for Scalpel high quality variants based on CHI2 and ALTCOV (Materials and Methods in 'Classifications of INDEL with calling quality based on the validation data of sample K8101' ):

http://biorxiv.org/content/biorxiv/early/2014/06/10/006148.full.pdf

The caller might already apply some of these as part of the standard process, but it's an interesting read in addition to the practical stuff. The 60x coverage for WGS indel detection is a good reference to have.

mjafin commented 10 years ago

Thanks Brad, will definitely give that a thorough read! I think their focus is on lower coverage data and I suspect (based on personal communication with the authors) that high(er) coverage regions can keep the microassembler from converging. One of the authors suggested downsampling the data, but I don't know how practical that would be.

In any case, I'm worried that with such a low sensitivity for the NA12878 sample (only 3k+) it really doesn't matter what the specificity is: the sensitivity is unacceptable. If you have the chance to run Scalpel on any NA12878 data it would be great, just to be sure I haven't botched something.

mjafin commented 10 years ago

Just another quick update, I managed to run ensemble calling from bcbio.variation on the combo of MuTect (SNPs only), FreeBayes and VarDict. I had to manually remove any variants that were REJECTed and also the 'normal' sample column.
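For anyone wanting to reproduce the manual clean-up step, here is a minimal Python sketch of it, not the actual commands used. It assumes a plain-text, tab-separated VCF, that rejected calls carry `REJECT` in the FILTER column, and that the normal sample column is literally named `normal`; the function name `strip_rejected_and_normal` and that column name are hypothetical.

```python
def strip_rejected_and_normal(vcf_lines, normal_name="normal"):
    """Yield VCF lines without REJECT records and without the normal sample column.

    normal_name is an assumed sample name; adjust it to the header of your file.
    """
    drop_idx = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start at index 9 of the header line.
            if normal_name in fields[9:]:
                drop_idx = fields.index(normal_name, 9)
        elif "REJECT" in fields[6].split(";"):
            # FILTER is the seventh column; skip rejected records entirely.
            continue
        if drop_idx is not None:
            del fields[drop_idx]
        yield "\t".join(fields)
```

It can be applied to an open file handle, for example `for out in strip_rejected_and_normal(open(path)): print(out)`, where `path` is whichever caller output needs cleaning.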

The SNP TP rate was up there with FreeBayes (857 vs. 856) and the FP rate was marginally higher than that of MuTect (102->106). For InDels, there were only 707 TP but only 14 FP!

Edit: I'll need to rerun all of this, with all the latest updates to all the tools, at some point and update the tables.
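To put the indel counts in perspective, precision can be computed directly from the TP and FP numbers quoted above. This is only a sketch of the arithmetic; sensitivity cannot be derived the same way because the number of true indels in the truth set is not given here.

```python
def precision(tp, fp):
    """Fraction of reported calls that are true positives."""
    return tp / (tp + fp)

# Ensemble indel counts quoted in the comment: 707 TP and 14 FP.
indel_precision = precision(707, 14)
print(f"{indel_precision:.3f}")  # prints 0.981
```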

lbeltrame commented 10 years ago

> The SNP TP rate was up there with FreeBayes (857 vs. 856) and the FP rate was marginally higher than that of MuTect (102->106). For InDels, there were only 707 TP but only 14 FP!

Impressive results! Can you share the parameters you used for bcbio.variation? My biggest issue with it was that I was not sure what to use.

(And Brad, I wonder if all the findings / results from these investigations could land in a blog post / guide for somatic calls)

mjafin commented 10 years ago

@lbeltrame I used pretty standard settings:

ensemble:
  classifier-params:
    type: svm
  classifiers:
    balance:
    - AD
    - FS
    - Entropy
    calling:
    - ReadPosEndDist
    - PL
    - PLratio
    - Entropy
    - NBQ
  format-filters:
  - DP < 4
  trusted-pct: 0.5
intervals: /ngs/oncology/analysis/external/icgc/dream/chr19_test/tumor-paired/work/align/tumor/2_2014-06-04_tumor-paired-sort-callable.bed
names:
- freebayes
- mutect
- vardict
prep-inputs: false

However, I suspect the output indels might be the intersection of the FreeBayes and VarDict calls, as I don't think any of the above annotations will be in the calls. Then again, I'm still trying to understand how the ensemble calling actually works.

This is still very much 'live' work and requires quite a bit of manual intervention (plus I don't know where the chr19 DREAM data could be pulled from without username/password). Our guys also haven't made the vardict code publicly available yet. I'm happy to write something up though once things settle down a bit.

lbeltrame commented 10 years ago

On Thursday 12 June 2014 07:14:39, Miika Ahdesmaki wrote:

> code publicly available yet. I'm happy to write something up though once things settle down a bit.

Thanks. It goes without saying, but of course I'm going to help on that too.

While I don't have many data sets available, we're starting to do some validations on varying allelic fractions data sets and we plan on validating the low fraction variants with droplet digital PCR, so we could (technically) be able to detect mutations in the 1% range (of course, with impure samples like tumors it's not that easy...).

To add to the discussion and provide some data from my experience on targeted resequencing:

- So far MuTect is the best on the SNP calling front, even for targeted resequencing: most of our successful validations come from it.
- VarScan suffers from insane strand bias, as I've mentioned, and I've seen it call SNPs as somatic when they were in fact germline (pyrograms don't lie ;). Only then did I realize that the same locus had been REJECTed by MuTect.
- For indels, we're doing some rounds of validation soon at varying fractions, so I'll make sure to let you know how it goes.
- You can actually validate mutations called with fractions lower than 5%; however, you need highly sensitive methods and must be prepared to deal with a lot of noise.
- Predicted fractions less than 10% are hard to validate even with pyrosequencing.

We've validated a MuTect-called SNP with digital PCR, with a predicted fraction of 3% (observed ~1%) and approximately 1000X coverage (although with a lot of effort). VarScan by default has a lower limit of 10%, IIRC.

chapmanb commented 10 years ago

Luca and Miika; Thanks for all this. This is great progress. I hope after we finish this structural variant calling work we'll have more time to help push this forward as well.

It's great to hear about the Ensemble results, and I suspect it's probably what Miika suggested: the variants called by two reliable callers. My longer term thinking for Ensemble calling is to reduce it to something simpler and rely on these types of heuristics. The SVM can recover a small fraction of additional variants but I'm not sure all the time spent and tuning is worth the gain. A simpler approach should be able to speed things up and make the overall process much cleaner. Thanks again.
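As an illustration of the simpler heuristic described above (keep variants reported by at least two callers), here is a minimal Python sketch. It is not how bcbio.variation implements Ensemble calling; the function names are hypothetical, and it compares raw CHROM/POS/REF/ALT values without normalizing indel representation, so the same indel written differently by two callers would not match.

```python
from collections import Counter

def variant_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF record lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records so each ALT allele is counted separately.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys

def called_by_at_least(callsets, min_callers=2):
    """Keep variants present in at least min_callers of the given call sets.

    callsets maps a caller name to an iterable of VCF lines.
    """
    counts = Counter()
    for lines in callsets.values():
        counts.update(variant_keys(lines))
    return {key for key, n in counts.items() if n >= min_callers}
```

With `callsets = {"freebayes": fb_lines, "vardict": vd_lines}` and the default `min_callers=2`, this reduces to the plain intersection of the two call sets.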

mjafin commented 10 years ago

> There is a new pre-print from the Cold Spring Harbor folks about reducing false positive indel calls, primarily using Scalpel. Beyond avoiding LCRs, especially A/T repeats, and protocol suggestions they have some filters for Scalpel high quality variants based on CHI2 and ALTCOV (Materials and Methods in 'Classifications of INDEL with calling quality based on the validation data of sample K8101')

We could possibly filter variants if CHI2 is higher than 10.8 as per the manuscript - I saw some performance improvement from this in the DREAM data (but it could be just overfitting). ALTCOV within the Scalpel VCF files is a bit of a misnomer, as it actually refers to the coverage of non-REF, non-ALT alternative alleles (third parties if you will).
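A minimal Python sketch of the CHI2 cut-off, for illustration only. It assumes the value is stored in the INFO column under the key `CHI2` and that records without that key should be kept; the function names `parse_info` and `passes_chi2` are hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict (flags map to an empty string)."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed

def passes_chi2(vcf_line, max_chi2=10.8):
    """Return False when the record's CHI2 exceeds max_chi2 (10.8 per the manuscript)."""
    info = parse_info(vcf_line.rstrip("\n").split("\t")[7])
    if "CHI2" not in info:
        # Assumption: keep records that carry no CHI2 annotation.
        return True
    return float(info["CHI2"]) <= max_chi2
```

Record lines can then be filtered with `[l for l in records if passes_chi2(l)]`, leaving header lines to be handled separately.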

chapmanb commented 10 years ago

To follow up on Miika's initial evaluation work we now have an automated whole genome evaluation for cancer data using DREAM challenge synthetic dataset 3 (https://www.synapse.org/#!Synapse:syn312572/wiki/62018):

https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/cancer-dream-syn3-getdata.sh
https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/cancer-dream-syn3.yaml

Here are results for MuTect, VarScan, VarDict and FreeBayes with the current development version:

This confirms all the observations above and gives us a practical dataset we can iterate and improve on. The plans are to try and improve filtering to help with false positives then begin testing Ensemble calling and other approaches to get a high quality final callset. It's exciting to see cancer calling continue to improve -- thanks to everyone for their help so far with this.

lbeltrame commented 10 years ago

Absolutely terrific! Thanks to everyone involved!

Luca Beltrame, Ph.D. Translational Genomics Unit, Department of Oncology, Istituto di Ricerche Farmacologiche "Mario Negri" IRCCS

zhaoming159753 commented 9 years ago

Hi mjafin, I have a question: can Scalpel be used on WGS (not WES) tumor/normal samples to detect somatic indels?

chapmanb commented 9 years ago

It does work on WGS samples as well, but is quite slow since Scalpel is primarily developed on exomes. We use it in cases where we need indels, but also recommend the VarDict variant caller included in bcbio which calls indels as well as Scalpel and runs much quicker. Hope this helps.

lbeltrame commented 9 years ago

On Friday 9 October 2015 07:32:12 CEST, Brad Chapman wrote:

> also recommend the VarDict variant caller included in bcbio which calls indels as well as Scalpel and runs much quicker. Hope this helps.

Can it be used as indelcaller? I found scalpel to be extremely slow even for targeted sequencing analyses.

chapmanb commented 9 years ago

Luca -- VarDict calls indels as part of the standard calling. You don't use it as indelcaller but as its own separate caller under variantcaller. We're still working on a post with the latest filtering results, but it does as well as MuTect on SNPs and equivalent or better than Scalpel on indels, so it is a good all-around caller:

https://github.com/bcbio/bcbio.github.io/blob/master/_posts/2015-10-05-vardict-filtering.md

bcbio / bcbio-nextgen

Scalpel InDel calling support #428