bcbio / bcbio.variation.recall

Parallel merging, squaring off and ensemble calling for genomic variants

Using recall for somatic samples #9

Open dgaston opened 8 years ago

dgaston commented 8 years ago

Hi Brad,

I am assuming the recall jar uses optimized parameter settings for the variant caller based on bcbio settings. I've been looking at exploring incremental joint calling on tumour-only samples. I use parameter settings largely similar to bcbio's, but of course with some specific tweaks and threshold values for acceptable allele frequencies. Is it possible to tweak recall to also pass command-line parameters to the caller?

chapmanb commented 8 years ago

Daniel; Thanks for taking a look at bcbio.recall for cancer calling. You're right that this has primarily been tested with larger germline calling. The settings we use for recalling for FreeBayes are here:

https://github.com/chapmanb/bcbio.variation.recall/blob/master/src/bcbio/variation/recall/square.clj#L62

I don't know of existing workflows for large numbers of tumor-only samples, so I don't have great suggestions about setting parameters for that case. How many samples are you looking to call together? What caller were you targeting?

The best approach to tweaking this is probably to add an additional caller target (say, freebayes-somatic) that has the specific tweaks for somatic calling instead of germline. Happy to help with this if you have more specifics about the command lines you're looking to run. Thanks again.
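As a rough illustration of the extra-caller-target idea, here is a minimal Python sketch (recall itself is written in Clojure, and this is not its implementation). The target names, the `build_command` helper and the threshold values are assumptions made up for illustration; only the FreeBayes option names are real, and the values would need tuning against real data.

```python
# Hypothetical sketch: map caller targets to FreeBayes argument lists.
# This is NOT the settings table from square.clj; values are illustrative.

# Options shared by every target in this sketch.
BASE_ARGS = ["--genotype-qualities", "--strict-vcf"]

CALLER_TARGETS = {
    # Germline-style calling: diploid genotypes.
    "freebayes": BASE_ARGS + ["--ploidy", "2"],
    # Assumed somatic variant: pooled models and a low allele-fraction
    # threshold (0.02 is a placeholder, not a recommendation).
    "freebayes-somatic": BASE_ARGS + [
        "--pooled-discrete",
        "--pooled-continuous",
        "--allele-balance-priors-off",
        "--min-alternate-fraction", "0.02",
    ],
}


def build_command(target, ref_fasta, bam_files, region=None, extra_args=()):
    """Return the FreeBayes command line (as a list) for a caller target.

    extra_args lets a user append their own command-line tweaks.
    """
    if target not in CALLER_TARGETS:
        raise ValueError("Unknown caller target: %s" % target)
    cmd = ["freebayes", "-f", ref_fasta]
    if region:
        cmd += ["--region", region]
    cmd += CALLER_TARGETS[target]
    cmd += list(extra_args)
    cmd += list(bam_files)  # BAM inputs go last as positional arguments
    return cmd
```

For example, `build_command("freebayes-somatic", "ref.fa", ["s1.bam", "s2.bam"], extra_args=["--min-alternate-count", "3"])` returns a list that could be handed to `subprocess.run`. Keeping per-target defaults separate from user-supplied extras is one way to allow site-specific thresholds without editing the defaults.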

dgaston commented 8 years ago

Thanks Brad. Yes, tweaking by adding somatic-specific callers would probably be the best approach. I was looking at experimenting since I don't know how well incremental joint calling has been investigated in this space. I'd probably be looking at doing 48 samples or so in a run, although testing with more would be interesting. Of course, they are much smaller than full exomes since these are all from targeted panels. FreeBayes, VarDict, and Platypus are the callers already used, I believe? I have all three in my workflow, so testing those would be good.

chapmanb commented 8 years ago

I don't know of ready-to-run approaches to this, or validations to demonstrate how much it helps versus standard tumor/normal analysis. @brentp and @arq5x mentioned they were hoping to work on this with FreeBayes, so they might have some advice. FreeBayes is a good first target for this since it already handles both tumor/normal and pooled germline cases, and is sensitive and precise on germline calls.

For 48 samples, my suggestion would be to do a workflow like:

So I wouldn't try to do anything fancy like recalling, and then evaluate this versus standard tumor/normal with a caller like VarDict to see if you're getting improved resolution, especially of low frequency variants. I'd be very interested in hearing how the experiment turns out. Hope this helps and thanks for all the discussion.

dgaston commented 8 years ago

No problem. It seems like this is a somewhat overlooked area, so I'm happy to do some experimentation here. All of this is part of the pipeline construction/validation phases in preparation for clinical work. In our case we don't have matched normals. For clinical sequencing this typically isn't done, due to cost constraints coupled with working with smaller targeted panels and only reporting on a subset of discovered variants.

chapmanb commented 8 years ago

Daniel; Without matched normals, this is a somewhat different problem, since you're doing multi-sample calling while also trying to identify lower frequency variants in each sample. FreeBayes is a good target for doing this since it handles both: MuTect and VarDict do low frequency, but not populations. HaplotypeCaller does populations, but not low frequency. I'm not sure how best to set the parameters to get good sensitivity and precision in these cases. If you have any known truth sets, it would be worth exploring with those and a combo of the multi-sample and cancer options:

https://github.com/chapmanb/bcbio-nextgen/blob/4c57c0666e77b442013cb658a750b40afc466ea6/bcbio/variation/freebayes.py#L92

https://github.com/chapmanb/bcbio-nextgen/blob/4c57c0666e77b442013cb658a750b40afc466ea6/bcbio/variation/freebayes.py#L134
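For the truth set comparison, a minimal sketch of how sensitivity and precision might be scored for one parameter combination follows. The `Variant` tuple and `precision_sensitivity` function are hypothetical names, and exact matching on chromosome, position, ref and alt is a simplifying assumption: real comparisons usually need variant normalization first.

```python
from collections import namedtuple

# Hypothetical minimal variant key; assumes calls and truth are already
# normalized so that identical variants compare equal.
Variant = namedtuple("Variant", ["chrom", "pos", "ref", "alt"])


def precision_sensitivity(called, truth):
    """Score a call set against a truth set by exact variant match."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if called else 0.0
    sensitivity = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity}
```

Running this once per parameter set (for example, multi-sample options alone versus multi-sample plus cancer options) on the same truth set gives directly comparable numbers.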

I'd definitely have interest in hearing your results.

chapmanb commented 8 years ago

Daniel -- it would also be worth following this FreeBayes thread: ekg/freebayes#228. Erik and Brent are talking about more generalized approaches for handling multi-sample tumor calling.

dgaston commented 8 years ago

Thanks for the heads up, much appreciated.