bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

RFC: Revisit ensemble calling for paired tumor analyses #361

Closed lbeltrame closed 7 years ago

lbeltrame commented 10 years ago

Rationale for this issue: a few days ago the wet lab people were validating some of the mutations I had found, and while they correctly validated them, it turned out they weren't somatic, but germline.

The reason is that some loci were called by both callers I used (MuTect and VarScan), but MuTect (more stringent than VarScan) had rejected them as non-somatic variants. As I only evaluated the passing mutations after the pipeline (and since gemini uses --passonly), I completely missed that discordance.
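As a rough illustration of the cross-check that would have flagged these loci, here is a minimal sketch that lists variants passing in one caller while present but filtered in the other. The function names and the toy records are hypothetical, not part of bcbio. The sketch assumes plain tab-separated VCF lines, one ALT allele per record, and that a rejected call carries a FILTER value other than PASS.

```python
def parse_vcf_filters(lines):
    """Map (chrom, pos, ref, alt) -> FILTER value for each VCF record."""
    filters = {}
    for line in lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        filters[(chrom, int(pos), ref, alt)] = flt
    return filters


def discordant_calls(lenient, stringent):
    """Return variants that PASS in `lenient` but are filtered in `stringent`."""
    flagged = []
    for key, flt in lenient.items():
        if flt != "PASS":
            continue
        other = stringent.get(key)
        # Only report loci the second caller saw and did not pass.
        if other is not None and other != "PASS":
            flagged.append((key, other))
    return sorted(flagged)


# Toy records, made up for illustration only.
varscan_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.",
]
mutect_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tREJECT\t.",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.",
]

flagged = discordant_calls(
    parse_vcf_filters(varscan_lines), parse_vcf_filters(mutect_lines)
)
for (chrom, pos, ref, alt), reason in flagged:
    print(f"{chrom}:{pos} {ref}>{alt} passes in one caller, {reason} in the other")
```

Anything this reports deserves a second look before validation, since a PASS-only view hides the disagreement.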

Ensemble calling would be a nice alternative to prevent these kinds of issues. Perhaps there is no need for enhancements, but at least some rough documentation on how to proceed would help.

chapmanb commented 10 years ago

Luca; Sorry about the issue and missed calls. I definitely agree that Ensemble would be nice to help resolve these. Realistically we'll need to first have an evaluation dataset in place so we can tune/manage the Ensemble approach, but hopefully once that is finished we can look at this. I'll leave this open until we get time to work on it.

mjafin commented 10 years ago

@lbeltrame @chapmanb Have you seen this paper on somatic ensemble calling "Combining calls from multiple somatic mutation-callers"? http://www.biomedcentral.com/content/pdf/1471-2105-15-154.pdf

Should be interesting for us! I'm reading it now.

lbeltrame commented 10 years ago

Thanks for this, I missed that. I'll definitely read it!

mjafin commented 10 years ago

Here's another one that is related ("Comparing somatic mutation-callers: beyond Venn diagrams") http://www.biomedcentral.com/1471-2105/14/189

lbeltrame commented 10 years ago

Given that @mjafin has been doing some work on ensemble calling, perhaps we can try to draft a minimal starting configuration that people can improve upon?

(I'd test this myself but alas the head cluster machine is broken - disk died).

chapmanb commented 10 years ago

Luca; I'd like to take a deeper look at ensemble calling in cancer before "promoting" it with a working example. I just don't have a good enough sense right now of whether any of the SVM classification bits help, or whether just selecting calls made by 2 out of 3 callers will be good enough. Germline ensemble calling took a lot of work and testing to get right and I'd hate to promote something that is untested. I'm hoping to have more time to help with this after joint recalling and structural variant calling are in place. Miika, if you want to drive this forward definitely do: I don't want to hold things back.
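For reference, the simple "2 out of 3" selection mentioned above could look roughly like the sketch below. This is an untested-on-real-data illustration, not the bcbio ensemble implementation: `majority_vote`, the `min_callers` parameter, and the toy call sets are all hypothetical, and each caller's passing calls are assumed to be already reduced to (chrom, pos, ref, alt) tuples.

```python
from collections import Counter


def majority_vote(calls_by_caller, min_callers=2):
    """Keep variants reported as passing by at least `min_callers` callers."""
    support = Counter()
    for passing in calls_by_caller.values():
        # Use a set so a caller can vote only once per variant.
        support.update(set(passing))
    return sorted(v for v, n in support.items() if n >= min_callers)


# Toy passing-call sets, made up for illustration only.
calls = {
    "mutect": {("chr1", 100, "A", "T"), ("chr2", 50, "G", "A")},
    "varscan": {("chr1", 100, "A", "T"), ("chr3", 10, "C", "T")},
    "freebayes": {("chr1", 100, "A", "T"), ("chr2", 50, "G", "A")},
}

consensus = majority_vote(calls)
print(consensus)
```

Raising `min_callers` to 3 gives the strict intersection, which trades sensitivity for fewer false positives; which threshold is appropriate is exactly what an evaluation dataset would have to decide.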

mjafin commented 10 years ago

@lbeltrame my testing involved a lot of manual work to get the (germline) ensemble calling running on the somatic call sets, so it would be good to think first about how we want to implement and test any approach we come up with.

A few questions that come to mind:

lbeltrame commented 10 years ago

On Friday 27 June 2014 01:32:29, Miika Ahdesmaki wrote:

  • Do we want to use just the PASS/REJECT status from the callers or involve some machine learning based on observed tumour/normal REF/ALT depths?

I would go from "caller trust" for now. There's no consensus on the REF/ALT depths and the fractions, and I fear we might be opening a can of worms here. IOW, let's trust the callers (for now).

  • Do we want to use the custom parameters reported by the individual callers (like the extra parameters FreeBayes produces and the others don't) in a machine learning framework, and how to do the feature selection for these?

That would be nice to have, probably. But I don't recall what VarScan 2 and MuTect give us; I'm guessing SPV (somatic p-value, MuTect and VarScan). I can't recall any other parameters at the moment (I can't check a VCF right now).

Feature selection is another grey area.
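To make the idea of caller-specific parameters a bit more concrete, here is a small sketch of pulling selected keys out of a VCF INFO field so they could later be used as features. The helper names and the example INFO string are made up for illustration; which keys each caller actually emits would need to be checked against real output.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    result = {}
    if info in ("", "."):
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result


def caller_features(info, keys):
    """Return the requested INFO keys, with None for keys the caller omits."""
    parsed = parse_info(info)
    return {k: parsed.get(k) for k in keys}


# Hypothetical INFO string; the values are invented.
example_info = "DP=85;SOMATIC;SS=2;SPV=1.2e-05"
features = caller_features(example_info, ["SPV", "SS", "MISSING"])
print(features)
```

Keys missing from a given caller come back as None, which makes the "some callers report extra parameters and others don't" problem explicit before any feature selection is attempted.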

  • Which data sets to use for the training and evaluation? Chr19 from Dream is too limited

That's probably the biggest hurdle here. Are there any data sets not bound by data policies? I'm not sure there are, but I haven't looked very thoroughly.

lpantano commented 7 years ago

Hi

I am closing this because it seems an old issue. Come back if you find other issues or want to continue with this one.

cheers