bigdatagenomics / avocado

A Variant Caller, Distributed. Apache 2 licensed.
http://bdgenomics.org/projects/avocado/
Apache License 2.0
71 stars 42 forks source link

Add an implementation of Mutect #114

Open tdanford opened 9 years ago

tdanford commented 9 years ago

See this paper: http://www.nature.com/nbt/journal/v31/n3/abs/nbt.2514.html (and the supplementary information) for details.

jstjohn commented 9 years ago

Would it make sense to incorporate ideas from Strelka as well? Specifically ability to do indels? MuTect is the best out there now for mutations though. This would be a killer feature!!

http://www.ncbi.nlm.nih.gov/m/pubmed/22581179/

fnothaft commented 9 years ago

@jstjohn that's definitely on the roadmap! We should have the local assembly front-end reintegrated soon, which should give us a better ability to resolve INDELs than Strelka.

jstjohn commented 9 years ago

Is this patent application going to get in our way? I have some experience reimplementing Mutect, but then I found this: https://www.google.com/patents/WO2014036167A1?cl=en&dq=mutect+getz&hl=en&sa=X&ei=YBPLVJfoHsOtogT7zIGgBw&ved=0CB0Q6AEwAA

fnothaft commented 9 years ago

Reviving this thread, based on conversations with @jstjohn. Things that need to be implemented to be feature complete:

Read filters:

Likelihood model:

Variant filters:

I don't know the details about the patent issues. CCing @massie to comment.

jstjohn commented 9 years ago

The statistical model looks nearly complete so far. I am starting to pull @tdanford's mutect code into a fork of your common-filters branch @fnothaft. I didn't see a calculation for the log odds of normal not being heterozygous, but I just dropped that in, you made it nice and easy @tdanford :-) Currently I am working on writing tests for the algorithms.mutect.LikelihoodModel and algorithms.mutect.Mutect. Once I am done with that should I open up a PR to track progress and allow you all to comment/collaborate?

A few thoughts and questions so far:

  1. We could use the sample field of the data in the various mutect models to separate reads that are called "normal" from other reads, since those get treated differently. As I am looking at this, it is occurring to me that we might get super tangled with every mutect related function that treats normal and tumor differently would independently separate reads/AlleleObservations. This could become a mess to untangle later. It seems like the only other option would be to separate the reads in some kind of controller, and pass the appropriate reads to the underlying functions. The main algorithms.mutect.Mutect could do this and pass only appropriate reads to the underlying likelihood models, but then in the mutect spec the filters applied to normal and tumor reads differ a little. That seems a little convoluted to deal with.. Would there be special mutect versions of each of those filters that have to be differentially applied that also check for the sample being "normal" vs "tumor"? Maybe add an optional argument to each of these filter that specifies which sample it is applicable to, then we apply those filters twice, with different parameters for the normal vs tumor sample? This behavior could be masked for the general case where most algorithms using those filters wouldn't care to only apply them to a specific sample?
  2. It seems that we could either simultaneously calculate 1) log odds of a non-reference site being present in the tumor and 2) log odds of the normal not being heterozygous, or we could do the normal somatic classification in a post-processing step. Thoughts on which would be better/cleaner @fnothaft or @tdanford?
  3. There are some second-level filters that are a function of how many reads were discarded in base filtering. For example, if more than 30% of tumor reads fails either 1) the sum of mismatching base quality scores being over 100, or 2) 30%+ soft clipped. Getting at this number requires knowing the depth of the tumor data prior to applying just those two filters, and the depth of the tumor data after applying those two filters. How would you recommend tracking something like this @fnothaft?
    • as a side note these are another good example of a filter that only gets applied to the tumor sample. Although it seems a little odd here at first glance, it is actually conservative to only apply this filter to tumor. You allow noise to contribute to a normal looking heterozygous (reduces risk of selectively removing reads supporting a real heterozygous site in normal), but you reduce the number of noisy reads that could contribute to a tumor site being classified as a mutant.
jstjohn commented 9 years ago

Tests are progressing nicely. Maybe 50% done with testing/fixing the likelihood model. Found a few bugs that were a result of @tdanford getting tricked by Nature's poor equation formatting. When they meant $ f \frac{e_i}{3} $ it came out looking more like $ f^{\frac{e_i}{3}} $.

Work is here if you are curious pre-PR: https://github.com/jstjohn/avocado/tree/mutect_work

jstjohn commented 9 years ago

Just finished up some initial tests and opened the pr #167 to track progress.

jstjohn commented 9 years ago

Quick status update on mutect filters that are functions of the alleles/reads supporting the mutation in the tumor (where 90% of the special sauce lies).

Done/potentially needs testing:

TODO:

jstjohn commented 9 years ago

marking all complete, although more tests need to be written on everything other than the second, fourth, and final bullet points above. Basically the sum of mismatching bases needs to be properly calculated in the ReadExplorer code, the mate rescue status needs to be passed through, need to implement the end-cluster filtering based on the MAD/median, and the power-based strand bias filter.

jstjohn commented 9 years ago

TODO:

Other than that the filters are all in place, and the likelihood model is doing what it should be doing. I modified the code slightly so it should do something reasonable if it encounters an indel and you want to try out using the Mutect algorithm on indels. That is disabled by default.