Add an implementation of Mutect

tdanford commented 9 years ago

See this paper: http://www.nature.com/nbt/journal/v31/n3/abs/nbt.2514.html (and the supplementary information) for details.

jstjohn commented 9 years ago

Would it make sense to incorporate ideas from Strelka as well? Specifically ability to do indels? MuTect is the best out there now for mutations though. This would be a killer feature!!

http://www.ncbi.nlm.nih.gov/m/pubmed/22581179/

fnothaft commented 9 years ago

@jstjohn that's definitely on the roadmap! We should have the local assembly front-end reintegrated soon, which should give us a better ability to resolve INDELs than Strelka.

jstjohn commented 9 years ago

Is this patent application going to get in our way? I have some experience reimplementing Mutect, but then I found this: https://www.google.com/patents/WO2014036167A1?cl=en&dq=mutect+getz&hl=en&sa=X&ei=YBPLVJfoHsOtogT7zIGgBw&ved=0CB0Q6AEwAA

fnothaft commented 9 years ago

Reviving this thread, based on conversations with @jstjohn. Things that need to be implemented to be feature complete:

Read filters:

MQ filter; this shouldn't be implemented as a read filter, per se. See below.
Base quality filter. In #154.
Soft clipping filter. In #154.
Mate rescue filter.
Mismatch quality sum filter.
Overlapping read agreement filter.

Likelihood model:

@tdanford how far along did you have the statistical model?
The low MQ read filter should be implemented at this stage. If you do it here, you eliminate the need to do a region join during the variant filtration stage.
The clustered position and proximal gap filters should be implemented by annotating genotypes during the "genotyping" stage. This eliminates two expensive region join-like operations.

Variant filters:

Strand bias. In #154.
Multiallelic site. In #154.
Proximal gap.
Poor mapping.
Clustered position.
Observed in control.
Something about POC.

I don't know the details about the patent issues. CCing @massie to comment.

jstjohn commented 9 years ago

The statistical model looks nearly complete so far. I am starting to pull @tdanford's mutect code into a fork of your common-filters branch @fnothaft. I didn't see a calculation for the log odds of normal not being heterozygous, but I just dropped that in, you made it nice and easy @tdanford :-) Currently I am working on writing tests for the algorithms.mutect.LikelihoodModel and algorithms.mutect.Mutect. Once I am done with that should I open up a PR to track progress and allow you all to comment/collaborate?

A few thoughts and questions so far:

We could use the sample field of the data in the various mutect models to separate reads that are called "normal" from other reads, since those get treated differently. As I am looking at this, it is occurring to me that we might get super tangled with every mutect related function that treats normal and tumor differently would independently separate reads/AlleleObservations. This could become a mess to untangle later. It seems like the only other option would be to separate the reads in some kind of controller, and pass the appropriate reads to the underlying functions. The main algorithms.mutect.Mutect could do this and pass only appropriate reads to the underlying likelihood models, but then in the mutect spec the filters applied to normal and tumor reads differ a little. That seems a little convoluted to deal with.. Would there be special mutect versions of each of those filters that have to be differentially applied that also check for the sample being "normal" vs "tumor"? Maybe add an optional argument to each of these filter that specifies which sample it is applicable to, then we apply those filters twice, with different parameters for the normal vs tumor sample? This behavior could be masked for the general case where most algorithms using those filters wouldn't care to only apply them to a specific sample?
It seems that we could either simultaneously calculate 1) log odds of a non-reference site being present in the tumor and 2) log odds of the normal not being heterozygous, or we could do the normal somatic classification in a post-processing step. Thoughts on which would be better/cleaner @fnothaft or @tdanford?
There are some second-level filters that are a function of how many reads were discarded in base filtering. For example, if more than 30% of tumor reads fails either 1) the sum of mismatching base quality scores being over 100, or 2) 30%+ soft clipped. Getting at this number requires knowing the depth of the tumor data prior to applying just those two filters, and the depth of the tumor data after applying those two filters. How would you recommend tracking something like this @fnothaft?
- as a side note these are another good example of a filter that only gets applied to the tumor sample. Although it seems a little odd here at first glance, it is actually conservative to only apply this filter to tumor. You allow noise to contribute to a normal looking heterozygous (reduces risk of selectively removing reads supporting a real heterozygous site in normal), but you reduce the number of noisy reads that could contribute to a tumor site being classified as a mutant.

jstjohn commented 9 years ago

Tests are progressing nicely. Maybe 50% done with testing/fixing the likelihood model. Found a few bugs that were a result of @tdanford getting tricked by Nature's poor equation formatting. When they meant $ f \frac{e_i}{3} $ it came out looking more like $ f^{\frac{e_i}{3}} $ .

Work is here if you are curious pre-PR: https://github.com/jstjohn/avocado/tree/mutect_work

jstjohn commented 9 years ago

Just finished up some initial tests and opened the pr #167 to track progress.

jstjohn commented 9 years ago

Quick status update on mutect filters that are functions of the alleles/reads supporting the mutation in the tumor (where 90% of the special sauce lies).

Done/potentially needs testing:

fewer than 3 tumor reads with insertions and also fewer than 3 reads with deletions, within 5 base pairs of the given allele (separate, not summed).

TODO:

MAPQ 0 fraction of reads in tumor or normal more than the MAX_MAPQ_0_FRACTION (default 0.5)
fraction of reads in tumor failing one of the following filters (>= 0.3 bases soft/hard clipped, mate rescue, sum of q-score of mismatching bases >=100) is over 0.3 of the reads that pass the other preliminary read-based filters
maximum mapq score of allele supporting the mutation must be at least 20
Median Absolute Deviation, and the raw median of the mutant allele location within reads cannot cluster too close to either the beginning or end of each read. Basically you calculate the median of the allele supporting reads, and the MAD of those. If the MAD is <= 3 and the median is <= 10, dump it. Similarly recalculate these numbers, but using the distance of each allele from the last base of each read, and apply the same filters. If either the forward-strand method, or reverse strand method gives you bad results, it fails.
If there are more than 0.015 normal reads supporting the mutation, check the sum of quality scores supporting the mutation in normal, be sure it is less than 20.
Strand bias filtering (this is not a standard fisher test). If the power to detect the mutation is at least 0.9 on a particular strand, calculate the LOD of the mutant on that strand, and make sure it is at least 2.0.
- The power to detect a mutant is a function of depth, and the mutant allele fraction (unstranded). Basically you assume that the probability of observing a base error is uniform and 0.001 (phred score of 30). You see how many reads you require to pass the LOD threshold of 2.0, and then you calculate the binomial probability of observing that many or more reads would be observed given the allele fraction and depth. You also correct for the fact that the real result is somewhere between the minimum integer number to pass, and the number below it, so you scale your probability at k by 1 - (2.0 - lod_(k-1) )/(lod_(k) - lod_(k-1)).

jstjohn commented 9 years ago

marking all complete, although more tests need to be written on everything other than the second, fourth, and final bullet points above. Basically the sum of mismatching bases needs to be properly calculated in the ReadExplorer code, the mate rescue status needs to be passed through, need to implement the end-cluster filtering based on the MAD/median, and the power-based strand bias filter.

jstjohn commented 9 years ago

TODO:

Determine if a site is a known DBSNP site within the MuTect genotyper. Any ideas for this @fnothaft?
Calculate the sum of phred scores for mismatching bases and pass that through. Is this easy to get at in the ReadExplorer?
Write more tests

Other than that the filters are all in place, and the likelihood model is doing what it should be doing. I modified the code slightly so it should do something reasonable if it encounters an indel and you want to try out using the Mutect algorithm on indels. That is disabled by default.

bigdatagenomics / avocado

Add an implementation of Mutect #114

Done/potentially needs testing:

TODO: