BWA MEM mapping can produce no mapped reads to some contigs

cerebis / meta-sweeper

Parametric sweep of simulated microbial communities and metagenomic sequencing.

GNU General Public License v3.0

10 stars 0 forks source link

BWA MEM mapping can produce no mapped reads to some contigs #40

Open cerebis opened 8 years ago

cerebis commented 8 years ago

Using BWA MEM to map WGS reads to metagenomic contigs, it is possible to find no reads mapping to a small number of short contigs. This then results in inferring contig coverages of zero and an inability to observe SNV from pileups generated from these BAM files.

Using LAST to align an example contig to the originating set of community sequences, full alignments are reported for all 8 taxa in the community. However, all possess single nucleotide gaps (from 5 to 32 SNV) that must exceed the constraints of BWA.

Experimenting with constraints in BWA MEM, removing penalties for clipping and unpaired reads has a significant impact. Skipping mate rescue and pairing also results in reads being mapped, although less.

koadman commented 8 years ago

interesting finding on the limitations of BWA MEM. LAST is a heuristic as well and surely also has limits. I guess the question becomes what is the right tradeoff between compute time and accuracy for these alignments? It's possible to do very rigorous alignment inference with pair-HMMs for example, even using MCMC to integrate over the possible model parameters but I doubt the very large increase in compute time would buy us much improvement for the problems we're trying to solve.

If it turns out BWA MEM is good enough, I suppose we would just discard the zero coverage contigs?

cerebis commented 8 years ago

Pulling out all the stops, setting penalties to zero and skipping pairing/rescue produces mappings for all contigs but also systematically higher coverage estimates for all contigs. (delta µ=7.9, sigma=4.8, median=7.2).

Looking at the distribution of coverages and considering the abundance profile going into the simulation, there are 3 major populations. At 50x community sampling, this translates into depths: 8.96, 15.4, 18.3.

Plotting delta_cov, cov1 (regular options to BWA MEM) and cov2 (as above) and simply looking at the shape of the distribution -- where we should expect significant density/peaks around these 3 depths. It appears that the more permissive mapping is more accurate, as cov1 contains little to know suggestion of the dominant taxa.

cov.pdf

koadman commented 8 years ago

hmm, i'm not sure i know enough about how samtools pileup is calculating coverage to have a sense for what to expect in the coverage distribution, assuming we are still talking about closely related strains? If not then yes, I agree there should be distinct peaks, but if closely related, mapping uncertainty seems like it could come into play and if pileup decides to spread 1 read among 3 equally good references then perhaps the coverage distribution might not have peaks anymore?

cerebis commented 8 years ago

In this example (which is just what I am using in testing) both clades are generated from the same ancestor/donor sequence -- but with different trees and profiles. So, there is definitely overlap though alpha is not tiny -- within clade diversity (measured by ANI) is around maybe 90% (as a guess)

cerebis commented 8 years ago

Note, I am using bedtools to infer coverage.

cerebis commented 8 years ago

Just leaving this here for convenience. https://www.biostars.org/p/5165/