brentp / bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
https://arxiv.org/abs/1401.1129
MIT License
141 stars 54 forks source link

Mapping Rate From bwa-meth #59

Closed groverj3 closed 5 years ago

groverj3 commented 5 years ago

Hey Brent,

It's been a while since I used bwa-meth, but I've recently generated a bunch of WGBS data and I was comparing alignment stats from mapping with Bismark vs bwa-meth on our real data (some other mappers too, but I'm unlikely to proceed with those).

For one of the samples Bismark reported an alignment rate of 58.8% in its output log file. My understanding is that Bismark only considers an alignment for downstream mC calling if it is unique and both reads of a pair are aligned.

I aligned the same sample with bwa-meth and the samtools flagstat output indicates that 96.92% are properly paired (both mates mapped to the same chromosome, etc.). This seemed unrealistically high to me. I wasn't sure if samtools flagstat was including multimapped reads so I went ahead and ran samtools view -c -F 0xF00 -f 66. Which should equate to counting the alignments which are NOT supplemental, secondary, QC-failed, or duplicates (MethylDackel uses those same ignore sam flags), and ARE the first mate of a pair, and are aligned in a proper pair. This resulted in an alignment rate of 96.5%. Still suspiciously high.

How would you recommend calculating the alignment rate after mapping with bwa-meth? Or is it really that much better?!

Best, Jeff

astatham commented 5 years ago

With bwa-meth you need to mark duplicates separately - see https://github.com/brentp/bwa-meth/issues/54

groverj3 commented 5 years ago

Not what I'm asking about. Neither does Bismark, you have to remove them after mapping just like bwa-meth. That is the "alignment rate" pre-deduplication. The next step in my pipeline is marking duplicates with Picard.

Alignments should be marked as secondary, etc. after alignment, correct? And filtering on those flags in the sam/bam file should get me the uniquely aligning reads and therefore allow me to calculate a mapping rate similarly to the way Bismark reports one. Unless I'm mistaken.

I'm just concerned that a mapping rate that high means I'm missing something obvious.

brentp commented 5 years ago

are you running bismark with bowtie2? If so, I'm suprised the difference is that high. If you are running with bowtie 1, then it's not that suprising. You might do some stats on the bams to see how high the split-read rate is and check for other odd stats. bwa mem will try very hard to map reads, even if they shouldn't be mapped, having lots of split reads can be a sign of that.

groverj3 commented 5 years ago

My thoughts exactly on the bowtie2 thing, I was very surprised the difference was that high. I did notice that if I include the non-uniquely mapping reads (no "best" alignment) Bismark's alignment rate is higher, and closer to my bwa-meth rate at 90.2%. I thought there might be some fundamental difference in how bismark decides something has a "best" alignment (more conservative) vs whatever bwa-mem is doing under the hood. Or, I'm just not filtering the bam correctly (simplest solution to such things is usually user error!).

I'll check out some other stats as you suggest. Split read/discordant alignment rate would probably be informative.

groverj3 commented 5 years ago

I believe I have this figured out.

For some reason, I think reads with a MAPQ of zero were included in samtools flagstat output. Maybe flags were not being set on them, although according to bwa-mem's MAPQ scoring scheme they should be multimapping reads with no "best" alignment.

I thought Felix Krueger's posts on QCFail might have something to do with this: https://sequencing.qcfail.com/articles/soft-clipping-of-reads-may-add-potentially-unwanted-alignments-to-repetitive-regions/ and https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess/. So, I went ahead and plotted the MAPQ scores from my aligned libraries.

image

However, I don't agree with him that it's necessarily that soft-clipped alignments are really a problem. As those only were present at 0.3% of total alignments.

Devon Ryan's MethylDackel uses a MAPQ score fliter of >= 10 for methylation calling. So, I instituted that same filter for calculating the mapping rate. Really, it should just remove multiply-mapping reads with very low confidence in a "best" alignment. I counted reads with the following:

samtools view -h -q 10 $bam_file | samtools sort -n | samtools fixmate - - | samtools view -c -F 3840 -f 66

From this it looks like the mapping rate for proper pairs (no secondaries, no chimeric, qc fails, no duplicates) with a MAPQ >= 10 is 55.5%. This is pretty close to Bismark's post-deduplication rate of 56.7%.

groverj3 commented 5 years ago

A bwa-meth + picard MarkDuplicates + MethylDackel extract pipeline still runs in about half the time compared to Bismark, so I'm going to continue to use it. I just figured I'd update you in case others get confused about mapping rates. Or if you wanted to update anything in regards to how alignments get flagged in the output, in case you are planning on doing further updates.

PanosProv commented 11 months ago

@groverj3 Thanks so much for your post, I was struggling to justify the extremely high mapping rates produced by BWA mem. Still things to figure out but this good info :)