brentp / bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
https://arxiv.org/abs/1401.1129
MIT License
139 stars 53 forks source link

bwa-meth and samtools flagstat #42

Open anusurendra opened 7 years ago

anusurendra commented 7 years ago

Hi, I used bwa-meth to align rust infected wheat samples to the rust genome. However, when I ran samtools flagstat I get the follow output:

22806328 + 4861610 in total (QC-passed reads + QC-failed reads)
154024 + 1050636 secondary
0 + 0 supplementary
0 + 0 duplicates
22634029 + 4846318 mapped (99.24% : 99.69%)
22652304 + 3810974 paired in sequencing
11326152 + 1905487 read1
11326152 + 1905487 read2
22435338 + 0 properly paired (99.04% : 0.00%)
22467696 + 3780390 with itself and mate mapped
12309 + 15292 singletons (0.05% : 0.40%)
30982 + 232010 with mate mapped to a different chr
13721 + 50746 with mate mapped to a different chr (mapQ>=5)

I didn't know BWA-meth gave secondary alignments. Also, the total number of QC-passed reads + QC-failed reads (27480347) is greater than the total of the pair-end reads (26463278).

brentp commented 7 years ago

Hi, what is your question?

anusurendra commented 7 years ago

Apologizes Brent for not being clear,

When I was looking at the flagstat, I just found that the secondary alignment values weird. It seems that there are large portion have more than one alignment. But it may have to do with the fact that the sample has both wheat and rust. Thanks again for your patience and help. As a side note, I found BWA-meth is great and has better alignments than Bismark!!!

nchernia commented 6 years ago

I think this is due to the issue I just posted, #50

BWA sets as QC-failed when reads are soft-clipped, see https://bioinformatics.stackexchange.com/questions/2988/interpreting-0x200-flag-in-bwa-mem-alignments

I ran BWA without the -p flag but with the other flags and had no QC failed alignments.

nchernia commented 6 years ago

Actually it seems this is due to lines 361-390 in the master script.

tuuba06 commented 5 months ago

Hi, For DNA methylation analysis, I used bwa-meth to align reads from poppy samples to the poppy genome. When I run samtools flagstat I get the following output. There are fairly high QC-failed reads. How can I reduce QC-failed reads? Thanks. @brentp

36809078 + 31632003 in total (QC-passed reads + QC-failed reads) 26158 + 118693 secondary 0 + 0 supplementary 0 + 0 duplicates 34068600 + 31567323 mapped (92.55% : 99.80%) 36782920 + 31513310 paired in sequencing 18391460 + 15756655 read1 18391460 + 15756655 read2 33895914 + 0 properly paired (92.15% : 0.00%) 34013748 + 31383950 with itself and mate mapped 28694 + 64680 singletons (0.08% : 0.21%) 66884 + 108786 with mate mapped to a different chr 18024 + 44290 with mate mapped to a different chr (mapQ>=5)

brentp commented 5 months ago

Hi, you can try the --do-not-penalize-chimeras argument.

tuuba06 commented 5 months ago

Hİ, Thank you for your suggestion. I can try the --do-not-penalize-chimeras argument. QC-failed reads reduced. I am adding the same sample samtools flagstat output below. I am doing analysis with multiple samples. QC fail reads were quite high on 3 of my samples compared to my other samples. Does the --do-not-penalize-chimeras argument affect the accuracy of my data? What effect does it have? Should I use the --do-not-penalize-chimeras argument in my other samples as well? Can you advice? @brentp Thanks

68441081 + 0 in total (QC-passed reads + QC-failed reads) 144851 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 65635923 + 0 mapped (95.90% : N/A) 68296230 + 0 paired in sequencing 34148115 + 0 read1 34148115 + 0 read2 65152402 + 0 properly paired (95.40% : N/A) 65397698 + 0 with itself and mate mapped 93374 + 0 singletons (0.14% : N/A) 175670 + 0 with mate mapped to a different chr 68445 + 0 with mate mapped to a different chr (mapQ>=5)

brentp commented 5 months ago

Hi, without --do-not-penalize-chimeras, then bwameth does this: https://github.com/brentp/bwa-meth/blob/7c8e1eb2a3b17b1bc6ecbef62119bf422d131830/bwameth.py#L436-L444 so it find reads where there is a lot of soft-clipping. This can happen with BS-Seq and usually leads to lower quality results. You'll have to decide for yourself when to use the argument.