FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

MAPQ Filtering Post Alignment Using --local Parameter #646

Open allenloong opened 6 months ago

allenloong commented 6 months ago

Hi, I am currently using Bismark for my DNA methylation analysis and have a question regarding the post-alignment filtering process. Specifically, I'm using the --local parameter for alignment, which I understand allows for more flexible alignments.

My question is about the necessity and implications of filtering alignments based on their MAPQ scores post-alignment. In my case, is it advisable to filter out alignments with a MAPQ score less than 40? I am aware that such filtering can help remove low-quality or ambiguous alignments, but I am also concerned about potentially losing valuable data.

Could you provide guidance or recommendations on this? I would also appreciate any additional insights or considerations I should be aware of when deciding on MAPQ thresholds.

Thanks. Allen

FelixKrueger commented 6 months ago

Dear Allen,

I have to admit that I can't really offer any useful advice on filtering on MAPQ values in locally aligned data. We typically perform adapter/quality trimming, followed by no further filtering at all: we assume that poor-quality data has already been removed, and Bismark does not report perfectly multi-mapping reads anyway. Here is a blog post on the rationale for global alignments.

If you want to go down a local/filtering route, I assume general rules apply; see some considerations on MAPQ implementation here.
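
For illustration only, a post-alignment MAPQ filter could look roughly like the sketch below. It assumes the alignments are available as SAM text lines (for example, a BAM converted to SAM), and relies only on the fact that MAPQ is the fifth tab-separated column of a SAM record. The function name `filter_sam_by_mapq` and the default threshold of 40 (taken from the question above) are illustrative assumptions, not a Bismark feature or recommendation.

```python
def filter_sam_by_mapq(lines, min_mapq=40):
    """Yield SAM header lines plus alignment records with MAPQ >= min_mapq.

    `lines` is any iterable of SAM text lines. This is a hypothetical helper,
    not part of Bismark. Caveat: each record is judged on its own, so with
    paired-end data one mate may be kept while the other is dropped.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # A SAM record has 11 mandatory columns; skip malformed lines.
            continue
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        if mapq >= min_mapq:
            yield line


if __name__ == "__main__":
    seq, qual = "A" * 50, "I" * 50
    demo = [
        "@HD\tVN:1.0\tSO:unsorted\n",
        f"read1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t{seq}\t{qual}\n",
        f"read2\t0\tchr1\t200\t12\t50M\t*\t0\t0\t{seq}\t{qual}\n",
    ]
    for kept in filter_sam_by_mapq(demo, min_mapq=40):
        print(kept, end="")
```

Counting how many records survive at several thresholds, rather than committing to a single cutoff up front, gives a feel for how much data a given threshold would cost.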

allenloong commented 5 months ago

Dear Felix,

Thank you for your prompt response. I've carefully reviewed the two blog posts you referenced, which, actually, inspired me to implement post-alignment filtering. Utilizing SeqMonk, I noticed an increase in read counts in local mode. Interestingly, a comparison of alignments between local and global modes revealed a widespread increase in reads across the entire genome.

I’m not sure whether there is any rule for defining "outliers" in 2 kb window analyses. Additionally, I'm curious about the validity of using correlation as a metric to establish MAPQ thresholds when comparing local and global alignments.

Thanks, Allen
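
To make the correlation idea above concrete, here is a minimal sketch of one way it could be set up: count reads per 2 kb window for the global alignments, do the same for the local alignments at several MAPQ thresholds, and compute the Pearson correlation between the two count vectors for each threshold. The input format (tuples of chromosome, 1-based position, MAPQ) and all function names are assumptions made for illustration. Whether a high correlation with the global alignment is a sound criterion for choosing a threshold remains the open question in this thread; the sketch only shows the mechanics.

```python
from collections import Counter
from math import sqrt

WINDOW = 2000  # 2 kb windows, as in the analysis described above


def window_counts(alignments, min_mapq=0, window=WINDOW):
    """Count reads per (chrom, window index) for reads with MAPQ >= min_mapq.

    `alignments` is an iterable of (chrom, pos, mapq) with 1-based positions;
    this input format is an assumption of the sketch.
    """
    counts = Counter()
    for chrom, pos, mapq in alignments:
        if mapq >= min_mapq:
            counts[(chrom, (pos - 1) // window)] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    if n < 2:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def correlation_by_threshold(local_alns, global_alns, thresholds):
    """Correlate unfiltered global window counts with filtered local counts."""
    local_alns = list(local_alns)  # reused once per threshold
    global_counts = window_counts(global_alns)
    result = {}
    for t in thresholds:
        local_counts = window_counts(local_alns, min_mapq=t)
        # Compare over every window seen in either data set.
        windows = sorted(set(local_counts) | set(global_counts))
        x = [local_counts.get(w, 0) for w in windows]
        y = [global_counts.get(w, 0) for w in windows]
        result[t] = pearson(x, y)
    return result
```

Note that windows covered by neither alignment are ignored here; including all genome windows (mostly zeros in sparse data) would change the correlation values, so that choice should be made deliberately.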