Issue with Alignment Results Using Soft-Masked Genome in Bismark

Chanspace commented 5 days ago

I am currently conducting Whole Genome Bisulfite Sequencing (WGBS) data analysis using Bismark and plan to utilize a soft-masked genome, where all repetitive and low-complexity regions are marked with lowercase letters.

During the index generation step, I observed that the index created is consistent with the unmasked genome. However, I noticed a significant difference in the results during the alignment step, specifically in the number of uniquely aligned reads. It appears that tools like Bowtie2 ignore the soft-masking, treating the lowercase letters as uppercase during alignment.
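One way to sanity-check the observation above is to confirm that the two FASTA files really differ only in letter case, and to measure how much of the genome is soft-masked. The following is a minimal sketch, not part of Bismark or Bowtie2; the helper names `read_fasta` and `compare_ignoring_case` are hypothetical, and it assumes plain (uncompressed) FASTA text that fits in memory.

```python
from typing import Dict, Iterable, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence name: sequence}, preserving case."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def compare_ignoring_case(
    masked: Dict[str, str], unmasked: Dict[str, str]
) -> Tuple[bool, float]:
    """Return (identical apart from case?, fraction of lowercase bases in `masked`)."""
    same = masked.keys() == unmasked.keys() and all(
        masked[n].upper() == unmasked[n].upper() for n in masked
    )
    total = sum(len(seq) for seq in masked.values())
    lower = sum(1 for seq in masked.values() for base in seq if base.islower())
    return same, (lower / total if total else 0.0)


# Tiny in-memory example; real use would pass open file handles instead.
masked = read_fasta([">chr1 example", "ACGTacgt", "NNnn"])
unmasked = read_fasta([">chr1", "ACGTACGTNNNN"])
identical, masked_fraction = compare_ignoring_case(masked, unmasked)
```

If the check reports that the files are not identical apart from case, the two genomes differ in content (for example, hard-masked `N` runs or different assembly versions), which would be a more likely explanation for differing alignment counts than soft-masking itself.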

Is there a specific parameter or approach in Bismark that would allow me to achieve alignment results with the soft-masked genome that are comparable to those obtained with the unmasked genome? Any guidance or advice would be greatly appreciated!

Thank you!

FelixKrueger commented 4 days ago

To be perfectly honest, I don't know for certain whether Bowtie2 treats soft-masked genomes differently from unmasked genomes, but I don't think it does. (Google doesn't seem to know either: searching for "how does Bowtie2 treat soft-masked index" didn't yield any great insights.)

What would you like to achieve by soft-masking repeats?

Chanspace commented 3 days ago

I'm sorry, I may not have expressed myself clearly. What I actually want to know is how to ensure consistent detection rates when using unmasked and soft-masked genomes in Bismark. We have used soft-masked genomes in other omics analyses, so we hope to maintain consistency. However, when we compared unmasked and soft-masked genomes in WGBS data analysis with Bismark, the subsequent methylation detection rates still differed, even though the generated indexes are the same.
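If the goal is simply to take letter case out of the comparison, one option is to write an uppercased copy of the soft-masked FASTA and build the index from that, so both runs start from byte-identical sequence. This is a sketch under stated assumptions, not a Bismark feature: `unmask_fasta` is a hypothetical helper, and the file names in the usage comment are placeholders.

```python
from typing import Iterable, Iterator


def unmask_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with sequence uppercased and header lines left untouched."""
    for line in lines:
        if line.startswith(">"):
            yield line  # keep sequence names and descriptions exactly as given
        else:
            yield line.upper()


# Example on in-memory lines:
converted = list(unmask_fasta([">chr1 repeat\n", "acgtNNac\n"]))

# Typical file-based use (paths are placeholders, not real files):
# with open("genome.softmasked.fa") as src, open("genome.upper.fa", "w") as dst:
#     dst.writelines(unmask_fasta(src))
```

Streaming line by line keeps memory use flat for large genomes. Whether this changes the alignment outcome depends on how the aligner handles lowercase bases, which the thread leaves unconfirmed.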

FelixKrueger / Bismark, issue #705