FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Mapping Efficiency is low #641

Closed: AmrSaadeldin closed this issue 6 months ago

AmrSaadeldin commented 7 months ago

I am working with whole-genome bisulfite sequencing (WGBS) data from human samples. The data are paired-end, with approximately 2 billion reads per sample (1 billion forward reads and 1 billion reverse reads). To preprocess the data, I used Trim Galore to remove adapters, which completed successfully. I then performed the mapping using Bismark with the following command:

bismark --genome /storage/projects/amr/WGBS/referencegenome/ --parallel 8 --bowtie2 -1 Lcmerged_R1_val_1.fq.gz -2 Lcmerged_R2_val_2.fq.gz -o 95

In this process, I used a reference genome directory (referencegenome/) which contains both the unmodified reference genome in .fa format and a subdirectory named Bisulfite_Genome/. Inside Bisulfite_Genome/, there are two subdirectories: CT_conversion and GA_conversion. However, I'm uncertain if this directory structure is correct for referencing the genome in the analysis.
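The layout described above (FASTA file at the top level, plus Bisulfite_Genome/ with CT_conversion and GA_conversion inside) is what bismark_genome_preparation writes. A minimal sketch of a sanity check follows; `check_bismark_genome` is a hypothetical helper, not part of Bismark, and the accepted FASTA extensions are an assumption.

```shell
# Sketch: verify that a genome folder has the layout Bismark expects.
# check_bismark_genome is a hypothetical helper, not a Bismark command.
check_bismark_genome() {
    local genome_dir="${1%/}"   # strip a trailing slash, if any
    local status=0 found_fasta=no f sub

    # Look for at least one FASTA file at the top level
    # (assumed extensions: .fa / .fasta, optionally gzipped).
    for f in "$genome_dir"/*.fa "$genome_dir"/*.fasta \
             "$genome_dir"/*.fa.gz "$genome_dir"/*.fasta.gz; do
        if [ -e "$f" ]; then
            found_fasta=yes
            break
        fi
    done
    if [ "$found_fasta" = no ]; then
        echo "missing: FASTA file in $genome_dir" >&2
        status=1
    fi

    # Both converted-genome folders must exist.
    for sub in CT_conversion GA_conversion; do
        if [ ! -d "$genome_dir/Bisulfite_Genome/$sub" ]; then
            echo "missing: $genome_dir/Bisulfite_Genome/$sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (path taken from the command above):
#   check_bismark_genome /storage/projects/amr/WGBS/referencegenome/ && echo "layout OK"
```

This only checks that the folders exist; it does not confirm that the Bowtie 2 index files inside them are complete.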

While the mapping process was completed without errors, I encountered a challenge related to mapping efficiency. Across multiple parallel files and samples, I consistently observed a mapping efficiency of approximately 56%. Given that the sequencing data is expected to be of high quality and depth, I am seeking guidance on how to address and improve this issue.

Any insights or recommendations to enhance the mapping efficiency in this WGBS analysis would be greatly appreciated.

[Screenshot attached: 2023-12-03 at 19:23:42]

FelixKrueger commented 7 months ago

Hi @AmrSaadeldin

thanks for reaching out. As a comment, you should be aware that --parallel 8 will probably consume at least 24 cores and ~100GB of RAM. It is important that you do not exceed your system resources, as this may lead to some alignment threads getting killed by the OS, and it is easy to miss this. This doesn't mean that it happened in your case, I just wanted to make you aware of it.
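As a rough illustration of the numbers above, the figures quoted for --parallel 8 work out to about 3 cores and about 12.5 GB of RAM per parallel instance. The sketch below simply scales those per-instance values; they are an assumption derived from this reply (human genome, Bowtie 2), not measured values, and `estimate_bismark_resources` is a hypothetical helper.

```shell
# Sketch: scale the per-instance figures implied above
# (24 cores / 8 = 3 cores, ~100 GB / 8 = ~12.5 GB per instance).
# These per-instance values are assumptions, not measurements.
estimate_bismark_resources() {
    local parallel="$1"
    local cores=$(( parallel * 3 ))
    local ram_gb=$(( parallel * 25 / 2 ))   # 12.5 GB each, integer arithmetic
    echo "cores=${cores} ram_gb=${ram_gb}"
}

estimate_bismark_resources 8   # prints: cores=24 ram_gb=100
```

Comparing the result against what the machine or the cluster job actually provides before launching helps avoid alignment threads being killed silently.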

There are a number of trimming and mapping recommendations, which are summarised here; to do this effectively, you do need to know which type of sequencing kit you have used. We have also compiled a list of common reasons why the paired-end mapping efficiency may be low, available here: https://felixkrueger.github.io/Bismark/faq/low_mapping/

I'd also be happy to take a look at a FastQC (html) report of your data; you could also send me ~100K sequences (untrimmed) of your samples (gzipped) via email, I could then run a few quick checks?
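One simple way to produce such a ~100K-sequence sample is to take the first 100,000 records from each untrimmed file (four lines per FASTQ record). This is a minimal sketch; `first_n_reads` is a hypothetical helper, and the file names in the usage comment are placeholders.

```shell
# Sketch: write the first N reads of a gzipped FASTQ file to a new gzipped file.
# first_n_reads is a hypothetical helper; a FASTQ record is assumed to be 4 lines.
first_n_reads() {
    local in_fq="$1" out_fq="$2" n_reads="${3:-100000}"
    gzip -dc "$in_fq" | head -n $(( n_reads * 4 )) | gzip > "$out_fq"
}

# Usage (placeholder file names; use the same N for R1 and R2 so pairs stay in sync):
#   first_n_reads sample_R1.fastq.gz sample_R1.100k.fastq.gz
#   first_n_reads sample_R2.fastq.gz sample_R2.100k.fastq.gz
```

Taking the same number of records from the start of both files keeps read pairs matched, which matters for paired-end checks.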

AmrSaadeldin commented 7 months ago

Thank you so much for your fast response, much appreciated! I will send you part of the data and get back to you later today. Thank you so much.

AmrSaadeldin commented 7 months ago

Hi @FelixKrueger, I just sent you the details via email. Thank you so much for your efforts!

FelixKrueger commented 6 months ago

I believe this is all sorted. For the record, this was Accel-NGS (Swift) data, and with appropriate trimming in Trim Galore (--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10; https://felixkrueger.github.io/Bismark/bismark/library_types/) the mapping efficiency went up to 85%.
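For anyone landing here with the same library type, the resolution can be sketched as a re-trim followed by the original mapping command. The clipping options are written with Trim Galore's spelling (`--clip_R1`, `--clip_R2`, `--three_prime_clip_R1`, `--three_prime_clip_R2`). The untrimmed input names are an assumption inferred from the `_val_1`/`_val_2` file names in the question. The snippet only prints the commands so they can be reviewed before running.

```shell
# Sketch: re-trim with the clipping values quoted above, then map again.
# Input file names are assumptions inferred from the *_val_1 / *_val_2 outputs.
trim_opts="--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10"
trim_cmd="trim_galore --paired $trim_opts Lcmerged_R1.fq.gz Lcmerged_R2.fq.gz"

# Mapping command unchanged from the original post.
map_cmd="bismark --genome /storage/projects/amr/WGBS/referencegenome/ --parallel 8 --bowtie2 -1 Lcmerged_R1_val_1.fq.gz -2 Lcmerged_R2_val_2.fq.gz -o 95"

# Print the commands for review instead of executing them.
echo "$trim_cmd"
echo "$map_cmd"
```

The values apply to this kit only; other library types need different clipping, as described on the library types page linked above.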