Extremely low mapping efficiency using bowtie2 and reference genome with many scaffolds

LauraRt1 commented 1 year ago

Hi! I'm I'm using Bismark 0.24.2 with my WGBS samples (paired-end reads and not trimmed with TrimGalore cause I'm not a fan of trimming sequences).

In genome preparation (bismark_genome_preparation) I'm using an assembly (with many scaffolds in the same fasta file and same genotype as my samples) and Bowtie2 as aligner, but the mapping efficiency is really low, just 20%.

Trying to solve this low efficiency I used bowtie2 out of Bismark (parameter --very-sensitive I had almost 55%) but it is still really low, isn't it?

Using bwa-meth the mapping rate is almost 95% but these SAM files are not compatible with bismark (I'm having troubles resolving the compatibility issue between the bwa-meth SAM files and Bismark).

I'd like to use Bismark for the entire workflow (from mapping to methylation extraction) but I don't know how to increase the mapping efficiency

Thanks!!

FelixKrueger commented 1 year ago

Thanks for reaching out.

By default, Bismark runs in end-to-end mode in a fairly stringent setting (--score_min L,0,-0.2) which makes it essential that you provide it with quality- as well as adapter-trimmed input data (it is somewhat insensitive towards your personal preferences in this regard). If you could yourself to try out trim_galore --paired file1 file2 (potentially in conjunction with somewhat more relaxed mapping parameters such as --score_min L,0,-0.4) you should see a much increased alignment rate. I you wanted to send me the sample FastQC profiles of your input files I could see if I can spot any measure that might be useful.

If you really don't like trimming at all you could use the local alignments (envoked using the option --local), which should behave very similarly to what bwa-meth does. As a word of caution though, local alignments will not remove low-quality base calls, and will most likely include incorrectly filled-in sequences that should really be removed prior to mapping (e.g. for the Accel Swift kit).

Very happy to assist you along the way, we can also continue this via email.

LauraRt1 commented 1 year ago

Thank you very much for your quick response!

I've been testing by extracting data from a pair of paired-end reads, 10,000 lines each (equivalent to 2,500 fastq) to get results as quickly as possible. From now on, all the results are from these fastq files of 10,000 lines (2500 fastq).

I trimmed my sequences with trim_galore, and then I aligned them with Bowtie2, changing the parameter --score_min L,0,-06. However, the mapping efficiency remains at 22%.

If I use trimmed sequences and add --very-sensitive-local it increases to 31%.

If I only use one read, fastq_1 (trimmed and --very-sensitive-local), the mapping efficiency increases to 76%. On the other hand, if I use only fastq_2 (trimmed and --very-sensitive-local), it is 55.9%. I used the --pbat parameter only on fastq_2, and it increases to 51.6%.

Anyway, a 50% mapping efficiency is still low. I don't understand why when I use both sequences together, it decreases to 20%.

Should I map all my fastq files as if they were single-end reads? Can I use the --pbat parameter with paired-end reads? I'm not sure how to proceed.

Thanks for your help

FelixKrueger commented 1 year ago

Hi Laura, if possible at all, could you send me a subset of the reads of say 100K reads for some quick tests on my side?

changchuanjun commented 9 months ago

Hello，are you here? I have met the similar diffcuilties as you, I am very confused, I want to know have you solved this problem? I am sincerely looking forward to your reply.

LauraRt1 commented 9 months ago

Hi changchuanjun! I solved the problem by using Trim-Galore and the parameter --score_min L0-0.6. I finally obtained a mapping efficiency between 70 and 75 %.

FelixKrueger commented 9 months ago

You can find a few ways of interpreting and mitigating low mapping efficiency issues here: https://felixkrueger.github.io/Bismark/faq/low_mapping/. If this doesn't solve your problems feel free to send me an email with as many details as possible.

changchuanjun commented 9 months ago

Hello LauraRt1 and FelixKrueger! thanks for your reply! In my data process pipeline,I used fastp to remove adapter, this is my cleaned data state fastq_state

actually,I used bismap to analyse my data firstly ,but the mapping efficiency is very low, just 0,very surprising, and my bismap command and result below bsmap__command bsmap_result

So I decided to change the software, But when I used bismark's paired mode to execute the script with the default parameter,the result was similar to bismap,the mapping efficiency was 0. I refered related issues and modified my script,I used bismark's single mode to run the script with one fq1.gz,the mapping efficiency result increased significantly from 0 to 54%,the command the result below.

bismark_command bismark_result

meanwhile,I used bismark's single mode to treat fq2.gz with the parameter --pbat,the mapping efficiency result was 57%,similar to fq1.gz

I am confused why so big difference between the paired mode and single mode in biamark mapping efficiency? Did this result mean the sequencing data have some error? How to combine the result of fq1.gz and fq2.gz

Looking forward to your reply sincerely

FelixKrueger commented 9 months ago

The most likely issue here is that the trimming you performed somehow interfered with the synchronisation between R1 and R2: both files need to have corresponding sequences in both files, and have the exact same number of reads in total.

I would recommend using the raw FastQ files once more, and using Trim Galore:

trim_galore --paired sample1_R1.fastq.gz sample1_R2.fastq.gz

followed by a paired-end mapping command (maybe you could relax the mapping stringency a little, e.g. to --score_min L,0,-0.4. This should really result in a very similar mapping efficiency you observe with single-end alignments. Which type of kit did you use?

changchuanjun commented 9 months ago

Hello,I am very delighted to receive your reply.I am so sorry to reply your message lately because of Spring Festival (Chinses New Year) vacation. I have found the reason why the mapping efficiency is very low(just 0).I used incorrect command when I ran the fastp script to clean the raw data. the detailed information below.The paired input file -I and -i are the same file. cf9dc67e70694c340f9579f16a5d944 1708691770071

when I used the correct command to execute my script,I found there was just little different no matter fastp or trim_galore in bismark's mapping efficiency.

fastp: 1708690102511 1708690216424 1708691602171

trimgalore: 1708691127475 1708691315558 1708691215145

FelixKrueger commented 9 months ago

So can we say this is a success all-around? All the best going foward!

changchuanjun commented 9 months ago

Yes, exactly. Besides, I also used bsmap software to run pipeline.It also demonstrated there was just a little different no matter fastp or trim_galore to treat the raw data in final mapping efficiency result.

1708696192841 1708696124016

Interestingly, bsmap's mapping efficiency was sightly high than bismark's mapping efficiency when used the same input file.

FelixKrueger / Bismark

Extremely low mapping efficiency using bowtie2 and reference genome with many scaffolds #639