FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Multi-mapping reads and low mapping efficiency #656

Open hAmbreen opened 4 months ago

hAmbreen commented 4 months ago

I am currently using Bismark to analyze DNA methylation data from root/nodule samples from a legume plant species and am facing issues regarding multi-mapping reads and low mapping efficiencies. My PE100 libraries were prepared using TruSeq DNA Methylation kit thus, are directional. The spike-in control is lambda phage. FastQC of the samples allowed me to realize that a few bases at the 5' end were creating trouble. I have tried running Bismark in various permutations using both clipped (9N and 6N trimmed form 5' and 3'end, respectively) and non-clipped fastq files. However, in none of the attempts, my mapping efficiency could go beyond ~40-45%. The conditions I have tried for Bismark run along with their mapping efficiencies are shown in the picture Screenshot 2024-02-19 at 1 56 59 pm :  Although reads are mapping (~79-90%) on the reference genome, the unique mapping efficiencies are low. A large part of the sequences are multimapping in nature. Moreover, the unique mapping goes down when clipped fastq files are used (see Condition 3 Vs 4). Even local alignment could not generate significantly high mapping statistics. Among these conditions, I get the highest mapping percent only if I relax the stringency to a scoremin of 0.6. Could you please recommend amendments or conditions to improve the mapping efficiencies of unique reads? Any insights or considerations for solving this low efficiency would be highly appreciated.  I have also forwarded you an email related to this with some sample sequences in case you would like to take a look at the data. Thanks in advance.

FelixKrueger commented 4 months ago

There is quite a lot going on in this list, not quite sure I can give you a good recommendation. I have a few comments:

I don't think I received any email with reads, just in case you wanted to re-send them? (I would also need to know what the genome is you are trying to align to).

hAmbreen commented 4 months ago

Hi Felix, My bad! I realized that I should have explained it in a better way.

1. the TruSeq kit should be directional, correct? Could you attach the base composition plot of FastQC (of the untrimmed reads) to take a look? If it is directional, you should not use it non-directional

I agree with you that the libraries are directional and I should stick to their analysis in that mode. I tried running the non-directional mode just to be sure that it was not creating a lot of difference in mapping efficiency. I am attaching the base composition plot of the untrimmed reads for your reference.

Base_composition_untrimmed_reads

2. if the kit is directional, you shouldn't get any alignments in --pbat mode. Your number aligned reads all look fairly similar?

Regarding the PBAT run, instead of taking the reads as paired-end, I treated them as single reads and ran in the PBAT mode trying something similar to what we do in the Dirty Harry approach.

  1. if the reference genome of your legume isn't all that good, you might want to relax the alignment somewhat (score_min). It is no surprise you get increasing values allowing 2 or 3 times as many mismatches as the default setting. Yes, the reference genome is not very contiguous with several scaffolds.

  2. Do you see in the Bismark report that 30-50% of reads are getting discarded due to ambiguous mapping? Ideally could you link of the reports? Yes, there are lots of sequence pairs marked as non-unique. I am attaching report from one of the libraries here. WBS-1a_bismark_alignment_bt2_PE_report.txt

I have also sent you an email with the sample reads with information about the reference genome.

Thanks again!

FelixKrueger commented 4 months ago

Thanks for providing additional details. The base composition plot is quite informative, as it shows:

Looking at the alignment report you seem to have a split of roughly the following alignments:

So overall, you've got >90% of reads originating from your plant, which is good news. I am afraid there isn't much we can do about the multi-mapping of reads. Either the reference genome you are working with is still a bit crude, with similar multi-mapping scaffolds still included, or it really is that repetitive....

I just discovered your email in my Spam folder, but I don't think there is a lot else I could contribute currently, let me know your thoughts.