Multi-mapping reads and low mapping efficiency

hAmbreen commented 4 months ago

I am currently using Bismark to analyze DNA methylation data from root/nodule samples from a legume plant species and am facing issues regarding multi-mapping reads and low mapping efficiencies. My PE100 libraries were prepared using TruSeq DNA Methylation kit thus, are directional. The spike-in control is lambda phage. FastQC of the samples allowed me to realize that a few bases at the 5' end were creating trouble. I have tried running Bismark in various permutations using both clipped (9N and 6N trimmed form 5' and 3'end, respectively) and non-clipped fastq files. However, in none of the attempts, my mapping efficiency could go beyond ~40-45%. The conditions I have tried for Bismark run along with their mapping efficiencies are shown in the picture Screenshot 2024-02-19 at 1 56 59 pm : Although reads are mapping (~79-90%) on the reference genome, the unique mapping efficiencies are low. A large part of the sequences are multimapping in nature. Moreover, the unique mapping goes down when clipped fastq files are used (see Condition 3 Vs 4). Even local alignment could not generate significantly high mapping statistics. Among these conditions, I get the highest mapping percent only if I relax the stringency to a scoremin of 0.6. Could you please recommend amendments or conditions to improve the mapping efficiencies of unique reads? Any insights or considerations for solving this low efficiency would be highly appreciated. I have also forwarded you an email related to this with some sample sequences in case you would like to take a look at the data. Thanks in advance.

FelixKrueger commented 4 months ago

There is quite a lot going on in this list, not quite sure I can give you a good recommendation. I have a few comments:

the TruSeq kit should be directional, correct? Could you attach the base composition plot of FastQC (of the untrimmed reads) to take a look? If it is directional, you should not use it non-directional
if the kit is directional, you shouldn't get any alignments in --pbat mode. Your number aligned reads all look fairly similar?
removing a bias from the reads seems like a good idea (if there was one), so using the correctly trimmed input files are the way to go
if the reference genome of your legume isn't all that good, you might want to relax the alignment somewhat (score_min). It is no surprise you get increasing values allowing 2 or 3 times as many mismatches as the default setting.
Do you see in the Bismark report that 30-50% of reads are getting discarded due to ambiguous mapping? Ideally could you link of the reports?

I don't think I received any email with reads, just in case you wanted to re-send them? (I would also need to know what the genome is you are trying to align to).

hAmbreen commented 4 months ago

Hi Felix, My bad! I realized that I should have explained it in a better way.

1. the TruSeq kit should be directional, correct? Could you attach the base composition plot of FastQC (of the untrimmed reads) to take a look? If it is directional, you should not use it non-directional

I agree with you that the libraries are directional and I should stick to their analysis in that mode. I tried running the non-directional mode just to be sure that it was not creating a lot of difference in mapping efficiency. I am attaching the base composition plot of the untrimmed reads for your reference.

2. if the kit is directional, you shouldn't get any alignments in --pbat mode. Your number aligned reads all look fairly similar?

Regarding the PBAT run, instead of taking the reads as paired-end, I treated them as single reads and ran in the PBAT mode trying something similar to what we do in the Dirty Harry approach.

if the reference genome of your legume isn't all that good, you might want to relax the alignment somewhat (score_min). It is no surprise you get increasing values allowing 2 or 3 times as many mismatches as the default setting. Yes, the reference genome is not very contiguous with several scaffolds.
Do you see in the Bismark report that 30-50% of reads are getting discarded due to ambiguous mapping? Ideally could you link of the reports? Yes, there are lots of sequence pairs marked as non-unique. I am attaching report from one of the libraries here. WBS-1a_bismark_alignment_bt2_PE_report.txt

I have also sent you an email with the sample reads with information about the reference genome.

Thanks again!

FelixKrueger commented 4 months ago

Thanks for providing additional details. The base composition plot is quite informative, as it shows:

that the first 8bp looks fairly different, and I would recommend trimming them prior to alignment (--clip_r1 8 --clip_r2 8)
from base ~9 onwards, the level of cytosines in the reads is ~8%; this is unusually high for directional reads, at least in a mammalian system where non-CG methylation is typically very low. Based ton the alignment report you showed you see to have ~80% methylation in CHG and CpG context, so this might actually be alright...

Looking at the alignment report you seem to have a split of roughly the following alignments:

unique: 45%
ambiguous: 45% (discarded)
unaligned: 10%

So overall, you've got >90% of reads originating from your plant, which is good news. I am afraid there isn't much we can do about the multi-mapping of reads. Either the reference genome you are working with is still a bit crude, with similar multi-mapping scaffolds still included, or it really is that repetitive....

I just discovered your email in my Spam folder, but I don't think there is a lot else I could contribute currently, let me know your thoughts.

FelixKrueger / Bismark

Multi-mapping reads and low mapping efficiency #656