Pair-end mode has low mappability; but single mode has high mappability.

FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

http://felixkrueger.github.io/Bismark/

GNU General Public License v3.0

386 stars 101 forks source link

Pair-end mode has low mappability; but single mode has high mappability. #477

Closed freyayang closed 2 years ago

freyayang commented 2 years ago

Hi, I used bismark to do the alignment, but the 'Mapping efficiency' varies from 19.1% to 48.4%. My reads are pair-end, genome is Oryza Sativa. I read some previous issues and your answers about low alignment efficiency problem. So I tried to align in single-end mode with the sample which has extreme low mapping efficiency (19%).

I aligned R1 with directional mode, command as follow: bismark --genome /data1/ref/os/ ZH11-2_L2_A002.cutadapt.1.fastq R1 mappability: 48.9%

then R2 with --pbat mode: bismark --pbat --genome /data1/ref/os/ ZH11-2_L2_A002.cutadapt.2.fastq R2 mappability：38.1%

Both single reads seems have pretty good mappability. But when run in pair-end mode, the mappability is low (19%). How should I solve this problem? Should I run methylation_exactor of both single end? Does it matter if some of my samples go through pair-end mode, but some of samples go through single-end mode?

Thank you so much!

Best,

FelixKrueger commented 2 years ago

Given that your R1 achieves a 10% higher mapping rate is indicative of technical issues with R2. Could you post the the R1 and R2 base composition plots (of FastQC here to take a look?), this could also be caused by the Accel Swift library prep kit. Here are all the usual remedies I attempt for PE data, before I consider a single-end strategy: https://github.com/FelixKrueger/Bismark/blob/master/Docs/FAQ.md#low-mapping-effiency-of-paired-end-bisulfite-seq-sample.

Generally, 2 single-end alignment in the way you posted would also work. One thing this doesn't take account of are situations where R1 and R2 are overlapping, thus producing an inflated base coverage of overlapping regions (these are removed in paired-end data). You might observe a slight difference in mapping between SE and PE data, so if you can avoid it at all I would try to stick to the same method in order to minimise systematic biases. But technically, once single cell methylation calls have been extracted, there is no difference between SE or PE data.

freyayang commented 2 years ago

Thank you for the reply. The base composition plots (obtained from FastQC 'per_base_sequence_content') result is as follow:

R1.fastq without trim adapter: per_base_sequence_content R2.fastq without trim adapter:

I will try other suggestions in your link. I trimed the adapter with cutadapt but not 'TrimGalore'. And I will try to loose the stringency.

If I use the SE mode to do the mapping, and extract the methylation of both single-end alignments. Then how to merge two .cov files? Can I just simply sum the methylated Cs (or unmethylated Cs) of R1 and R2 files?

Sincerely,

FelixKrueger commented 2 years ago

Thanks for the composition plots; this doesn't look like the Accel Swift kit, but there are some biased bases in there nevertheless. I would probably consider to trim off a few bases from the 5' ends to see if that has a beneficial impact on mapping (e.g. --clip_r1 6 --clip_r2 6, or 10bp?).

In terms of merging, yes you can simply sum up the calls from both, but keep in mind that overlapping portions of reads will receive 2 (identical) methylation calls, giving them an apparent deeper coverage (I personally would probably try to make the paired-end experiment work as well as possible, or use R1 as single-end experiment only).

freyayang commented 2 years ago

Hi Felix,

I trimed off 10 bp from 5' ends with Trim-galore as you suggested.

Command is as below: trim_galore --phred33 --paired --illumina --stringency 3 --clip_R1 10 --clip_r2 10 -j 2 ZH11-2.R1.fastq ZH11-2.R2.fastq
Fastqc results showed better base composition:

But the alignment results is still 19.1% mapping efficiency.

And I also tried to loose the stringency when aligning:

bismark --score_min L,0,-0.6 --multicore 3 --genome /data1/ref/os/ -1 ZH11-2.cutadapt.1.fastq -2 ZH11-2.cutadapt.2.fastq

The mapping efficiency became 24.1% this time.

Do you have any other idea to make the experiment work better?

Best,

FelixKrueger commented 2 years ago

Hmm, so it seems that your single-end mapping efficiencies are nearly twice the paired-end efficiency... You could try increase the fragment size (e.g. -X 1000) to see if your fragments are larger than the default allowed maximum fragment size of 500bp.

Alternatively, chimeric reads could be the reason (see here: https://sequencing.qcfail.com/articles/pbat-libraries-may-generate-chimaeric-read-pairs/). Could try to see if --local would help with that? Of course there might also be contaminants in the file, which may be tricky to spot. If you wanted you could send me a small portion of the untrimmed FastQ files (no more than 200K reads, .gz compressed), and I could run it through our species contamination screen?

freyayang commented 2 years ago

Here is the fastq files of R1 and R2 less than 200K. It's so kind of you to take a look!

ZH11-2_L2_A002.R1.cut.fastq.gz ZH11-2_L2_A002.R2.cut.fastq.gz

Thank you so much! Best,

FelixKrueger commented 2 years ago

I have now run a few tests over here, and it does indeed look like the paired-end mapping efficiency keeps being low (16-22% with increasingly relaxed mapping parameters). I also tried trimming off 15bp from both ends of the reads, but that didn't improve the results substantially.. Specifying -X 1000 has no effect on this, so the low efficiency is not caused by overly long fragments in the your library.

Using FastQ Screen I could also not find any of the usual contaminants we typically test for.

I then moved on to run single-end alignments, and got 40, 50 and 60% for R1, and 37, 45 and 50 % for R2 (in --pbat mode). This shows that there is something within R2 that causes it to align worse when compared to R1, and I would imagine that this is also causing issues with the paired-end alignment.

You do now have have several options:

you could align the data as single-end, but as pointed out above this would mean that you might over-represent regions of reads where R1 and R2 overlap
you could align the data in PE mode first, while specifying --unaligned. This would allow you to treat the PE files on their own (i.e. with overlap removal), and you could then take the unmapped reads and align those in single-end mode.
You could try to use the option --local, which would perform local alignments instead of end-to-end alignments. This mode increased the mapping efficiency to 70% for the SE mode, and 48% in paired-end mode. I would probably try to understand where the reads align that do not align in end-to-end mapping.

So in a nutshell: the data doesn't raise any particular red flags and I can't see anything obvious that would explain why the mapping efficiency isn't great :(

freyayang commented 2 years ago

Hi Felix，

Thanks a lot for putting in so much effort.

I will use the PE mode with loosed stringency to do the alignment, and remove the overlapped region. Then align the unmapped reads with SE mode. However, I guess this will still produce an inflated base coverage.

For the --local option, I do not well understand. I read the article about chimeric read pairs, I saw it is the problem for PBAT and single-cell seq, but my library is WGBS. Does --local option mean R1 reads and R2 reads may map to different location during mapping, but the 48% mapping efficiency you mentioned are the reads that R1 and R2 mapped to the same location? Seems --local option can give better results than the previous method?

Thanks again for your suggestions.

Best,

FelixKrueger commented 2 years ago

I will use the PE mode with loosed stringency to do the alignment, and remove the overlapped region. Then align the unmapped reads with SE mode. However, I guess this will still produce an inflated base coverage.

Maybe a little, but the results should be certainly better than just SE alignments straight away. Bu arguably, this method is more work as it requires you to process a portion of the data to as PE, the other one as SE (you can merge the methylation calls at the bismark2bedGraph stage to generate a single coverage file).

What the --local option does is to allow reads to get soft-clipped on the 5', 3', or both, ends. This may help whenever something weird or unexpected happens in your library prep, or possibly also when the reference genome is very polished yet and the organism you are sequencing is quite divergent from the reference. What it does not allow still is to align a read or read pair to more than one location at the same time. This may happen however if you take R1 and R2 for reads that do not align in paired-end mode, but they might end up in different locations in single-end mode. If you wanted to you could investigate this in more detail, maybe it will tell you something about what went wrong during the library prep...? Then again, it might not be very eye opening....