FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
394 stars 103 forks source link

Trimming for Zymo Pico Methyl-Seq #398

Closed franrodalg closed 3 years ago

franrodalg commented 3 years ago

Hi Felix,

I'm trying to align a publicly available dataset, and so far I'm getting quite abysmal alignment rates (around 8%). Doing a bit of research, I noticed that the bisulfite conversion kit they used appears listed as one of the special cases in the documentation (see https://github.com/FelixKrueger/Bismark/tree/master/Docs#zymo-pico-methyl-seq). It mentions there that alignment should be performed using the "--non_directional" option, but I wonder if that's also the case in trim_galore. I've tried to do so, but I've then been asked to request --rrbs mode as well. Does that make sense? Or the trimming doesn't need to be "non-drectional"?

Cheers, Fran

FelixKrueger commented 3 years ago

Hi Fran,

As far as I am aware the Pico-kit does need fairly extensive trimming (see below), but the option for non-directional trimming is only relevant for RRBS samples. I think a good start with the trimming would be:

trim_galore --paired --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 10 --three_prime_clip_r2 10 file1.fastq.gz file2.fastq.gz

But you can also look at the FastQC base composition profile, and/or the M-bias profile to trim even more if needed.

The alignments then need to be carried out in Bismark using --non_directional. Fingers crossed the mapping will be a little better!?

franrodalg commented 3 years ago

Thanks, Felix. I'll try as you suggest and let you know!

franrodalg commented 3 years ago

Hi Felix! It improved a bit, but not substantially. The mapping efficiency now is around 14%. Would you suggest trimming even further?

FelixKrueger commented 3 years ago

Hmm, trimming is most likely not going to help you a lot more, if you could attach the untrimmed and trimmed FastQC html profiles I could maybe take a look to see if I can spot any other problems...

I would probably recommend you read the considerations here: https://github.com/FelixKrueger/Bismark/blob/master/Docs/FAQ.md#thoughts-and-considerations-regarding-single-cell-and-pbat-libraries-september-18-2019 to see if one of the options works in your hands?

franrodalg commented 3 years ago

In the original paper they use a pretty old version of Bismark (version: 0.7.12; parameter settings: ‘-n 2 -l 40’, which I think means they used bowtie, not bowtie2, under the hood). Not sure if that should affect in any way the success of the alignment.

I attach the htmls for the fastqc reports. Hope that helps.

fastqcs.zip

Cheers!

FelixKrueger commented 3 years ago

Hi Francisco,

I have spent a bit more time on this, here are my findings:

Final Alignment report
======================
Sequences analysed in total:    52718120
Number of alignments with a unique best hit from the different alignments:  34232576
Mapping efficiency: 64.9%
Sequences with no alignments under any condition:   13338553
Sequences did not map uniquely: 5146991
Sequences which were discarded because genomic sequence could not be extracted: 14

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  17447828    ((converted) top strand)
CT/GA:  16784734    ((converted) bottom strand)
GA/CT:  0   (complementary to (converted) top strand)
GA/GA:  0   (complementary to (converted) bottom strand)

I wonder whether aligning this data to a 129 genome (could be constructed easily using the SNPsplit_genome_preparation) would be a little better?

Final Alignment report
======================
Sequences analysed in total:    200000
Number of alignments with a unique best hit from the different alignments:  13873
Mapping efficiency: 6.9%
Sequences with no alignments under any condition:   152886
Sequences did not map uniquely: 33241
Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  453 ((converted) top strand)
CT/GA:  503 ((converted) bottom strand)
GA/CT:  6431    (complementary to (converted) top strand)
GA/GA:  6486    (complementary to (converted) bottom strand)

The attached MultiQC report for SE mode was carried out in default (directional) mode, so please disregard this. However in the FastQ Screen plot, Bismark is run in non-directional mode, which shows that R2 aligns to mouse in only around 25%.

multiqc_report.zip

In short: This library looks well messed up, but I am afraid I cannot offer good reason why this is so. Maybe it would be worth getting back to the authors about it? For the time being, I think aligning Read 1 in single-end mode might yield pretty good results. But given that something very weird must have happened to the library in the first place I am not sure whether you could really trust this sample to be SRR5504109_Serum_LIF_-GMEM_129_MethylC-seq as it says on the tin....

I hope this helps?

franrodalg commented 3 years ago

Oh, wow, Felix! That's amazingly useful, thanks!!!

I think we're going to discard this dataset altogether. As you say, even if we aligned only Read 1 in single-end mode, we couldn't trust that the library is what it is supposed to be.

Cheers!!!

FelixKrueger commented 3 years ago

I would almost certainly do the same (and contact the authors so they fix it somehow :)). Good luck!