FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

PBAT libraries and read number #422

Closed mo716 closed 3 years ago

mo716 commented 3 years ago

Hi!! I am struggling a bit with the alignment of my 150 paired-end read PBAT libraries. I removed any adapters and trimmed 9 basepairs from the 5' and 3' end of both reads using TrimGalore v.0.6.6 as follows:

trim_galore -j 4 --paired --basename ${OUTPUT} --clip_r1 9 --clip_r2 9 --three_prime_clip_r1 9 --three_prime_clip_r2 9 --path_to_cutadapt /PATH/TO/CUTADAPT ${LIB1} ${LIB2}

Then I aligned the reads with Bismark v0.23.0 using the --pbat and --unmapped options as shown below:

bismark --parallel 4 --pbat --unmapped -p 20 --genome /PATH/TO/GENOME/ -1 ${READ1} -2 ${READ2}

Nevertheless, the average mapping efficiency for most libraries with this method is around 0.2%. Here is one example of a PE alignment report for sample E11:

... Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--pbat' specified: alignments to original strands (OT, OB) were ignored (i.e. not performed)

Final Alignment report

Sequence pairs analysed in total: 4551369 Number of paired-end alignments with a unique best hit: 7183 Mapping efficiency: 0.2% Sequence pairs with no alignments under any condition: 4543120 Sequence pairs did not map uniquely: 1066 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 3506 (complementary to (converted) top strand) GA/CT/GA: 3677 (complementary to (converted) bottom strand) CT/GA/GA: 0 ((converted) bottom strand) ...

I proceeded with aligning the unmapped reads separately as single end reads for the same sample (E11) as the one above based on the "Dirty Harry" method:

bismark --parallel 4 --pbat -p 20 --genome /PATH/TO/GENOME/ --se E11_unmapped_reads_1.fq.gz bismark --parallel 4 -p 20 --genome /PATH/TO/GENOME/ --se E11_unmapped_reads_2.fq.gz

The mapping efficiency for the SE alignments were still very low:

Unmapped READ1 SE output (--pbat)

... Option '--pbat' specified: alignments to original strands (OT and OB) strands were ignored (i.e. not performed) Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals

Final Alignment report

Sequences analysed in total: 4544186 Number of alignments with a unique best hit from the different alignments: 30081 Mapping efficiency: 0.7% Sequences with no alignments under any condition: 4504678 Sequences did not map uniquely: 9427 Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 0 ((converted) top strand) CT/GA: 0 ((converted) bottom strand) GA/CT: 15054 (complementary to (converted) top strand) GA/GA: 15027 (complementary to (converted) bottom strand) ...

Unmapped READ2 SE output (--directional)

... Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed) Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals

Final Alignment report

Sequences analysed in total: 4544186 Number of alignments with a unique best hit from the different alignments: 29948 Mapping efficiency: 0.7% Sequences with no alignments under any condition: 4503955 Sequences did not map uniquely: 10283 Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 14939 ((converted) top strand) CT/GA: 15009 ((converted) bottom strand) GA/CT: 0 (complementary to (converted) top strand) GA/GA: 0 (complementary to (converted) bottom strand) ...

Since this looked a bit strange, I then repeated the SE alignment with the --pbat option for READ2 instead (and directional or default for READ1). Here are the results:

Unmapped READ1 SE OUPUT (--directional)

.. Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed) Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals

Final Alignment report

Sequences analysed in total: 4544186 Number of alignments with a unique best hit from the different alignments: 2767606 Mapping efficiency: 60.9% Sequences with no alignments under any condition: 376019 Sequences did not map uniquely: 1400561 Sequences which were discarded because genomic sequence could not be extracted: 1

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 1382667 ((converted) top strand) CT/GA: 1384938 ((converted) bottom strand) GA/CT: 0 (complementary to (converted) top strand) GA/GA: 0 (complementary to (converted) bottom strand) ...

Unmapped READ2 SE output (--pbat)

... Option '--pbat' specified: alignments to original strands (OT and OB) strands were ignored (i.e. not performed) Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals

Final Alignment report

Sequences analysed in total: 4544186 Number of alignments with a unique best hit from the different alignments: 2760760 Mapping efficiency: 60.8% Sequences with no alignments under any condition: 378322 Sequences did not map uniquely: 1405104 Sequences which were discarded because genomic sequence could not be extracted: 2

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 0 ((converted) top strand) CT/GA: 0 ((converted) bottom strand) GA/CT: 1378930 (complementary to (converted) top strand) GA/GA: 1381828 (complementary to (converted) bottom strand) ...

The mapping efficiency improved and raised to 60%. So I thought that maybe there was a problem with READ1 actually being READ2 and vice-verse, but that is not the case. I checked the fastq file info for READ1 and they all have a "1" in the 8th column info.; whereas READ2 has a "2".

On a final attempt to understand this better, I ran the PE alignment again for sample E11 but with READ1 as READ2 and vice-verse:

bismark --parallel 4 --pbat --unmapped -p 20 --genome /PATH/TO/GENOME/ -1 ${READ2} -2 ${READ1}

The output from this is:

... Bismark was run with Bowtie 2 against the bisulfite genome of /PATH/TO/GENOME/ with the specified options: -q --score-min L,0,-0.2 -p 20 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--pbat' specified: alignments to original strands (OT, OB) were ignored (i.e. not performed)

Final Alignment report

Sequence pairs analysed in total: 4551369 Number of paired-end alignments with a unique best hit: 2954431 Mapping efficiency: 64.9% Sequence pairs with no alignments under any condition: 592466 Sequence pairs did not map uniquely: 1004472 Sequence pairs which were discarded because genomic sequence could not be extracted: 2

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 1475835 (complementary to (converted) top strand) GA/CT/GA: 1478594 (complementary to (converted) bottom strand) CT/GA/GA: 0 ((converted) bottom strand) ...

This looks quite decent, without having to go trough the "Dirty Harry" method. I am working with an allopolyploid, highly repetitive plant genome, so mapping efficiencies above 60% for a methylation based alignment are quite good! Nevertheless, I do not understand why READ1 seems to function as READ2 and the other way around. The kit used for the bisulfite convertion is EZ DNA Methylation Gold Kit (Cat No. D5005) from Zymo Research.

I would really appreciate your feedback on this. I did a fastqc for all libraries and merged them with multiqc, so I can send it to you if it helps. Thank you very much for your patience and any help! Cheers,

FelixKrueger commented 3 years ago

Hi @mo716,

From what you are describing, the most likely explanation would be: your libraries do not appear to look like PBAT libraries! The easiest way to check this would be to look at the base composition plot of Read 1. If it is low in C, and high in G, the library is directional. Read 2 will be its reverse complement, so low in G and hight in C.

The simplest solution would then be to simply run the samples in default (directional) mode (even though I think reversing R1 and R2 and using PBAT mode will do the same thing, only the strand assignments will be to the CTOT and CTOB strands instead of the intended OT and OB strands).

If the libraries aren't PBAT, I wonder if the trimming wasn't a bit excessive (you are losing 36 bp for each read pair outright)? I'd be happy to look at some QC plots, or if a send a few reads over the (200K should be enough, and fit into an email) I can run a quick test over here.

Finally, I don't think -p 20 won't help you a great deal, I think there is a diminishing return after anything more than -p 4. So maybe you can save a few resources.

mo716 commented 3 years ago

Hi @FelixKrueger ! Thank you very much for your fast reply and help! I just sent you by e-mail the fastqc files and some small subsets of the raw and trimmed reads for a given sample.

FelixKrueger commented 3 years ago

Thanks for that, I'm on it (I already think I know what it is, will report back within the hour).

FelixKrueger commented 3 years ago

Hi again,

I am pretty certain your libraries were constructed with the Accel Swift kit. I believe this kit uses a post-bisulfite method, but it does not in fact give rise to PBAT style libraries. Instead, it creates directional libraries, with a very pronounced bias of Gs at the start of Read 2.

Therefore, the right way to go about this is to trim the start of both R1 and R2 (which you have accidentally already done, even though not completely), and potentially also the last bases on the 3' end to avoid reading into the biases from the other read in case of read-through, but I am not sure if that is so relevant. We have written this up a little more here: https://github.com/FelixKrueger/Bismark/blob/master/Docs/README.md#ix-notes-about-different-library-types-and-commercial-kits

So I have just used a very quick trimming run like this:

trim_galore --clip_r1 10 --clip_r2 15 --three_prime_clip_r1 10 --three_prime_clip_r2 10 --paired raw_E11_1W1_L1_R1.fastq.gz raw_E11_1W1_L1_R2.fastq.gz

and then ran a standard Bismark command on the rapeseed genome:

 bismark --genome genome/ -1 raw_E11_1W1_L1_R1_val_1.fq.gz -2 raw_E11_1W1_L1_R2_val_2.fq.gz

This results in a ~62% mapping efficiency (and ~23% ambiguous) alignments. So I would say: all good, swap the command line options, and all will be fine!

Cheers, Felix

mo716 commented 3 years ago

Hi @FelixKrueger ! Thank you very much, that was indeed helpful. I will trim the reads again based on your recommendations for the Accel Swift kit and re-do the alignments as you suggested Have a great day! :)