SaraValsoni commented 1 year ago

Hello! I’m using Bismark to analyze whole genome EM-seq and targeted BS-seq (after the bisulfite conversion the region of interest is amplified before the sequencing). For both the experiment I have paired-end sequencing. I noticed that in the whole genome EM-seq I have the methylation status for each of the 2 positions of a CG dinucleotide when looking at the cov.gz output produced for the CpG context and also if I check the alignments on IGV. Is it normal since I have the C in the second position on the opposite strand? Whereas for the targeted BS-seq I have the methylation status only for the position where there is a C on the reverse strand. I have 3 different experiments and the situation is always the same. I also noticed that the Bismark report shows the majority of the read mapping to the converted bottom strand for 2 of them, the third has more balanced percentages for the OT and the OB. Do you think this situation could be due to the different protocol for the conversion or to the whole genome/targeted approach? Thank you! Sara

FelixKrueger commented 1 year ago

Hi Sara,

This is indeed to be expected. Bisulfite conversion takes place on single-stranded DNA, so sequencing the top or bottom strand (converted) sequence essentially happens 50% of the time (if we assume that both converted strands are not unqequally stable). Therefore, in genome-wide essays (such as EM-seq) you tend to see a roughly 1:1 split of OT and OB strands.

For amplicon based sequencing assays, you tend to select one strand to design the primers against, typically the top strand sequence. Since bisulfite conversion renders the top and bottom strands no longer complementary, you end up amplifying - and sequencing - only one of two strands. Which strands end up in the library depend on the library prep method. If you run Bismark with --non_directional you can quickly find out which strands you have, and which information you want to keep.

SaraValsoni commented 1 year ago

Hi Felix,

Thank you for the reply and the explanation. Below you can find the report of Bismark for one of the target bisulfite samples when running the alignment step in directional mode as I did for the whole genome EM-seq and with the --non_directional parameter.

1) directional:

Final Alignment report

Sequence pairs analysed in total: 1432634 Number of paired-end alignments with a unique best hit: 723725 Mapping efficiency: 50.5% Sequence pairs with no alignments under any condition: 702903 Sequence pairs did not map uniquely: 6006 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 30525 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 693200 ((converted) bottom strand)

2) non directional:

Final Alignment report

Sequence pairs analysed in total: 1432634 Number of paired-end alignments with a unique best hit: 1402587 Mapping efficiency: 97.9% Sequence pairs with no alignments under any condition: 29150 Sequence pairs did not map uniquely: 897 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 301 ((converted) top strand) GA/CT/CT: 498 (complementary to (converted) top strand) GA/CT/GA: 762287 (complementary to (converted) bottom strand) CT/GA/GA: 639501 ((converted) bottom strand)

As you can see the majority of the read is mapping to the OB strand and with the --non_directionl parameter the mapping efficiency is higher. In your opinion, would it be better to use the --non_directional parameter for this type of experiments?

Moreover, I saw that all target bisulfite samples have a similar situation but for some of them the percentage of read mapping to the OT is higher such as in the following example.

1) directional:

Final Alignment report

Sequence pairs analysed in total: 1187525 Number of paired-end alignments with a unique best hit: 613625 Mapping efficiency: 51.7% Sequence pairs with no alignments under any condition: 568402 Sequence pairs did not map uniquely: 5498 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 226830 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 386795 ((converted) bottom strand)

2) non directional

Final Alignment report

Sequence pairs analysed in total: 1187525 Number of paired-end alignments with a unique best hit: 1115340 Mapping efficiency: 93.9% Sequence pairs with no alignments under any condition: 69932 Sequence pairs did not map uniquely: 2253 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 175256 ((converted) top strand) GA/CT/CT: 176882 (complementary to (converted) top strand) GA/CT/GA: 413922 (complementary to (converted) bottom strand) CT/GA/GA: 349280 ((converted) bottom strand)

Do you think it could be related to different efficiency of bisulfite conversion?

Thank you, Sara

FelixKrueger commented 1 year ago

Hi again, apologies for the late reply.

From the targeted reports you linked, it it very obvious from the non-directional alignments that the methylation information is targetting the bottom strand only. You should indeed use the OB and CTOB methylation calls together. The few alignments to the OT and CTOT strands (301 and 498) are probably some sort of artefact. If I were you I would probably simply delete the C*_OT_* and C*_CTOT_* files, and then run bismark2bedGraph with the remaining files. If this seems too much effort then you can also leave these files in, I suppose 1000 spurious alignments in 1.4 million correct ones won't matter too much....

Regarding the second example, this is a little more weird. Assuming the libraries have all been constructed in the same way, I would also assume that the majority of the reads should align to the OB and CTOB strands. If you do not carry out alignments to CTOB (such as in the default directional situation), you might find a higher number of reads aligning to OT, simply because the correct genome to align to (the of for the GA/CT/GA alignments) isn't present.

It would be interesting to see where these increased numbers of OT/CTOT alignments align to. You could simply use e.g. SeqMonk, import only the CTOT and OT methylation calls (using the generic text import), and check. Do they align to the regions you targetted? Do they align in a targetted way, but elsewhere/off-target? Or do these reads/call align randomly scattered over the genome, a bit more like higher background of reads?

FelixKrueger / Bismark

Methylation status in different positions for EM-seq and BS-seq #560

Final Alignment report

Final Alignment report

Final Alignment report

Final Alignment report