Strangely not getting any mapping for the bottom strand

PoGibas commented 4 years ago

I have a strange problem - mapping for the bottom strand is not reported even though I can "see" bottom strand reads in the fastq file.

Background Got reads generated by Ion Torrent PGM, they came from a small region (added as seq.fasta). Tried to map them using default parameters or with --non_directional, however I'm not getting any mapping for the bottom strand (added Bismark reports):

# Report 1
Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  2433    ((converted) top strand)
CT/GA:  0       ((converted) bottom strand)
GA/CT:  11901   (complementary to (converted) top strand)
GA/GA:  0       (complementary to (converted) bottom strand)

# Report 2
CpG_OT_reads_bismark_bt2.txt contains data ->   kept
CpG_CTOT_reads_bismark_bt2.txt contains data -> kept
CpG_CTOB_reads_bismark_bt2.txt was empty ->     deleted
CpG_OB_reads_bismark_bt2.txt was empty ->       deleted
CHG_OT_reads_bismark_bt2.txt contains data ->   kept
CHG_CTOT_reads_bismark_bt2.txt contains data -> kept
CHG_CTOB_reads_bismark_bt2.txt was empty ->     deleted
CHG_OB_reads_bismark_bt2.txt was empty ->       deleted
CHH_OT_reads_bismark_bt2.txt contains data ->   kept
CHH_CTOT_reads_bismark_bt2.txt contains data -> kept
CHH_CTOB_reads_bismark_bt2.txt was empty ->     deleted
CHH_OB_reads_bismark_bt2.txt was empty ->       deleted

As we can see there were no reads for the bottom strand. I manually tried to track down reads that should come from the bottom strand (looked into --unmapped result). Bellow I'm attaching fasta and one unmapped read:

# Fasta (seq.fasta)
>my_seq
AAAAGTCACTGCTGGTGGGAGAGGCTGAGTGCACAGAGCCAAGCACACACGTTAACAAGAGCCGACAGCGTCAAAATCGCAAGAACAGAGGAAAGCCACTGACgactgaggagatggctcagttcatagagtgcttgccatacagagaggaggacccgagctcaaatgccTGAGGTGTACTGGCCAGCCAGTGACAATCCTGTCTCAAACAGTAAAGT

# Fastq (one read from reads.fastq)
@CDMLM:01424:12439
CCAATACACCTCAAACATTTAAACTCGAATCCTCCTCTCTATATAACAAACACTCTAATAAACTAAAACCATCTCCTCAATCGTCAATAAACTTTCCCTCTATTCTTACGATTTTAACGCTATCGACTCTTATTAACGTATATACTTAACTCTATACACTCAACCTCTCCCACCAACAATAACTTTTC
+
:7<7<<<<;7?776-485<.77/69;;;6;<6;;166999;:99>498>277556.-*-44.34445-72766673766/597744/422/111,00,0422436616242555)-*-553698778:::4:;5:5=:::<999;;7=7<<<;:9;;;;;;>1717;;;<3;;5;7::5:<5=;==0/

As we can see genomic sequence starts with AAAAGTCAC and read ends with ATAACTTTT and it should come from the bottom strand.

What could be the issue behind this?

Files: Added seq.fasta and reads.fastq. Commands:

bismark_genome_preparation ./
bismark ./ reads.fastq --unmapped --non_directional
bismark_methylation_extractor reads_bismark_bt2.bam --bedGraph
gunzip *gz
coverage2cytosine reads_bismark_bt2.bismark.cov -o FINAL --genome_folder ./

Bismark Version: v0.22.3

Archive.zip

FelixKrueger commented 4 years ago

Hi @PoGibas

I think the most likely explanation here is that you are trying to look at specific short regions of the genome that were sequenced using PCR amplification. In such amplicon experiments, a locus is typically targetted by looking at the top strand reference sequence, and designing primers to that locus. Alternatively, you might have used target enrichment before doing the bisulfite experiment, but the results would be conceptually the same.

Since genomic DNA loses its complementarity after once the DNA is bisulfite converted, you only ever PCR amplify the top strand and hence get only methylation information from the top strand. If you wanted to get information from the bottom strand as well you would need to use a second set of primers that will bind (exclusively) to the bottom strand once it is bisulfite converted.

The 'bottom' strand sequences you appear to be tracking in your FastQ file do not really originate from the bottom strand. Instead, these sequences are complementary to the original top strand (OT) (what we call CTOT, complementary to OT). While such CTOT sequences technically align in reverse orientation to a C>T converted version of the top strand, they carry methylation information derived from the top strand as well. Please see slide 14 in this presentation, which I hope will clear things up a little: https://www.bioinformatics.babraham.ac.uk/training/Methylation_Course/BS-Seq%20theory%20and%20QC%20lecture.pdf

Regarding your test case: The sequence you linked does not align at under normal conditions since the query sequence is 1 bp longer than the reference sequence (and such scenarios are not compatible with end-to-end alignments).

reference
>my_seq
AAAAGTCACTGCTGGTGGGAGAGGCTGAGTGCACAGAGCCAAGCACACACGTTAACAAGAGCCGACAGCGTCAAAATCGCAAGAACAGAGGAAAGCCACTGACgactgaggagatggctcagttcatagagtgcttgccatacagagaggaggacccgagctcaaatgccTGAGGTGTACTGGCCAGCCAGTGACAATCCTGTCTCAAACAGTAAAGT

@CDMLM:01424:12439
CCAATACACCTCAAACATTTAAACTCGAATCCTCCTCTCTATATAACAAACACTCTAATAAACTAAAACCATCTCCTCAATCGTCAATAAACTTTCCCTCTATTCTTACGATTTTAACGCTATCGACTCTTATTAACGTATATACTTAACTCTATACACTCAACCTCTCCCACCAACAATAACTTTTC
+
:7<7<<<<;7?776-485<.77/69;;;6;<6;;166999;:99>498>277556.-*-44.34445-72766673766/597744/422/111,00,0422436616242555)-*-553698778:::4:;5:5=:::<999;;7=7<<<;:9;;;;;;>1717;;;<3;;5;7::5:<5=;==0/

Using the additional option --local in this case would simply soft-clip the additional C of the read, and produce this alignment:

@HD     VN:1.0  SO:unsorted
@SQ     SN:my_seq       LN:218
@PG     ID:Bismark      VN:v0.22.1      CL:"bismark --genome genome/ test_read.fq --non_d --local"
CDMLM:01424:12439       0       my_seq  1       0       1S89M1I6M1I22M1I9M1I57M *       0       0       GAAAAGTTATTGTTGGTGGGAGAGGTTGAGTGTATAGAGTTAAGTATATACGTTAATAAGAGTCGATAGCGTTAAAATCGTAAGAATAGAGGGAAAGTTTATTGACGATTGAGGAGATGGTTTTAGTTTATTAGAGTGTTTGTTATATAGAGAGGAGGATTCGAGTTTAAATGTTTGAGGTGTATTGG /0==;=5<:5::7;5;;3<;;;7171>;;;;;;9:;<<<7=7;;999<:::=5:5;:4:::877896355-*-)5552426166342240,00,111/224/447795/66737666727-54443.44-*-.655772>894>99:;999661;;6<;6;;;96/77.<584-677?7;<<<<7<7: NM:i:40 MD:Z:6C1C2C12C6C1C4C0C3C1C1C7C5C3C5C7C5C9C0C1C6C11C1C4C8C3C0C3C11C0C4C1C5C0C9C3 XM:Z:.......h.x..x............x......h.x....hh...h.h.h.Z.....h.....xZ..x..Z..h.....Z.h.....x...........hh.x...Z..x............h.x....u.........h...hh...x...........hxZ...h.h.....hx.........x...    XR:Z:GA XG:Z:CT

And indeed, this sequence aligns in reverse orientation to the converted top strand, so the sequence is of strand CTOT, and thus informative for the top strand.

CT/CT:  0       ((converted) top strand)
CT/GA:  0       ((converted) bottom strand)
GA/CT:  1       (complementary to (converted) top strand)
GA/GA:  0       (complementary to (converted) bottom strand)

Does this make sense?

PoGibas commented 4 years ago

Thanks Felix!

That absolutely make sense. Issue is that this amplification should have had reads only from the bottom strand (according to the wet-lab people). With --local I'm getting 91% mapping efficiency and again only to the top strand. I guess that problem should be at the wet-lab side then? I understand that this is quite a statement, but programmatically I can't find what I should be getting.

FelixKrueger commented 4 years ago

I am sure it is some kind of confusion in the whole concept, caused by the 4 bisulfite strands rather than the standard 2 genomic strands. Maybe Primer 1 was the one that binds in reverse orientation (to the top strand), and the people who designed the experiment therefore assumed it was the bottom strand? (which would be true, if this was a standard genomic experiment.

I would suggest to take a pen and piece of paper and draw it out in front of you (top+bottom genomic strands, what the four strands look like after bisulfite conversion, and then see what the oligoes actually amplify. This should then get (fairly) obvious (fairly) soon. I find it confusing myself....

FelixKrueger commented 4 years ago

I assume that this will be all for the moment, it looks more like an experimental design decision issue than anything on the alignment side.

PoGibas commented 4 years ago

I still can't confirm this, however it's very likely to be true.

FelixKrueger commented 4 years ago

I'll be happy to help you look into this again if you find that something appears to look fishy. All the best, Felix

FelixKrueger / Bismark

Strangely not getting any mapping for the bottom strand #292