Closed PoGibas closed 4 years ago
Hi @PoGibas
I think the most likely explanation here is that you are trying to look at specific short regions of the genome that were sequenced using PCR amplification. In such amplicon experiments, a locus is typically targetted by looking at the top strand reference sequence, and designing primers to that locus. Alternatively, you might have used target enrichment before doing the bisulfite experiment, but the results would be conceptually the same.
Since genomic DNA loses its complementarity after once the DNA is bisulfite converted, you only ever PCR amplify the top strand and hence get only methylation information from the top strand. If you wanted to get information from the bottom strand as well you would need to use a second set of primers that will bind (exclusively) to the bottom strand once it is bisulfite converted.
The 'bottom' strand sequences you appear to be tracking in your FastQ file do not really originate from the bottom strand. Instead, these sequences are complementary to the original top strand (OT
) (what we call CTOT
, complementary to OT
). While such CTOT
sequences technically align in reverse orientation to a C>T converted version of the top strand, they carry methylation information derived from the top strand as well. Please see slide 14 in this presentation, which I hope will clear things up a little:
https://www.bioinformatics.babraham.ac.uk/training/Methylation_Course/BS-Seq%20theory%20and%20QC%20lecture.pdf
Regarding your test case: The sequence you linked does not align at under normal conditions since the query sequence is 1 bp longer than the reference sequence (and such scenarios are not compatible with end-to-end alignments).
reference
>my_seq
AAAAGTCACTGCTGGTGGGAGAGGCTGAGTGCACAGAGCCAAGCACACACGTTAACAAGAGCCGACAGCGTCAAAATCGCAAGAACAGAGGAAAGCCACTGACgactgaggagatggctcagttcatagagtgcttgccatacagagaggaggacccgagctcaaatgccTGAGGTGTACTGGCCAGCCAGTGACAATCCTGTCTCAAACAGTAAAGT
@CDMLM:01424:12439
CCAATACACCTCAAACATTTAAACTCGAATCCTCCTCTCTATATAACAAACACTCTAATAAACTAAAACCATCTCCTCAATCGTCAATAAACTTTCCCTCTATTCTTACGATTTTAACGCTATCGACTCTTATTAACGTATATACTTAACTCTATACACTCAACCTCTCCCACCAACAATAACTTTTC
+
:7<7<<<<;7?776-485<.77/69;;;6;<6;;166999;:99>498>277556.-*-44.34445-72766673766/597744/422/111,00,0422436616242555)-*-553698778:::4:;5:5=:::<999;;7=7<<<;:9;;;;;;>1717;;;<3;;5;7::5:<5=;==0/
Using the additional option --local
in this case would simply soft-clip the additional C
of the read, and produce this alignment:
@HD VN:1.0 SO:unsorted
@SQ SN:my_seq LN:218
@PG ID:Bismark VN:v0.22.1 CL:"bismark --genome genome/ test_read.fq --non_d --local"
CDMLM:01424:12439 0 my_seq 1 0 1S89M1I6M1I22M1I9M1I57M * 0 0 GAAAAGTTATTGTTGGTGGGAGAGGTTGAGTGTATAGAGTTAAGTATATACGTTAATAAGAGTCGATAGCGTTAAAATCGTAAGAATAGAGGGAAAGTTTATTGACGATTGAGGAGATGGTTTTAGTTTATTAGAGTGTTTGTTATATAGAGAGGAGGATTCGAGTTTAAATGTTTGAGGTGTATTGG /0==;=5<:5::7;5;;3<;;;7171>;;;;;;9:;<<<7=7;;999<:::=5:5;:4:::877896355-*-)5552426166342240,00,111/224/447795/66737666727-54443.44-*-.655772>894>99:;999661;;6<;6;;;96/77.<584-677?7;<<<<7<7: NM:i:40 MD:Z:6C1C2C12C6C1C4C0C3C1C1C7C5C3C5C7C5C9C0C1C6C11C1C4C8C3C0C3C11C0C4C1C5C0C9C3 XM:Z:.......h.x..x............x......h.x....hh...h.h.h.Z.....h.....xZ..x..Z..h.....Z.h.....x...........hh.x...Z..x............h.x....u.........h...hh...x...........hxZ...h.h.....hx.........x... XR:Z:GA XG:Z:CT
And indeed, this sequence aligns in reverse orientation to the converted top strand, so the sequence is of strand CTOT
, and thus informative for the top strand.
CT/CT: 0 ((converted) top strand)
CT/GA: 0 ((converted) bottom strand)
GA/CT: 1 (complementary to (converted) top strand)
GA/GA: 0 (complementary to (converted) bottom strand)
Does this make sense?
Thanks Felix!
That absolutely make sense. Issue is that this amplification should have had reads only from the bottom strand (according to the wet-lab people). With --local
I'm getting 91% mapping efficiency and again only to the top strand.
I guess that problem should be at the wet-lab side then? I understand that this is quite a statement, but programmatically I can't find what I should be getting.
I am sure it is some kind of confusion in the whole concept, caused by the 4 bisulfite strands rather than the standard 2 genomic strands. Maybe Primer 1 was the one that binds in reverse orientation (to the top strand), and the people who designed the experiment therefore assumed it was the bottom strand? (which would be true, if this was a standard genomic experiment.
I would suggest to take a pen and piece of paper and draw it out in front of you (top+bottom genomic strands, what the four strands look like after bisulfite conversion, and then see what the oligoes actually amplify. This should then get (fairly) obvious (fairly) soon. I find it confusing myself....
I assume that this will be all for the moment, it looks more like an experimental design decision issue than anything on the alignment side.
I still can't confirm this, however it's very likely to be true.
I'll be happy to help you look into this again if you find that something appears to look fishy. All the best, Felix
I have a strange problem - mapping for the bottom strand is not reported even though I can "see" bottom strand reads in the
fastq
file.Background Got reads generated by Ion Torrent PGM, they came from a small region (added as
seq.fasta
). Tried to map them using default parameters or with--non_directional
, however I'm not getting any mapping for the bottom strand (added Bismark reports):As we can see there were no reads for the bottom strand. I manually tried to track down reads that should come from the bottom strand (looked into
--unmapped
result). Bellow I'm attachingfasta
and one unmapped read:As we can see genomic sequence starts with AAAAGTCAC and read ends with ATAACTTTT and it should come from the bottom strand.
What could be the issue behind this?
Files: Added
seq.fasta
andreads.fastq
. Commands:Bismark Version: v0.22.3
Archive.zip