Closed PeteHaitch closed 10 years ago
Bismark does not allow soft-clipping of reads.
Bismark with Bowtie1 Bowtie1 does not do soft-clipping.
Bismark with Bowtie2
While Bowtie2 can do soft-clipping, Bismark does not permit Bowtie2 to soft-clip reads, i.e. it runs Bowtie2 with the --end-to-end
flag and not the --local
flag.
From the Bismark (v10.1) manual:
Bismark limits Bowtie2 to only perform end-to-end alignments, i.e. searches for alignments involving all read characters (also called untrimmed or unclipped alignments).
This is a (non-bisulfite) read aligned with bwa mem
that is soft-clipped.
DJTPB5M1:306:D278CACXX:4:1311:16397:99259 147 chr1 3002039 60 90M10S = 3001933 -196 TAATTTTGTTAAAGATATTTGCTGGTCCTTTAAGTTGAAAATCTTCATTCTCACCTACTCTTGTTGTCCATATGTTTGGGCTTTTCATTGTGTCCTAGAT BB@ABBCCBAA@CC@@BBBACCABAACC??>?A@AAB?AA?ACBBC>ABAABACB>@A@BBAA@A@ACC?@?A@@@ABBCC@A@AC?AAAAA@CB>A>=? BD:Z:LMMEIQKPPNGPLPMMNFNOROOOLNPNEMKMLLMMKDDLMJMLJNLLJIJGKNNMKIJMLGLMGKMNNKKMGLDMMIOQMDDLJNLNHHHMORRQKOLL PG:Z:MarkDuplicates.8 RG:Z:277_D278CACXX_GTGAAA_L004 BI:Z:RNRJNUTSTUNTSTUTTKSRTRUPQSQQJQRRQPRSPKKNOPQPOQRPPPOOSQPRPPPQRPORPPROPSQRPPIQSNQRQIIONPQRQOQQSQPSQRPP NM:i:0 MQ:i:60 AS:i:90 XS:i:22
The last 10 cycles of the read have been soft-clipped:
>>> read.cigarstring
'90M10S'
Here's a comparison of read.alen
, read.inferred_length
, read.qlen
and read.rlen
.
read.alen
>>> read.alen
90
read.inferred_length (not available in PySam v < 0.7.6)
>>> read.inferred_length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
read.qlen
>>> read.qlen
90
read.rlen
>>> read.rlen
100
This is a (non-bisulfite) read aligned with bwa mem
that contains an insertion.
DJTPB5M1:306:D278CACXX:3:2209:2706:73074 147 chr1 3089005 0 70M1I29M = 3088915 -189 GTCCCCTATCTGATTTAGGATAGGTAAAGATCCTTTCCCAATCTGTTGNTGGTCTTTTTGTCTTATTGACAAGTGTCTTTTGCCTTACAGAAGCTTTGCA DDDDEEAACECD@AA?BABBACB@?@@CB@ACC@@ACCC??AAABAA?#AB@AB@@@@A@AC@@@@AB@C?A@A@ABA@@ABCBA?@BBCABCC??@>?? BD:Z:LNJJTSOQMQPQMFOOPPOLNOONLENKOMMOMDLMINMLMJNGLLKKKNNKILCDDLGJILMKKLMNGMMLGGKJLCDLNOOMMMGNKLNQRPGPOPLL PG:Z:MarkDuplicates.9 RG:Z:277_D278CACXX_GTGAAA_L003 BI:Z:QSNNUTUSSTUTTLTTSRSTSSPRSKSQRPSQQJPSNPQNPPQPPRPPPSNQOPIIIQPQPQQQQQRROPQPNOPPQIIQPQPPQQOQQPRTSQJSQRPP NM:i:4 MQ:i:0 AS:i:80 XS:i:90
There is a one base insertion at cycle 71 in the read.
>>> read.cigarstring
'70M1I29M'
Here's a comparison of read.alen
, read.inferred_length
, read.qlen
and read.rlen
.
read.alen
>>> read.alen
99
read.inferred_length (not available in PySam v < 0.7.6)
>>> read.inferred_length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
read.qlen
>>> read.qlen
100
read.rlen
>>> read.rlen
100
This is a (non-bisulfite) read aligned with bwa mem
that contains a deletion and soft-clipped cycles.
DJTPB5M1:306:D278CACXX:3:2307:21152:57403 83 chr1 3006697 60 5S60M1D35M = 3006563 -230 GAAATTGGAAGGGCCGGGAGGGAGGAAAAGAAAAAAATTATGCAATTATATTTTAATATCCAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAA ####################################################################?A?A@BAA???A@ABBCA?@@;BD@@BBB?@@ BD:Z:LLLLPPOOONNMMMMMLLLLLLLLLLLKKKKKKKKKKKKKKKKJKKKKKKKKJJJJJJJJJJJJKKKKDDDDDDDCCDDDDMJJLNLMMOPNOQRNOLLL PG:Z:MarkDuplicates.9 RG:Z:277_D278CACXX_GTGAAA_L003 BI:Z:PPPPTSSSSSRRRRRRQQQQQQQPPPPPPOPPPPPOPPPPOPPOPPPPPPPPPOPOOOOOOOOOOOOOJJJJJJJJJJJJJQPOPQQQROQPNTUPQPPP NM:i:8 MQ:i:60 AS:i:53 XS:i:32
The first 5 cycles are soft-clipped and there is a deletion at cycle 66 of the read.
>>> read.cigarstring
'5S60M1D35M'
Here's a comparison of read.alen
, read.inferred_length
, read.qlen
and read.rlen
.
read.alen
>>> read.alen
96
read.inferred_length (not available in PySam v < 0.7.6)
>>> read.inferred_length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
read.qlen
>>> read.qlen
95
read.rlen
>>> read.rlen
100
This is a (non-bisulfite) read aligned with bwa mem
that contains a deletion.
DJTPB5M1:306:D278CACXX:5:2216:11367:15711 83 chr1 3006733 70 24M1D76M = 3006639 -195 GAAATTATATTGTAATACCAAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAAAAAAAAAGAAAGAAAAAAGAAAAAGAAAAGAAAAAGAAAAA ####################AB=;088BBBBBBBBBBBAAACCCB?BA:BB??BBBA????????BC??AB?????AB????AB???BCAAA@BB@@>@A OC:Z:17M1D83M BD:Z:LLLLPPOONNNNMMMMLLLLEEEEEEDDDDDDDDDDDDDCMJJLMLLKMNKLLMHMJCCCCCCDMJKDMJKDDDDLIJDDDMJKDENKMFFGPNOIEELL PG:Z:MarkDuplicates.7 RG:Z:277_D278CACXX_GTGAAA_L005 BI:Z:PPPPTSSSSSSRRRQQQQQQLLLKKKKKKKJKKKKJKKKKRQPQQRQROPPMSSNPOJJJJJJJQPOJQPOJJJJQPOJJJQPOJJQPOKKKRRQLKKPP NM:i:5 MQ:i:60 AS:i:81 XS:i:33
There is a deletion at cycle 25 of the read.
>>> read.cigarstring
'24M1D76M'
Here's a comparison of read.alen
, read.inferred_length
, read.qlen
and read.rlen
.
read.alen
>>> read.alen
101
read.inferred_length (not available in PySam v < 0.7.6)
>>> read.inferred_length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
read.qlen
>>> read.qlen
100
read.rlen
>>> read.rlen
100
This is a (non-bisulfite) read aligned with bwa mem
that contains a deletion, an insertion and soft-clipped cycles.
DJTPB5M1:306:D278CACXX:3:1101:8598:39906 163 chr1 4995124 28 40S22M4I12M1D22M = 4995234 152 CAATGAGATCCTTGCAGCCAGGAATCCCATTGCATTCTCCTGGAAAGAATGTTTGATGAAATTTAAAAAAAAAAAAATCTTGCCTTGCTGGGGATTGCAA ?@=?A@CA@CBBACBABBBABCA@?BBBA??CBA??BABBACCA@@BA@?BAAAB@?BA@@????@@?@?B@BAAA?ACCA@DDBADDCEEEEBABBCDC BD:Z:LLMLRMLLOOOOMMONNPONNOLKKNNINMLMONMLKJIMMNNMLCLILKNGLDMJMNJLDKLDKMDDDDDEEEEEELOKNNQQPOOQRQQLMQQQNPOM PG:Z:MarkDuplicates.9 RG:Z:277_D278CACXX_GTGAAA_L003 BI:Z:PPRRQOPPPRQRRPQPPRQSPPRPRQPNSRMPRPRMOPPQRQOROIPOPQOPQKQOPPOPJRNKRRJJJJJJJKJKKSSRSRSSTTSTVSSQQVSRQRQR NM:i:6 MQ:i:28 AS:i:34 XS:i:33
The first 40 cycles are soft-clipped, there is an insertion at cycle 63 and a deletion at cycle 79 of the read.
>>> read.cigarstring
'40S22M4I12M1D22M'
Here's a comparison of read.alen
, read.inferred_length
, read.qlen
and read.rlen
.
read.alen
>>> read.alen
57
read.inferred_length (not available in PySam v < 0.7.6)
>>> read.inferred_length
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'
read.qlen
>>> read.qlen
60
read.rlen
>>> read.rlen
100
Another point to check when using soft-clipped reads will be the definition of n_overlap
, which is currently:
n_overlap = read_1.alen + read_2.alen - abs(read_1.tlen)
comethylation
will only process a read if it contains M
in the CIGAR
string, i.e. even =
and X
are disallowed.
comethylation
correctly handles reads with indels as of v1.3.0
and should also correctly handle soft-clipped reads. However, since Bismark does not allow soft-clipping the latter needs to be more thoroughly tested once I have BS-seq data aligned with software that allows soft-clips (such as bwa-meth
).
Closing as comethylation
now supports indels and, in principle, soft-clipped reads. As support of soft-clipped reads is only in principle I will open a new issue to address this.
comethylation.
usesread.alen
to infer read length. This may give odd results for soft-clipped reads or reads containing indels. I need to investigate which of the length methods is most appropriate:alen
,inferred_length
,qlen
orrlen
.To be on the safe side, the initial public release of
comethylation.py
will simply skip reads containing indels.