PeteHaitch / methtuple

methtuple is a methylation caller for methylation events that co-occur on the same DNA fragment from high-throughput bisulfite sequencing data, such as methylC-seq.
MIT License
12 stars 3 forks source link

`comethylation` may not work with soft-clipped reads or reads containing indels #45

Closed PeteHaitch closed 10 years ago

PeteHaitch commented 10 years ago

comethylation. uses read.alen to infer read length. This may give odd results for soft-clipped reads or reads containing indels. I need to investigate which of the length methods is most appropriate: alen, inferred_length, qlen or rlen.

To be on the safe side, the initial public release of comethylation.py will simply skip reads containing indels.

PeteHaitch commented 10 years ago

Bismark does not allow soft-clipping of reads.

Bismark with Bowtie1 Bowtie1 does not do soft-clipping.

Bismark with Bowtie2 While Bowtie2 can do soft-clipping, Bismark does not permit Bowtie2 to soft-clip reads, i.e. it runs Bowtie2 with the --end-to-end flag and not the --local flag. From the Bismark (v10.1) manual:

Bismark limits Bowtie2 to only perform end-to-end alignments, i.e. searches for alignments involving all read characters (also called untrimmed or unclipped alignments).

PeteHaitch commented 10 years ago

This is a (non-bisulfite) read aligned with bwa mem that is soft-clipped.

DJTPB5M1:306:D278CACXX:4:1311:16397:99259       147     chr1    3002039 60      90M10S  =       3001933 -196    TAATTTTGTTAAAGATATTTGCTGGTCCTTTAAGTTGAAAATCTTCATTCTCACCTACTCTTGTTGTCCATATGTTTGGGCTTTTCATTGTGTCCTAGAT    BB@ABBCCBAA@CC@@BBBACCABAACC??>?A@AAB?AA?ACBBC>ABAABACB>@A@BBAA@A@ACC?@?A@@@ABBCC@A@AC?AAAAA@CB>A>=?    BD:Z:LMMEIQKPPNGPLPMMNFNOROOOLNPNEMKMLLMMKDDLMJMLJNLLJIJGKNNMKIJMLGLMGKMNNKKMGLDMMIOQMDDLJNLNHHHMORRQKOLL       PG:Z:MarkDuplicates.8   RG:Z:277_D278CACXX_GTGAAA_L004  BI:Z:RNRJNUTSTUNTSTUTTKSRTRUPQSQQJQRRQPRSPKKNOPQPOQRPPPOOSQPRPPPQRPORPPROPSQRPPIQSNQRQIIONPQRQOQQSQPSQRPP       NM:i:0  MQ:i:60 AS:i:90 XS:i:22

The last 10 cycles of the read have been soft-clipped:

>>> read.cigarstring
'90M10S'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen. read.alen

>>> read.alen
90

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
90

read.rlen

>>> read.rlen
100
PeteHaitch commented 10 years ago

This is a (non-bisulfite) read aligned with bwa mem that contains an insertion.

DJTPB5M1:306:D278CACXX:3:2209:2706:73074        147     chr1    3089005 0       70M1I29M        =       3088915 -189    GTCCCCTATCTGATTTAGGATAGGTAAAGATCCTTTCCCAATCTGTTGNTGGTCTTTTTGTCTTATTGACAAGTGTCTTTTGCCTTACAGAAGCTTTGCA    DDDDEEAACECD@AA?BABBACB@?@@CB@ACC@@ACCC??AAABAA?#AB@AB@@@@A@AC@@@@AB@C?A@A@ABA@@ABCBA?@BBCABCC??@>??    BD:Z:LNJJTSOQMQPQMFOOPPOLNOONLENKOMMOMDLMINMLMJNGLLKKKNNKILCDDLGJILMKKLMNGMMLGGKJLCDLNOOMMMGNKLNQRPGPOPLL       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:QSNNUTUSSTUTTLTTSRSTSSPRSKSQRPSQQJPSNPQNPPQPPRPPPSNQOPIIIQPQPQQQQQRROPQPNOPPQIIQPQPPQQOQQPRTSQJSQRPP       NM:i:4  MQ:i:0  AS:i:80 XS:i:90

There is a one base insertion at cycle 71 in the read.

>>> read.cigarstring
'70M1I29M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen. read.alen

>>> read.alen
99

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
100

read.rlen

>>> read.rlen
100
PeteHaitch commented 10 years ago

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion and soft-clipped cycles.

DJTPB5M1:306:D278CACXX:3:2307:21152:57403       83      chr1    3006697 60      5S60M1D35M      =       3006563 -230    GAAATTGGAAGGGCCGGGAGGGAGGAAAAGAAAAAAATTATGCAATTATATTTTAATATCCAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAA    ####################################################################?A?A@BAA???A@ABBCA?@@;BD@@BBB?@@    BD:Z:LLLLPPOOONNMMMMMLLLLLLLLLLLKKKKKKKKKKKKKKKKJKKKKKKKKJJJJJJJJJJJJKKKKDDDDDDDCCDDDDMJJLNLMMOPNOQRNOLLL       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:PPPPTSSSSSRRRRRRQQQQQQQPPPPPPOPPPPPOPPPPOPPOPPPPPPPPPOPOOOOOOOOOOOOOJJJJJJJJJJJJJQPOPQQQROQPNTUPQPPP       NM:i:8  MQ:i:60 AS:i:53 XS:i:32

The first 5 cycles are soft-clipped and there is a deletion at cycle 66 of the read.

>>> read.cigarstring
'5S60M1D35M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen. read.alen

>>> read.alen
96

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
95

read.rlen

>>> read.rlen
100
PeteHaitch commented 10 years ago

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion.

DJTPB5M1:306:D278CACXX:5:2216:11367:15711       83      chr1    3006733 70      24M1D76M        =       3006639 -195    GAAATTATATTGTAATACCAAAAAAAAAAAAAAAAAAAAAAAGAGTAACGGAATGGGAAAAAAAAAGAAAGAAAAAAGAAAAAGAAAAGAAAAAGAAAAA    ####################AB=;088BBBBBBBBBBBAAACCCB?BA:BB??BBBA????????BC??AB?????AB????AB???BCAAA@BB@@>@A    OC:Z:17M1D83M   BD:Z:LLLLPPOONNNNMMMMLLLLEEEEEEDDDDDDDDDDDDDCMJJLMLLKMNKLLMHMJCCCCCCDMJKDMJKDDDDLIJDDDMJKDENKMFFGPNOIEELL       PG:Z:MarkDuplicates.7   RG:Z:277_D278CACXX_GTGAAA_L005  BI:Z:PPPPTSSSSSSRRRQQQQQQLLLKKKKKKKJKKKKJKKKKRQPQQRQROPPMSSNPOJJJJJJJQPOJQPOJJJJQPOJJJQPOJJQPOKKKRRQLKKPP       NM:i:5  MQ:i:60 AS:i:81 XS:i:33

There is a deletion at cycle 25 of the read.

>>> read.cigarstring
'24M1D76M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen. read.alen

>>> read.alen
101

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
100

read.rlen

>>> read.rlen
100
PeteHaitch commented 10 years ago

This is a (non-bisulfite) read aligned with bwa mem that contains a deletion, an insertion and soft-clipped cycles.

DJTPB5M1:306:D278CACXX:3:1101:8598:39906        163     chr1    4995124 28      40S22M4I12M1D22M        =       4995234 152     CAATGAGATCCTTGCAGCCAGGAATCCCATTGCATTCTCCTGGAAAGAATGTTTGATGAAATTTAAAAAAAAAAAAATCTTGCCTTGCTGGGGATTGCAA    ?@=?A@CA@CBBACBABBBABCA@?BBBA??CBA??BABBACCA@@BA@?BAAAB@?BA@@????@@?@?B@BAAA?ACCA@DDBADDCEEEEBABBCDC    BD:Z:LLMLRMLLOOOOMMONNPONNOLKKNNINMLMONMLKJIMMNNMLCLILKNGLDMJMNJLDKLDKMDDDDDEEEEEELOKNNQQPOOQRQQLMQQQNPOM       PG:Z:MarkDuplicates.9   RG:Z:277_D278CACXX_GTGAAA_L003  BI:Z:PPRRQOPPPRQRRPQPPRQSPPRPRQPNSRMPRPRMOPPQRQOROIPOPQOPQKQOPPOPJRNKRRJJJJJJJKJKKSSRSRSSTTSTVSSQQVSRQRQR       NM:i:6  MQ:i:28 AS:i:34 XS:i:33

The first 40 cycles are soft-clipped, there is an insertion at cycle 63 and a deletion at cycle 79 of the read.

>>> read.cigarstring
'40S22M4I12M1D22M'

Here's a comparison of read.alen, read.inferred_length, read.qlen and read.rlen. read.alen

>>> read.alen
57

read.inferred_length (not available in PySam v < 0.7.6)

>>> read.inferred_length
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pysam.csamtools.AlignedRead' object has no attribute 'inferred_length'

read.qlen

>>> read.qlen
60

read.rlen

>>> read.rlen
100
PeteHaitch commented 10 years ago

Another point to check when using soft-clipped reads will be the definition of n_overlap, which is currently:

n_overlap = read_1.alen + read_2.alen - abs(read_1.tlen)
PeteHaitch commented 10 years ago

comethylation will only process a read if it contains M in the CIGAR string, i.e. even = and X are disallowed.

PeteHaitch commented 10 years ago

comethylation correctly handles reads with indels as of v1.3.0 and should also correctly handle soft-clipped reads. However, since Bismark does not allow soft-clipping the latter needs to be more thoroughly tested once I have BS-seq data aligned with software that allows soft-clips (such as bwa-meth).

PeteHaitch commented 10 years ago

Closing as comethylation now supports indels and, in principle, soft-clipped reads. As support of soft-clipped reads is only in principle I will open a new issue to address this.