ArtPoon / gotoh2

Lightweight and customizable Python/C extension for pairwise alignment of genetic sequences using the Gotoh algorithm
GNU Affero General Public License v3.0
5 stars 2 forks source link

Not working for sequences of about 100bp or more #6

Closed ArtPoon closed 7 years ago

ArtPoon commented 7 years ago

To reproduce:

from gotoh2.aligner import Aligner
g2 = Aligner()
g2.is_global = False
s1 = "TTTTTAGATGGGATAGATAAAGCTCAAGAAGAACATGAAAGATATCACAGCAATTGGAGAGCAATGGCTAGTGATTTTAATCTGCCACCTATAGTAGCAA"
s2 = "TTTTTGGATGGAATAGATAAGGCTCAAGAAGAACATGAGAAATATCACAACAATTGGAGAGCAATGGCTAGTGATTTTAACCTACCACCCGTGGTAGCAA"

We can align slightly shorter sequences:

>>> g2.align(s1[:90], s2[:90])
('TTTTTAG-ATGGGA-TAGATAAAG-CTCAAGAAGAACATGA-AAGATATCACA-GCAATTGGAGAGCAATGGCTAGTGATTTTAATC-T-GCCACC-T', 'TTTTT-GGATGG-AATAGATAA-GGCTCAAGAAGAACATGAGAA-ATATCACAA-CAATTGGAGAGCAATGGCTAGTGATTTTAA-CCTA-CCACCC-', 380)

but we can't do all 100:

>>> g2.align(s1, s2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gotoh2-0.1-py2.7-linux-x86_64.egg/gotoh2/aligner.py", line 73, in align
    self.matrix
RuntimeError: Traceback failed, try local alignment
ArtPoon commented 7 years ago

Note the above examples were run with default gap open penalty of 1, which is very permissive of gaps. Interesting: if I set g2.gap_open_penalty=5 then I get results for the full 100bp sequences:

>>> g2.gap_open_penalty = 5
>>> g2.align(s1[:90], s2[:90])
('TTTTTAGATGGGATAGATAAAGCTCAAGAAGAACATGA-AAGATATCACAGCAATTGGAGAGCAATGGCTAGTGATTTTAATCTGCCACCT', 'TTTTTGGATGGAATAGATAAGGCTCAAGAAGAACATGAGAA-ATATCACAACAATTGGAGAGCAATGGCTAGTGATTTTAACCTACCACCC', 370)
>>> g2.align(s1, s2)
('TTTTTAGATGGGATAGATAAAGCTCAAGAAGAACATGA-AAGATATCACAGCAATTGGAGAGCAATGGCTAGTGATTTTAATCTGCCACCTATAGTAGCAA', 'TTTTTGGATGGAATAGATAAGGCTCAAGAAGAACATGAGAA-ATATCACAACAATTGGAGAGCAATGGCTAGTGATTTTAACCTACCACCCGTGGTAGCAA', 402)
ArtPoon commented 7 years ago

Wrote a unit test that reproduces this problem. Also printing cost matrix to stdout. When gap_open_penalty is set to 3, we get no error and this matrix (last 10 rows and columns only):

84 48 48 48 48 48 48 48 48 48 18 
72 84 40 40 80 48 48 112 48 48 18 
72 40 84 48 40 80 48 48 48 50 18 
72 80 40 84 48 40 80 48 48 48 22 
72 72 72 72 84 48 40 80 48 48 18 
72 72 80 72 72 84 48 40 48 50 18 
72 88 72 80 72 72 84 48 48 48 22 
72 72 72 72 80 72 72 84 48 48 18 
72 72 72 72 72 72 72 72 4 18 18 
73 88 73 72 72 73 72 72 65 4 22 
65 68 65 69 65 65 69 65 65 69 4 

but if it is set to 2 then there is an error and this matrix:

84 48 48 48 48 48 48 48 48 48 18 
72 124 40 40 80 48 48 112 48 48 18 
72 40 84 48 40 80 48 48 48 50 18 
72 80 40 84 48 40 80 48 48 48 22 
72 72 104 72 84 48 40 80 48 48 18 
72 72 80 104 72 84 48 40 56 58 26 
72 88 72 80 104 72 84 48 48 48 22 
72 72 72 72 80 104 72 84 48 48 18 
72 72 72 72 72 72 72 72 4 18 18 
73 88 73 72 72 73 72 72 65 4 22 
65 68 65 69 65 65 69 65 65 69 4 

Note is_global is set to True. Edit: Whoops, these are bit matrices, not cost matrices

ArtPoon commented 7 years ago

Got some more debugging output, traceback fails at position i=83, j=82. Bit matrix from 79 to 89:

79 80 81 82 83 84
79 84 48 48 48 48 48
80 72 84 48 48 48 48
81 72 72 124 40 40 40
82 72 72 40 47 40 40
83 72 72 80 48 54 48
84 72 72 72 120 40 84

Cost matrix, same slice:

79 80 81 82 83 84
79 -339 -336 -335 -334 -333 -332
80 -336 -344 -341 -340 -339 -338
81 -335 -341 -340 -337 -336 -335
82 -334 -340 -337 -336 -335 -334
83 -333 -339 -345 -342 -341 -340
84 -332 -338 -342 -341 -338 -346
ArtPoon commented 7 years ago

48 is bit-string 110000, which corresponds to setting flags for e and f.