ArtPoon / gotoh2

Lightweight and customizable Python/C extension for pairwise alignment of genetic sequences using the Gotoh algorithm
GNU Affero General Public License v3.0
5 stars 2 forks source link

Failing simple alignment with gop=10 (default) but not with gop=8 #22

Open davidchampredon opened 4 years ago

davidchampredon commented 4 years ago
from gotoh2 import Aligner, map_coordinates

g1 = Aligner(gop=10)  # default gop=10
g2 = Aligner(gop=8)   

ref   = 'TACGTA'
query = 'TACTA'  # G removed
a1 = g1.align(ref, query)
a2 = g2.align(ref, query)
print(a1)
print(a2)

('TACGTA', 'TACT-A', 14) ('TACGTA', 'TAC-TA', 16)

ArtPoon commented 4 years ago

Thanks for reporting this @davidchampredon I've reproduced the issue.

ArtPoon commented 4 years ago

Score is consistent with the correct alignment

 T  A  C   G  T  A
 T  A  C   -  T  A 
+5 +5 +5 -11 +5 +5 = 14
ArtPoon commented 4 years ago

It's issue #16 - I knew I'd pay for that cheap fix some day.. Commenting out these lines: https://github.com/ArtPoon/gotoh2/blob/348f7f790a62a3aee68c1188a1d83a07706f5c2b/src/_gotoh2.c#L121-L126

yields the correct alignment, but breaks two other unit tests.

ArtPoon commented 4 years ago

It wasn't those lines - the problem was that local alignment requires one to zero out positive costs in the R matrix.

ArtPoon commented 4 years ago

Crap. Traceback failed while I was validating these changes on SARS-COV-2 sequences:

Process 0 of 1 running hCoV19/pangolin/Guangxi/P4L/2017|EPI_ISL_410538|2017
traceback failed: i=1 j=1 bit=0
Traceback (most recent call last):
  File "scripts/first-align.py", line 135, in <module>
    main()
  File "scripts/first-align.py", line 119, in main
    trim_seq, inserts = pair_align(refseq, s, threshold=args.threshold)
  File "scripts/first-align.py", line 66, in pair_align
    aref, aquery, ascore = aligner.align(ref, query)
  File "/usr/local/lib/python3.6/dist-packages/gotoh2-0.1-py3.6-linux-x86_64.egg/gotoh2.py", line 99, in align
    self.matrix
RuntimeError: Traceback failed, try local alignment
ArtPoon commented 4 years ago

Temporary patch:

ArtPoon commented 4 years ago

Cost matrix:

   *  A   C   T   A
*  0  0   0   0   0 
A  0 -5   0   0  -5 
C  0  0 -10   0   0 
G  0  0   0  -6   0 
T  0  0   0  -4  -2 
A  0 -5   0   0  -9 

Bit matrix before edge assignment:

  *  A   C  T  A
* 80 80  80 80 16  4 
A 80 84  44 44 20  4 
C 80 44  84 50 10  4 
G 80 44  73 84 14  4 
T 80 44 105 76 28  4 
A 64 68  33 37  4  4 
   4  4   4  4  4  4

Bit matrix after edge assignment:

   *  A  C  T  A
*  0 80 80  0  0  4  
A 80  4  8 32  4  4 
C 80 32  4  2 34  4 
G 80 40  1  4  6  4 
T  0  8 41 21  4  4 
A  0  4  9 13  4  4 
   4  4  4  4  4  4 
ArtPoon commented 4 years ago

I think that 21 in the last matrix is problematic - encoding 16, 4 and 1 (e, c, a), where a indicates a vertical edge is on the optimal path.