Open davidchampredon opened 4 years ago
Thanks for reporting this @davidchampredon I've reproduced the issue.
Score is consistent with the correct alignment
T A C G T A
T A C - T A
+5 +5 +5 -11 +5 +5 = 14
It's issue #16 - I knew I'd pay for that cheap fix some day.. Commenting out these lines: https://github.com/ArtPoon/gotoh2/blob/348f7f790a62a3aee68c1188a1d83a07706f5c2b/src/_gotoh2.c#L121-L126
yields the correct alignment, but breaks two other unit tests.
It wasn't those lines - the problem was that local alignment requires one to zero out positive costs in the R matrix.
Crap. Traceback failed while I was validating these changes on SARS-COV-2 sequences:
Process 0 of 1 running hCoV19/pangolin/Guangxi/P4L/2017|EPI_ISL_410538|2017
traceback failed: i=1 j=1 bit=0
Traceback (most recent call last):
File "scripts/first-align.py", line 135, in <module>
main()
File "scripts/first-align.py", line 119, in main
trim_seq, inserts = pair_align(refseq, s, threshold=args.threshold)
File "scripts/first-align.py", line 66, in pair_align
aref, aquery, ascore = aligner.align(ref, query)
File "/usr/local/lib/python3.6/dist-packages/gotoh2-0.1-py3.6-linux-x86_64.egg/gotoh2.py", line 99, in align
self.matrix
RuntimeError: Traceback failed, try local alignment
Temporary patch:
revert these lines of code: https://github.com/ArtPoon/gotoh2/blob/65a29a5060d55140c40eb6410a857a842e9eb21f/src/_gotoh2.c#L190-L192
SARS-COV-2 alignments looked fine but let's switch to another alignment program and write a wrapper for gotoh2 to use that program by default (C program optional)
time permitting, diagnose traceback issue...
Cost matrix:
* A C T A
* 0 0 0 0 0
A 0 -5 0 0 -5
C 0 0 -10 0 0
G 0 0 0 -6 0
T 0 0 0 -4 -2
A 0 -5 0 0 -9
Bit matrix before edge assignment:
* A C T A
* 80 80 80 80 16 4
A 80 84 44 44 20 4
C 80 44 84 50 10 4
G 80 44 73 84 14 4
T 80 44 105 76 28 4
A 64 68 33 37 4 4
4 4 4 4 4 4
Bit matrix after edge assignment:
* A C T A
* 0 80 80 0 0 4
A 80 4 8 32 4 4
C 80 32 4 2 34 4
G 80 40 1 4 6 4
T 0 8 41 21 4 4
A 0 4 9 13 4 4
4 4 4 4 4 4
I think that 21
in the last matrix is problematic - encoding 16, 4 and 1 (e, c, a), where a
indicates a vertical edge is on the optimal path.