Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
506 stars 165 forks source link

Edlib (sometimes) fails when none the characters of one seq1 are present in seq2 #103

Closed iprada closed 6 years ago

iprada commented 6 years ago

Hi, I have a error in edlib that occurs when you try to align sequences that have no common character beween them I have distill the problem to a minimal code example:

import edlib

seq1 = 'GTGGAGCGCGCCGCCACGGACCACGGGCGGGCTGGCGGGCGAGCGGCGAGCGCGCGGCGATCCGAGCCCCTAGGGCGGATCCCGGCTCCAGGCCCGCGCG'

seq2 = 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

print(edlib.align(seq1, seq2, mode='HW', task='path'))

where seq1 is a normal DNA sequence and seq2 is a sequence full of 'N' letters, as it occurs sometimes in Next Generation Sequencing reads or in the genome assemblies.

When I execute this code, sometimes it works perfectly with the expected output:

{'editDistance': 100, 'alphabetLength': 5, 'locations': [(0, -1), (0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16), (0, 17), (0, 18), (0, 19), (0, 20), (0, 21), (0, 22), (0, 23), (0, 24), (0, 25), (0, 26), (0, 27), (0, 28), (0, 29), (0, 30), (0, 31), (0, 32), (0, 33), (0, 34), (0, 35), (0, 36), (0, 37), (0, 38), (0, 39), (0, 40), (0, 41), (0, 42), (0, 43), (0, 44), (0, 45), (0, 46), (0, 47), (0, 48), (0, 49), (0, 50), (0, 51)], 'cigar': '100I'}

and sometimes the program finishes with the following exit code:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Martinsos commented 6 years ago

Hi @iprada! Thank you for reporting this, it was indeed a (rare) bug. I released updated version of both C++ and Python binding (v1.2.3), check it out, it should work correctly. Please test it and if everything is fine close the issue.

The problem was when one of optimal solutions was having query completely in front of target, like this:

   CTG
NNN

because that makes end location = -1, which confused the algorithm. I put an exception for that, so algorithm knows how to behave in that case and everything works, however it is still a big question if having end location of -1 does make sense, what should the start location be in that case and how to handle such situations in general. I will open another issue for that to figure out and then possibly redefine and reimplement start and end locations in such corner cases.

Martinsos commented 6 years ago

Fixed!