Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

add checker for empty sequence #137

Closed y9c closed 4 years ago

y9c commented 4 years ago

fix bug when query sequence is empty.

y9c commented 4 years ago
>>> edlib.align("", "test")
>>> edlib.align("bbb", "", "NW")
{'editDistance': 3, 'alphabetLength': 1, 'locations': [(None, -1)], 'cigar': None}

>>> edlib.align("bbb", "", "HW")
{'editDistance': -720667428, 'alphabetLength': 1, 'locations': [(None, -3), (None, -2), (None, -1)], 'cigar': None}

>>> edlib.align("bbb", "", "SHW")
{'editDistance': -61, 'alphabetLength': 1, 'locations': [(None, -2), (None, -1)], 'cigar': None}
Martinsos commented 4 years ago

Hi @yech1990, thanks for your contribution :)!

While you fixed the problem in Python for NW mode, modes HW and SHW need special handling, and finally problem is actually coming from C++ code, so ideally it would be fixed there.

I already started working on this fix some time ago but then put it aside and forgot about it. I hope you don't mind that, inspired by your PR, I took on this old fix-in-progress of mine and finalized it: 4ecfd34cb51b9aeef36815372fe51f543ae2903f . It fixes problem for all 3 modes (NW, SHW, HW) directly in C++, therefore also fixing it for Python. I added some tests for this, and I also published both new version of Python package and new version of C++ library.

Thanks for help with this!

y9c commented 4 years ago

Hi @Martinsos, thank your for the fix. It works perfectly now.

I realized that a check for input mode is also added into the C++ lib now, but the python module is not updated correspondingly. (It's very low priority)

>>> edlib.align("", "aabbcc", mode="other")
{'editDistance': 6, 'alphabetLength': 3, 'locations': [(None, 5)], 'cigar': None}
Martinsos commented 4 years ago

True @yech1990, we don't check if mode as given correct value! That is a good suggestion - would you mind implementing a short PR for this? Also add one test for it, as I did in my PRs (it is terrible way to define tests I know, all manual, but it was quick solution).