Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

change project description to reflect edlib only aligns bytes #123

Closed bertsky closed 5 years ago

bertsky commented 5 years ago

Hi @Martinsos,

I have run into the same problem as #79, #89, #104, #109, #114. That is: if the inputs contain non-ASCII characters, then alignment (as represented by the CIGAR string) will be wrong.

The same goes for additionalEqualities, which merely takes the first byte and ignores the rest of the character.

Of course I could decode my strings into byte sequences and pass those instead. But this really is not the same problem – I do need minimal character/codepoint alignments (and character equalities cannot be expressed with isolated bytes at all). And it is definitely not what the module is announced to be doing.

All issues mentioned above have been closed already, but as long as the code has not been generalised to work on arbitrary characters (as you said you will), this is either an open bug or a misrepresentation of the module's capabilities.

Please change the (Python) module description (and docstring) and README.md to reflect that, or others will keep running into this as well.

Otherwise: great project, thanks!

Martinsos commented 5 years ago

Hi @bertsky, you are right, this is my mistake! Since Edlib started as C++ library, where it was more clear that one character should be one byte, and python binding came later, I did not realize that in Python version it is not clear that characters has to be one byte and that people will try to use it that way.

I did following:

  1. I added note to C++ documentation (although there it should be clear, but still) to emphasize that character == one chart/byte, not multiple.
  2. @jbaiter created a PR for python version that automatically does mapping of input alphabet to space from 0 to 255 (if needed), therefore solving the problem for alphabets of size <= 256 characters. If size is bigger, error is thrown. I merged it and published new version of python binding.
  3. I added NOTE to python version README, to explain that alphabet has to be <= 256 characters.

While this means there is still no support for multi-byte sequences of characters in C++ version, we do now have solution from @jbaiter that takes care of such sequences in Python (as long as total number of unique values is <= 256), which should take care of most cases (and it not we throw error, so that is also good).

I believe this should be enough! Also, if you wish to contribute by adding support for multi-byte sequences in C++, pls do let me know, I can give some guidelines.

I will close this issue for now, but please let me know if anything is wrong and we can continue discussion / reopen issue.

bertsky commented 5 years ago

@Martinsos, I fully understand – this is probably the most natural gotcha when providing Python bindings for char* functions. (I am afraid I don't have time to help with a wstring based solution myself, sorry!) I really appreciate your addressing this. I am fine with closing now, although I do believe the warning could still be made more prominent in bindings/python/README.rst and setup.py by a small change in the description, e.g.:

Lightweight, super fast library for (byte) sequence alignment using edit (Levenshtein) distance.

instead of

Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.

I am not so convinced by the approach of automatic alphabet mapping in the Python version, though. Having an implicit restriction on the alphabet size, however large it may be (and 256 is not particularly large for practical purposes in many scripts, including Latin-based ones), only hides the problem for a while: Instead of forcing me to choose another library right away, it has me implement something based on edlib, and wait for the error to appear when I begin testing (if I am lucky). I would rather recommend making the Python version overtly based on bytes instead of str. (But I am not sure if Python 2 compatibility would be possible then.)