Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Unicode support #140

Closed: jaroslavgratz closed this issue 4 years ago

jaroslavgratz commented 4 years ago

Edlib works with ASCII characters only, but many languages use extended character sets. Do you plan to support Unicode (wide) characters as well?

Please note an incorrect result when non-ASCII characters are used:

>>> import edlib
>>> edlib.align("á", "é")
{'editDistance': 0, 'alphabetLength': 1, 'locations': [(None, 0)], 'cigar': None}

Martinsos commented 4 years ago

Ah yes, this is a feature that has been requested multiple times so far! Sorry for responding so late, I blame the holidays :P. Edlib consists of two parts: a C/C++ core and a Python package wrapping it. The C/C++ core does not support anything beyond char for now, but the Python package does support arbitrary sequences of hashable objects, as long as the alphabet has fewer than 256 distinct characters. Real support for generic sequences (meaning one element of a sequence can be anything, not just char) is actually on its way for the C/C++ core as we speak, so once that is done (a month or two, not sure), Edlib will have true support for use cases like this.
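To illustrate why a limit of roughly 256 distinct symbols makes arbitrary hashable elements workable on top of a char-only core, here is a minimal sketch. It is not edlib's actual code: `remap_to_bytes` and `levenshtein` are hypothetical helpers written for this example, and the distance is computed with a plain dynamic-programming loop rather than edlib's algorithm. The idea is to give every distinct element a one-byte code and then compare the resulting byte strings. One byte has 256 possible values, so the sketch accepts up to 256 symbols; the exact limit in edlib itself may differ.

```python
from typing import Hashable, Sequence, Tuple


def remap_to_bytes(query: Sequence[Hashable],
                   target: Sequence[Hashable]) -> Tuple[bytes, bytes, int]:
    """Assign each distinct element of both sequences a one-byte code.

    Hypothetical helper for illustration; not part of edlib.
    """
    alphabet = {}
    for symbol in list(query) + list(target):
        if symbol not in alphabet:
            alphabet[symbol] = len(alphabet)
    if len(alphabet) > 256:
        raise ValueError("more than 256 distinct symbols do not fit into one byte each")

    def encode(seq):
        return bytes(alphabet[s] for s in seq)

    return encode(query), encode(target), len(alphabet)


def levenshtein(a: bytes, b: bytes) -> int:
    """Textbook dynamic-programming edit distance over two byte strings."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution or match
        previous = current
    return previous[-1]


# "\u00e1" is á and "\u00e9" is é, each a single code point.
q, t, alphabet_size = remap_to_bytes("\u00e1", "\u00e9")
print(levenshtein(q, t), alphabet_size)  # prints "1 2"
```

Because the remapping only needs hashing and equality, the same helpers also accept lists of words or tuples, not just strings.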

I believe the Python feature I described should be good enough for your case, right? I wanted to point you to the documentation on PyPI about this feature, but I see now that there is none. I believe there is actually a PR waiting to fix this which I forgot to tend to, so I will get to that now.

To summarize: for now, use the Python feature for sequences of hashable objects (run help(edlib.align) to see more about it), and in a couple of months you can hopefully expect even better support.

I will close this issue, but I opened another one to take care of the missing documentation: https://github.com/Martinsos/edlib/issues/141 . If you are interested in tackling it, feel free to do so and create a PR; it should be an easy one. I can give you a hand if needed.

Martinsos commented 4 years ago

Actually I am being silly @jaroslavgratz, this has already been taken care of! I believe this already worked when you posted this comment, so when did you last update edlib? I am trying it now with version 1.3.6, and I get:

>>> edlib.align("á", "é")
{'editDistance': 1, 'alphabetLength': 2, 'locations': [(None, 0)], 'cigar': None}