Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Incorrect editDistance on Russian Cyrillic characters #104

Closed reefactor closed 6 years ago

reefactor commented 6 years ago

Hi, I'm using edlib 1.2.1 from PyPI on Python 3.5.2 (inside the python:3.5.2 Docker image). It gives unexpected, incorrect output on some Russian Cyrillic characters:

>>> import edlib
>>> edlib.align('ты милая', 'ты гений')["editDistance"]
1
>>> edlib.align('м', 'г')["editDistance"]
0

Can you suggest what might be wrong?

reefactor commented 6 years ago

Dockerfile:

FROM python:3.5.2

RUN pip3 install -r /tmp/requirements.txt

reefactor commented 6 years ago

requirements.txt:

Flask_Cors==3.0.2
Flask-Migrate==2.1.0
alembic==0.9.4
Flask-Script==2.0.5
Flask-SQLAlchemy==2.2
Flask-Mail==0.9.1
Flask-Security==3.0.0
PyMySQL==0.7.9
requests==2.13.0
gevent==1.2.1
gunicorn==19.7
PyYAML==3.12
jsonschema==2.5.1
transliterate==1.9
mako==1.0.6
blinker==1.4
stringdist==1.0.9
schedule==0.4.3
flask-paginate==0.5.0
elasticsearch==5.4.0
pandas==0.20.3
xlrd==1.0.0
raven==6.1.0
ujson==1.35
edlib==1.2.1

Martinsos commented 6 years ago

Hi! I am currently on the go, but please check out #79 and #89. I believe the problem is that Cyrillic letters are represented with multiple bytes; in those two issues I explained the same thing and how to possibly resolve it. I am looking into implementing support for this in edlib but haven't got to it yet. A possible workaround is mapping to non-character codes, as explained in those issues. Let me know how you find that!
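
For illustration, here is a minimal sketch of that kind of workaround, not the exact code from #79 or #89. It first prints the UTF-8 bytes of 'м' and 'г' to show that each takes two bytes, then remaps every distinct character of both strings to a code point in 1..127, which UTF-8 encodes as a single byte. The helper name `to_single_byte_alphabet` is hypothetical, and the approach assumes the two strings together contain at most 127 distinct characters.

```python
def to_single_byte_alphabet(query, target):
    """Remap both strings onto code points 1..127 (hypothetical helper).

    Each distinct character gets its own code point, so equal characters
    stay equal and different characters stay different, while every
    remapped character occupies exactly one byte in UTF-8.
    """
    alphabet = sorted(set(query) | set(target))
    if len(alphabet) > 127:
        raise ValueError("more than 127 distinct characters in the input")
    # Start at 1 to avoid the NUL character.
    table = {ch: chr(i + 1) for i, ch in enumerate(alphabet)}
    encoded_query = "".join(table[ch] for ch in query)
    encoded_target = "".join(table[ch] for ch in target)
    return encoded_query, encoded_target


# Both letters are two bytes long in UTF-8 and share the same first byte.
print('м'.encode('utf-8'), 'г'.encode('utf-8'))

q, t = to_single_byte_alphabet('ты милая', 'ты гений')

try:
    import edlib  # optional: only needed for the final call
except ImportError:
    edlib = None

if edlib is not None:
    # The remapped strings contain only single-byte characters.
    print(edlib.align(q, t)["editDistance"])
```

Since only the edit distance is requested here, nothing has to be decoded afterwards; if alignment paths were mapped back to the original text, the table would need to be kept around.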

reefactor commented 6 years ago

Thanks for the suggestions in #79 and #89. I would need to write an encoder/decoder, which may add overhead :) Your implementation is extremely fast on large strings, and I'm looking forward to better Unicode support from you.

Martinsos commented 6 years ago

I understand, support for more multi-byte sequences is on the list :)! So is a speed-up for short sequences.