Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Distance with special characters #114

Closed erikradisch closed 6 years ago

erikradisch commented 6 years ago

It seems to me that edlib does not calculate the right distance when the input contains special characters (with diacritics). For example, übund - ubung should have a distance of 1, but I end up with 3. Is this a bug or is it intended?

Martinsos commented 6 years ago

Hi @erikradisch, thanks for reaching out! Please check the similar issues #109, #104, #79 and #89; each of them should hold an answer to your question, with a more detailed explanation and suggestions from my side. In short: unfortunately, edlib does not support multibyte characters for now. ü is represented as two chars, so the edit distance is not what you would expect. It is not a bug; it is expected behaviour at the moment. However, I do plan to add support for multibyte strings, and actually for any type of sequence, very soon. By the way, would you mind answering a question (I am trying to better understand how people use edlib): are you using edlib as a Python package or as a C/C++ library? What is the main purpose you are using it for? Thanks!
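
The byte-versus-character effect described in this comment can be reproduced without edlib. The sketch below is only an illustration: `levenshtein` is a hypothetical helper using the textbook dynamic-programming recurrence, not edlib's implementation, and it assumes the strings reach the library UTF-8 encoded. It compares the two words from the report once as bytes and once as Unicode code points.

```python
def levenshtein(a, b):
    """Plain O(len(a) * len(b)) edit distance over any two sequences.

    Hypothetical helper for illustration; not how edlib computes distances.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


query, target = "übund", "ubung"

# Assumption: the input is UTF-8, where "ü" occupies two bytes (0xC3 0xBC).
byte_distance = levenshtein(query.encode("utf-8"), target.encode("utf-8"))
char_distance = levenshtein(query, target)

print(byte_distance)  # 3
print(char_distance)  # 2
```

Counted per byte, the result is the 3 reported above. Counted per code point, these exact strings differ in two places (ü/u and d/g), so the distance is 2; for übung - ubung it would be 1.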

erikradisch commented 6 years ago

Sure! I use the Python package, to align historical place names to a gazetteer. Your algorithm has two huge pluses. First, it can be aborted once the Levenshtein distance reaches a limit. Second, you can define additional equalities, which is very important because historical place names contain a lot of predictable differences that are in fact equalities (c instead of k, for example).
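
The two features mentioned in this comment can be sketched in plain Python. This is a minimal illustration under stated assumptions, not edlib's algorithm: `bounded_distance`, its `k` cutoff and its `equalities` argument are hypothetical names chosen for the example. In edlib's Python package the corresponding options are, as far as I am aware, the `k` and `additionalEqualities` arguments of `edlib.align`.

```python
def bounded_distance(a, b, k=-1, equalities=()):
    """Edit distance with an optional cutoff and extra pairs treated as equal.

    Hypothetical helper for illustration. Returns -1 when the distance
    exceeds k; a negative k means "no limit".
    """
    # Treat each declared pair as equal in both directions.
    equal = set(equalities) | {(y, x) for x, y in equalities}
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y or (x, y) in equal else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # The row minimum never decreases from one row to the next,
        # so once it exceeds k the final distance must exceed k too.
        if 0 <= k < min(curr):
            return -1
        prev = curr
    return prev[-1] if k < 0 or prev[-1] <= k else -1


print(bounded_distance("coblenz", "koblenz"))                           # 1
print(bounded_distance("coblenz", "koblenz", equalities=[("c", "k")]))  # 0
print(bounded_distance("coblenz", "leipzig", k=2))                      # -1
```

For gazetteer matching, the cutoff lets clearly unrelated candidates be rejected early, and the equality pairs keep predictable spelling variants from counting as edits.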

Martinsos commented 6 years ago

Awesome, thanks for sharing, and it is nice to hear those features are useful :). When we do #90 and #77 it should support your use case even better.