Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

edlib and unicode in python3 #89

Closed christian-storm closed 7 years ago

christian-storm commented 7 years ago

Hi Martin,

I've been playing around with your great edlib library in python. Wow, is it ever fast!

I'm trying to use edlib for text comparison and am getting some odd behavior with unicode. I've copied my little test program below so you can see what I mean. The issue stems from the fact that text is encoded into bytes (edlib.pyx) before aligning. Would casting the python string as unicode instead of bytes alleviate this issue? Not sure if you have any thoughts on how one might add unicode support.

Output of test program:

a and b both ascii
a: Testing What
b: Testing_What
cigar: 7=1X4= expected cigar: 7=1X4=

a is ascii and b is ascii with one unicode/multi-byte character
a: Testing What
b: Testing✐What
cigar: 7=1X2D2=2I expected cigar: 7=1X4=

a and b are ascii with one unicode/multi-byte character in the same position
a: Testing✑What
b: Testing✐What
cigar: 9=1X2= expected cigar: 7=1X4=

edlib_test.py

import edlib

a = "Testing What"
b = "Testing_What"

result = edlib.align(a, b, task="path")
print("a and b both ascii")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))

a = "Testing What"
b = "Testing✐What"

result = edlib.align(a, b, task="path")
print("a is ascii and b is ascii with one unicode/multi-byte character")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))

a = "Testing✑What"
b = "Testing✐What"

result = edlib.align(a, b, task="path")
print("a and b are ascii with one unicode/multi-byte character in the same position")
print("a: {}".format(a))
print("b: {}".format(b))
print("cigar: {} expected cigar: {}\n".format(result['cigar'], '7=1X4='))
Martinsos commented 7 years ago

Hi @christian-storm, thank you for reaching out and thank you for the kind words :). Ah yes, the problem here is with the multibyte characters. Edlib's core is written in C++, and it takes each sequence as an array of chars, where one char is one byte. The Python package transforms the Python string into a sequence of bytes and passes that to the C++ code.

So the problem is that Edlib assumes one byte is one character, and there is no way around it currently. When you give it a multibyte character, it treats each byte of that character as an independent character.
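
To make this concrete, here is a small check in plain Python with no edlib involved. It assumes the strings are encoded as UTF-8 before alignment, which is what the report above suggests but is not confirmed here. It compares the character count with the byte count for the strings from the test program:

```python
# Compare character count with UTF-8 byte count (assumed encoding).
a = "Testing What"
b = "Testing✐What"

a_bytes = a.encode("utf-8")
b_bytes = b.encode("utf-8")

print(len(a), len(a_bytes))  # characters vs. bytes for the pure-ASCII string
print(len(b), len(b_bytes))  # characters vs. bytes with one multibyte character
print(list("✐".encode("utf-8")))  # the separate bytes a byte-oriented aligner would see
```

Both strings have the same number of characters, but the second one is longer in bytes, so a byte-based alignment no longer lines up with what you see on screen.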

One way to handle this right now is by transforming your alphabet. That is, if you can map those multibyte characters to some free single-byte characters that you are not otherwise using, you can run edlib and you will get correct results. However, if you are already using all the single-byte characters, then there is no way around it currently.
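
A minimal sketch of that alphabet mapping, in plain Python. The helper name `remap_to_single_byte` is hypothetical and not part of edlib. It also assumes the strings are later encoded as UTF-8, so it only uses code points 1..127, each of which encodes to exactly one byte:

```python
def remap_to_single_byte(*sequences):
    """Map each distinct character to a code point in 1..127 (hypothetical helper).

    Under UTF-8 (an assumption about the encoding step), characters in that
    range take exactly one byte, so a byte-oriented aligner sees one symbol
    per original character.
    """
    alphabet = sorted(set("".join(sequences)))
    if len(alphabet) > 127:
        raise ValueError("more than 127 distinct characters; cannot remap")
    table = {ch: chr(i + 1) for i, ch in enumerate(alphabet)}
    return ["".join(table[ch] for ch in seq) for seq in sequences]


# All sequences must be remapped together so they share one table.
a2, b2 = remap_to_single_byte("Testing✑What", "Testing✐What")
print(len(a2.encode("utf-8")), len(b2.encode("utf-8")))
```

The remapped strings can then be passed to `edlib.align(a2, b2, task="path")` in place of the originals. Since alignment only depends on which symbols are equal, the edit distance and CIGAR string should correspond to the original characters.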

In the future, what I could do is improve Edlib so it does not take an array of chars. Instead, it would take an array of objects that have an equality operator defined on them, making it more general. However, that is not a small change, so I can't really promise anything at the moment.
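
To illustrate what such a generalized interface might look like, here is a sketch in Python of edit distance over arbitrary sequences whose items only need to support `==`. It uses the textbook dynamic programming approach and is not Edlib's optimized implementation. The function name `edit_distance` is chosen for illustration only:

```python
def edit_distance(query, target):
    """Levenshtein distance between two sequences of comparable items.

    Textbook O(len(query) * len(target)) dynamic programming, shown only to
    illustrate the interface; it is not how Edlib computes alignments.
    """
    previous = list(range(len(target) + 1))
    for i, q_item in enumerate(query, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if q_item == t_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


# Python strings are sequences of code points, so multibyte characters count once.
print(edit_distance("Testing What", "Testing✐What"))
# Any items with equality work, for example tuples mixed with strings.
print(edit_distance(["a", ("b", 1)], ["a", ("b", 2), "c"]))
```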

Hope that helps! By the way, there was a very similar issue where I explained the idea of mapping multibyte characters to single-byte characters in more detail; you can check it out here: https://github.com/Martinsos/edlib/issues/79.

christian-storm commented 7 years ago

Thanks Martinsos. I'll keep my fingers crossed that you find some time :) I'd dig in myself if I weren't so rusty at C++.

Martinsos commented 7 years ago

Thanks :). By the way, what are you using edlib for? Is mapping the alphabet to single-byte characters not an option for you?

christian-storm commented 7 years ago

Apologies for not responding. I was looking for a faster replacement for Python's difflib.