Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Failure to align strings represented in Unicode (misleading results) #109

Closed xtknight closed 6 years ago

xtknight commented 6 years ago

In Python:

edlib.align("고통스러워", "고통스럽다", task="path", mode="NW")
{'locations': [(0, 4)], 'cigar': '5=', 'alphabetLength': 5, 'editDistance': 0}

(I think that the first three characters should be kept the same and the last two should be changed, but the 'cigar' output implies all five characters are the same!)

Is there any possible way to get this working properly for characters only representable in Unicode, even if it's messy? :( Or, at least, if I input it as UTF-16 or UTF-32 can the algorithm just work on 2 or 4 byte intervals? I realize UTF-8 might be difficult because it's variable-length but...desperate to get something working just for research purposes only. Any pointers would be greatly appreciated.

Martinsos commented 6 years ago

Hi @xtknight, sorry for late response: please check this issue, as it is exactly the same thing but with cyrilic characters: https://github.com/Martinsos/edlib/issues/104.

I would advise for now encoding the characters you have into bytes, as mentioned in the link I provided, if that is enough bits for all your characters.

I plan to add support for custom types of characters, that would support your case, however I have not yet got to it and it will probably take me some time to find to do it!

Martinsos commented 6 years ago

@xtknight did the answer help? Should we close this one?

xtknight commented 6 years ago

@Martinsos sorry for the lack of response. I suppose this would work fine, and eventually that's what I'll probably end up doing. But even in this case, I think it might break with certain UTF characters that require special sequences. And Unicode equivalence also would have no chance of working properly. In the ideal situation, the library would accept Unicode and handle Unicode equivalence, etc...but I realize that might be difficult to implement.

I actually went ahead and modified all the C++ source code to try to use UTF-16 a while ago and ended up with a mess. For some reason things weren't working as I desired. If it were something useful I certainly would have posted it here, and I'll have to use vague terms because I don't remember everything exactly, but the interface between Python and C++ also only transmitted the first byte of Unicode characters. On the C++ side there were just a bunch of problems because of the way things were hard-coded to use 8-bytes, such as the constant declarations exploding in size (changing sizes from 256(8-bit) to 65536(16-bit) basically) and causing extremely hard-to-debug segmentation faults. I'd love to contribute and help, but due to my lack of familiarity with the code base I don't think it would be practical and I'd probably just end up making bugs. I ended up with some extremely stripped down version that didn't perform the kind of alignment I was looking for and eventually couldn't figure out the alignment algorithm or why, and figured I wouldn't be able to receive support in that matter as I had already mangled the code quite a bit. Sorry for the vagueness.

I don't remember exactly where I left off, but anyways I think the only sensible solution to this issue is to accept an arbitrary type of any size and perform alignments based on that arbitrary type, rather than forcing to 8-bit unsigned integer input. And also we'd need an arbitrary equality operator function for the arbitrary types as well. Anyways, I'm sure you're aware this is probably the best solution but it's just a matter of implementation. If you already have a bug open for implementing such a feature, it would make sense to mark this as duplicate or close this. Otherwise, I suppose it's quite an important feature for the library to accept at least true Unicode from the Python side. Thanks~

Martinsos commented 6 years ago

Hi @xtknight , I completely agree, and there is actually an issue for that, and it will be one of the very next things to implement: https://github.com/Martinsos/edlib/issues/90 . I don't think it is actually that hard to implement, however it does take good understanding of how things are now implemented right now and will require changes on multiple places, plus updating both C/C++ and python interface, so it will take decent amount of work. I appreciate your effort and I am sorry I can't help more right now, but I do hope we will have this enhancement soon :)! If you would like to take it on I would be completely fine with that, but it will require you to understand a few concepts and then propagate it through whole edlib. My thought was actually to maybe first switch from C/C++ to pure C++ interface, which will also allow this change to be more easier to do. I am closing this issue then but feel free to comment further if you wish.