Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

UnicodeEncodeError in Python binding #184

Closed ProKil closed 2 years ago

ProKil commented 2 years ago

Describe the bug: UnicodeEncodeError is raised when there are 129 or more unique values in query and target combined.

To Reproduce:

>>> from edlib import align
>>> align(list(range(129)), list(range(129)))
Traceback (most recent call last):
  File "edlib.pyx", line 32, in edlib._map_to_bytes
  File "edlib.pyx", line 19, in edlib._map_ascii_string
edlib.NeedsAlphabetMapping

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "edlib.pyx", line 95, in edlib.align
  File "edlib.pyx", line 48, in edlib._map_to_bytes
  File "edlib.pyx", line 47, in edlib._map_to_bytes.lambda
UnicodeEncodeError: 'ascii' codec can't encode character '\x80' in position 128: ordinal not in range(128)

Expected behavior: Either return {'editDistance': 0, 'alphabetLength': 129, 'locations': [(None, 128)], 'cigar': None} or raise a ValueError.

Additional context: a quick fix would be to change the 256 in this line to 128.

Martinsos commented 2 years ago

Thanks for reporting this!

So the problem seems to be in the .encode('ascii') call. The rest of the code supports 256 different values, but ASCII does not, and that is where it fails. What we are trying to do in that piece of code is map the given alphabet into bytes, and we do it by first mapping each symbol to the character whose ordinal is the symbol's index in the alphabet (so up to 256 distinct ordinals, depending on how big the alphabet is) and then encoding those characters into bytes using ASCII. However, if there are more than 128 unique symbols this breaks, because ASCII only covers ordinals 0-127. So we are doing this the wrong way.
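
To make the failure concrete, here is a minimal illustration (plain Python, not edlib code) of exactly where the ASCII encoding gives up, and why converting the index to a byte directly does not have that limit:

>>> chr(127).encode('ascii')             # ordinals 0-127 are fine
b'\x7f'
>>> chr(128).encode('ascii')             # ordinal 128 is outside ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\x80' in position 0: ordinal not in range(128)
>>> (128).to_bytes(1, byteorder='big')   # a plain byte can hold any index from 0 to 255
b'\x80'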

Off the bat I have the following ideas on how to fix this:

  1. Use utf-8 instead of ascii. I am not sure this is an OK fix; my knowledge of encodings is superficial and I would have to investigate it further (see the quick check right after this list).
  2. Instead of first mapping every letter in the alphabet to an index, then mapping the index to a char, and then encoding that char as a byte, why don't we generate the bytes directly from the indexes and skip the step where we go through chars?
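
For (1), a quick check (plain Python again, nothing edlib-specific) suggests utf-8 alone would not be a drop-in replacement: code points above 127 encode to two bytes each, so one alphabet symbol would no longer map to exactly one byte on the C side:

>>> chr(127).encode('utf-8')   # still a single byte
b'\x7f'
>>> chr(128).encode('utf-8')   # two bytes for one symbol, which would throw the alignment off
b'\xc2\x80'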

For (2), I am guessing we should modify the code to be something like:

        # Map each symbol of the combined alphabet directly to a single byte, by its index.
        alphabet = query_vals.union(target_vals)
        if len(alphabet) > 256:
            raise ValueError(
                "query and target combined have more than 256 unique values, "
                "this is not supported.")
        alphabet_to_byte_mapping = {
            c: idx.to_bytes(1, byteorder='big') for idx, c in enumerate(alphabet)
        }
        map_seq = lambda seq: b''.join(alphabet_to_byte_mapping[c] for c in seq)
        query_bytes = map_seq(query)
        target_bytes = map_seq(target)
        if additional_equalities is not None:
            additional_equalities = [
                (alphabet_to_byte_mapping[a].decode('utf-8'), alphabet_to_byte_mapping[b].decode('utf-8'))
                for a, b in additional_equalities
                if a in alphabet_to_byte_mapping and b in alphabet_to_byte_mapping]
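
As a quick standalone sanity check of the same idea (this is outside edlib.pyx, and map_to_bytes is a made-up name just for illustration), the mapping handles the 129-value case from this issue without going through any text encoding:

    # Illustrative standalone version of the byte-mapping idea above.
    def map_to_bytes(query, target):
        alphabet = set(query).union(set(target))
        if len(alphabet) > 256:
            raise ValueError(
                "query and target combined have more than 256 unique values, "
                "this is not supported.")
        mapping = {c: idx.to_bytes(1, byteorder='big') for idx, c in enumerate(alphabet)}
        map_seq = lambda seq: b''.join(mapping[c] for c in seq)
        return map_seq(query), map_seq(target)

    # The case reported in this issue no longer hits an encoding limit:
    query_bytes, target_bytes = map_to_bytes(list(range(129)), list(range(129)))
    assert query_bytes == target_bytes and len(query_bytes) == 129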

@jbaiter I believe you authored this piece of code; it would be great to hear your opinion on this, since you probably have better insights into it than I do.

jbaiter commented 2 years ago

Phew, this was quite some time ago! But I think your analysis of the root cause is correct @Martinsos, the .encode('ascii') is a pretty dumb way to do this in hindsight, and as you correctly noticed, the general approach was pretty convoluted as well :-D

Judging from a quick reading of your proposed fix, it looks good to me (and so, so much simpler and more straightforward than the one from my PR). Have you tried it against the test suite?

Martinsos commented 2 years ago

Hey, thanks for the quick answer :)! Great, I just wanted to make sure I am making sense with this. Since you think it is a reasonable approach, I will take some time over the next few days to run it against the test suite, and if it works I will update the code and publish a fixed version!

Martinsos commented 2 years ago

I updated the code with the changes I suggested, and fixed one line in edlib.cpp where a char was overflowing when the alphabet had 256 values; it all seems to work now! I released a new version of the Python binding, v1.3.9, on PyPI, and I also created a new release on GitHub. @ProKil if you upgrade to the newest Python version it should work. I will close the issue for now, but let me know if the problem persists!
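
For reference, with the new version the original repro should now return the kind of result the issue asked for instead of raising (the dict below is the expected output from the issue description, not a captured run, and assumes the default task):

>>> from edlib import align
>>> align(list(range(129)), list(range(129)))
{'editDistance': 0, 'alphabetLength': 129, 'locations': [(None, 128)], 'cigar': None}
>>> align(list(range(256)), list(range(256)))['editDistance']   # the full 256-value alphabet should work now as well
0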