Closed: ProKil closed this issue 3 years ago
Thanks for reporting this!
So the problem seems to be in the `.encode('ascii')` call. The rest of the code supports 256 different values, but ascii does not, and this is where it fails. What we are trying to do in that piece of code is map the given alphabet into bytes, and we do it by first mapping the symbols to characters with ordinal codes from 1 to 256 (depending on how big the alphabet is) and then encoding those characters into bytes using ascii. However, if there are more than 128 characters, this breaks, because ascii does not support that many different values! So we are doing this the wrong way.
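A minimal sketch of the failure mode (the alphabet size here is illustrative, not taken from the actual report):

```python
# Build symbols with ordinal codes 1..199, mimicking the mapping
# described above (sizes are hypothetical, not edlib's actual values).
symbols = [chr(i) for i in range(1, 200)]

try:
    ''.join(symbols).encode('ascii')
except UnicodeEncodeError:
    print("ascii cannot encode ordinals above 127")

# utf-8 does not raise, but characters above U+007F become two bytes
# each, so the encoded string no longer maps one byte per symbol:
print(len(chr(150).encode('utf-8')))  # 2, not 1
```

This also shows why switching the encoding alone is not a clean fix: utf-8 avoids the exception but breaks the one-byte-per-symbol assumption.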
Off the bat I have the following ideas on how to fix this:
1. Use `utf-8` instead of `ascii`. I am not sure if this is an OK fix; my knowledge of encodings is superficial and I would have to investigate it further.
2. Map the alphabet directly to bytes, skipping the character/encoding step entirely.

For (2), I am guessing we should modify the code to be something like:
```python
alphabet = query_vals.union(target_vals)
if len(alphabet) > 256:
    raise ValueError(
        "query and target combined have more than 256 unique values, "
        "this is not supported.")
alphabet_to_byte_mapping = {
    c: idx.to_bytes(1, byteorder='big') for idx, c in enumerate(alphabet)
}
map_seq = lambda seq: b''.join(alphabet_to_byte_mapping[c] for c in seq)
query_bytes = map_seq(query)
target_bytes = map_seq(target)
if additional_equalities is not None:
    # latin-1 maps bytes 0-255 to characters 1:1, so this works for the
    # whole alphabet (utf-8 would fail to decode single bytes >= 128).
    additional_equalities = [
        (alphabet_to_byte_mapping[a].decode('latin-1'),
         alphabet_to_byte_mapping[b].decode('latin-1'))
        for a, b in additional_equalities
        if a in alphabet_to_byte_mapping and b in alphabet_to_byte_mapping]
```
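As a sanity check on the idea above, a direct-to-bytes mapping handles a full 256-symbol alphabet with no encoding step involved (the symbols below are arbitrary, chosen just to fall outside the ascii range):

```python
# 256 unique characters, all deliberately outside the ascii range.
alphabet = [chr(i) for i in range(256, 512)]
mapping = {c: idx.to_bytes(1, byteorder='big') for idx, c in enumerate(alphabet)}

seq = alphabet[:5] + alphabet[-3:]
mapped = b''.join(mapping[c] for c in seq)
print(len(mapped))   # exactly one byte per symbol: 8
print(mapped[:5])    # b'\x00\x01\x02\x03\x04'
```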
@jbaiter I believe you authored this piece of code; it would be great to hear your opinion on this. I guess you will probably have better insights than me?
Phew, this was quite some time ago!
But I think your analysis of the root cause is correct @Martinsos, the `.encode('ascii')` is a pretty dumb way to do this in hindsight, and as you correctly noticed, the general approach was pretty convoluted as well :-D
Judging from a quick reading of your proposed fix, it looks good to me (and so so much simpler and more straightforward than the one from my PR), have you tried it against the test suite?
Hey, thanks for the quick answer :)! OK great, I just wanted to be sure I am making sense with this! Since you think it is a reasonable approach, I will take some time these days to test it against the test suite, and if it works, I will update the code and publish a fixed version!
I updated the code with the changes I suggested, and fixed one line in edlib.cpp where a char was overflowing when the alphabet size was 256 -> it seems everything works now! I released a new version of the Python binding, v1.3.9 on PyPI, and I also created a new release on GitHub. @ProKil if you download the newest Python version, it should work. I will close the issue for now, but let me know if the problem still persists!
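The char overflow mentioned above is, in general shape, the classic signed-char wraparound: a signed char holds -128..127, so the 256th byte value (255) wraps to -1. A pure-Python illustration of that wrap (a sketch, not the actual edlib.cpp code):

```python
def as_signed_char(value):
    # Interpret the low 8 bits of value as a two's-complement signed char,
    # the way a C `char` does on platforms where char is signed.
    value &= 0xFF
    return value - 256 if value > 127 else value

print(as_signed_char(255))  # -1: the last value of a 256-symbol alphabet wraps
print(as_signed_char(127))  # 127: still in range
```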
**Describe the bug**
UnicodeEncodeError raised when there are more than 129 unique values in query and target combined.
**To Reproduce**
**Expected behavior**
Either return

```python
{'editDistance': 0, 'alphabetLength': 129, 'locations': [(None, 128)], 'cigar': None}
```

or raise ValueError.
**Additional context**
Quick fix: change 256 in this line to 128.
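In that spirit, the quick fix amounts to a guard like the following (a hypothetical sketch; `check_alphabet` is not an actual edlib function):

```python
def check_alphabet(query, target, max_size=128):  # was 256
    # Refuse alphabets that ascii cannot encode, instead of crashing
    # later with UnicodeEncodeError.
    alphabet = set(query) | set(target)
    if len(alphabet) > max_size:
        raise ValueError(
            f"query and target combined have {len(alphabet)} unique "
            f"values, only {max_size} are supported.")
    return alphabet

print(len(check_alphabet("abc", "bcd")))  # 4
```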