TkTech / mutf8

Pure-python and optional C encoders/decoders for MUTF-8/CESU-8.
MIT License
11 stars 3 forks source link

Surrogate pair encoding error #6

Open gentlegiantJGC opened 1 year ago

gentlegiantJGC commented 1 year ago

I have been doing a deep dive into the the code of this library and found something funky going on with the surrogate pair encoding.

The issue seems to be from the top 4 bits but I can't find a simple document explaining how they are supposed to work.

I will update this with more info when I have it but my current findings are below.

Column 1 is v Column 2 is ord(decode_modified_utf8(encode_modified_utf8(chr(v)))) (note how the last 4 values do not match the input Column 3 is encode_modified_utf8(chr(v)) Column 4 is column 3 in binary

65536       65536       b'\xed\xa1\x80\xed\xb0\x80'     0000000000000000111011011010000110000000111011011011000010000000
131072      196608      b'\xed\xa2\x80\xed\xb0\x80'     0000000000000000111011011010001010000000111011011011000010000000
262144      327680      b'\xed\xa4\x80\xed\xb0\x80'     0000000000000000111011011010010010000000111011011011000010000000
524288      589824      b'\xed\xa8\x80\xed\xb0\x80'     0000000000000000111011011010100010000000111011011011000010000000
1048576     65536       b'\xed\xa0\x80\xed\xb0\x80'     0000000000000000111011011010000010000000111011011011000010000000

I think the issue is on encoding because using other decoding tools gives the same value decoding the encoded value.

gentlegiantJGC commented 1 year ago

This is the best documentation I have found on this part of the format. https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

It looks like you need to add 0x10000 not or it like you are

TkTech commented 1 year ago

Thanks for the effort on this. Do you have test strings (or a file) this was failing to parse so I can write tests for it and verify your fixes?

gentlegiantJGC commented 1 year ago

Here are some unicode indexes and the mutf-8 byte sequences that they should encode to.

(
    (0, b"\xC0\x80"),
    (1, b"\x01"),
    (2, b"\x02"),
    (4, b"\x04"),
    (8, b"\x08"),
    (16, b"\x10"),
    (32, b"\x20"),
    (64, b"\x40"),
    (128, b"\xc2\x80"),
    (256, b"\xc4\x80"),
    (512, b"\xc8\x80"),
    (1024, b"\xd0\x80"),
    (2048, b"\xe0\xa0\x80"),
    (4096, b"\xe1\x80\x80"),
    (8192, b"\xe2\x80\x80"),
    (16384, b"\xe4\x80\x80"),
    (32768, b"\xe8\x80\x80"),
    (65536, b"\xed\xa0\x80\xed\xb0\x80"),
    (131072, b"\xed\xa1\x80\xed\xb0\x80"),
    (262144, b"\xed\xa3\x80\xed\xb0\x80"),
    (524288, b"\xed\xa7\x80\xed\xb0\x80"),
    (1048576, b"\xed\xaf\x80\xed\xb0\x80"),
)
TkTech commented 1 year ago

I took a quick look and your changes break other tests that I'm fairly confident are correct, so I will need to take a closer look at this after the long weekend.

tuxor1337 commented 1 year ago

@TkTech Any news on this? Here's an implementation of the conversion that doesn't have this bug: https://gist.github.com/BarelyAliveMau5/000e7e453b6d4ebd0cb06f39bc2e7aec Unfortunately, it's just a random Gist without PyPI package or so. Of course, it would be desirable to have a working implementation as part of a well-maintained package.

Here's the example that lead me here:

>>> import mutf8
>>> mutf8.encode_modified_utf8("𝕭")
b'\xed\xa1\xb5\xed\xb5\xad'
>>> utf8s_to_utf8m("𝕭".encode())
b'\xed\xa0\xb5\xed\xb5\xad'
gentlegiantJGC commented 7 months ago

The thing that originally brought this to my attention was the bow and arrow emoji. This 🏹 should encode to \xed\xa0\xbc\xed\xbf\xb9 in mutf8 but your library encodes it to \xed\xa1\xbc\xed\xbf\xb9