Open gentlegiantJGC opened 1 year ago
This is the best documentation I have found on this part of the format. https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
It looks like you need to add 0x10000, not OR it in like you are doing.
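To illustrate the difference between adding and OR-ing the offset, here is a minimal sketch of combining a UTF-16 surrogate pair back into a code point. The function names are made up for this example, and `combine_or` is only a guess at what the "OR" variant might look like, not a copy of the code under review.

```python
def combine_add(hi: int, lo: int) -> int:
    """Combine a UTF-16 surrogate pair by adding the 0x10000 offset."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def combine_or(hi: int, lo: int) -> int:
    """Hypothetical buggy variant: ORs the offset in instead of adding it."""
    return 0x10000 | ((hi - 0xD800) << 10) | (lo - 0xDC00)


# U+20000 is the surrogate pair D840 DC00.
print(hex(combine_add(0xD840, 0xDC00)))  # 0x20000
print(hex(combine_or(0xD840, 0xDC00)))   # 0x10000
```

The two only agree when bit 16 of the shifted high-surrogate payload is clear, which is why small supplementary code points such as U+10000 can hide the problem.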
Thanks for the effort on this. Do you have test strings (or a file) this was failing to parse so I can write tests for it and verify your fixes?
Here are some Unicode code points and the MUTF-8 byte sequences that they should encode to.
(
(0, b"\xC0\x80"),
(1, b"\x01"),
(2, b"\x02"),
(4, b"\x04"),
(8, b"\x08"),
(16, b"\x10"),
(32, b"\x20"),
(64, b"\x40"),
(128, b"\xc2\x80"),
(256, b"\xc4\x80"),
(512, b"\xc8\x80"),
(1024, b"\xd0\x80"),
(2048, b"\xe0\xa0\x80"),
(4096, b"\xe1\x80\x80"),
(8192, b"\xe2\x80\x80"),
(16384, b"\xe4\x80\x80"),
(32768, b"\xe8\x80\x80"),
(65536, b"\xed\xa0\x80\xed\xb0\x80"),
(131072, b"\xed\xa1\x80\xed\xb0\x80"),
(262144, b"\xed\xa3\x80\xed\xb0\x80"),
(524288, b"\xed\xa7\x80\xed\xb0\x80"),
(1048576, b"\xed\xaf\x80\xed\xb0\x80"),
)
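For cross-checking the values above, here is a small reference encoder for a single code point. It is a sketch written from the general description of modified UTF-8 (NUL as C0 80, supplementary characters as two three-byte surrogates), not code taken from the library; `encode_code_point` and `_three_bytes` are names invented for this example.

```python
def _three_bytes(unit: int) -> bytes:
    """Encode a 16-bit value (surrogates included) as three UTF-8-style bytes."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as modified UTF-8."""
    if cp == 0:
        # NUL is written as an overlong two-byte sequence.
        return b"\xc0\x80"
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return _three_bytes(cp)
    # Supplementary plane: subtract 0x10000, split into a surrogate pair,
    # then encode each surrogate as its own three-byte sequence.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return _three_bytes(high) + _three_bytes(low)


print(encode_code_point(131072))  # b'\xed\xa1\x80\xed\xb0\x80'
```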
I took a quick look and your changes break other tests that I'm fairly confident are correct, so I will need to take a closer look at this after the long weekend.
@TkTech Any news on this? Here's an implementation of the conversion that doesn't have this bug: https://gist.github.com/BarelyAliveMau5/000e7e453b6d4ebd0cb06f39bc2e7aec Unfortunately, it's just a random Gist without a PyPI package. Of course, it would be desirable to have a working implementation as part of a well-maintained package.
Here's the example that led me here:
>>> import mutf8
>>> mutf8.encode_modified_utf8("𝕭")
b'\xed\xa1\xb5\xed\xb5\xad'
>>> utf8s_to_utf8m("𝕭".encode())
b'\xed\xa0\xb5\xed\xb5\xad'
The thing that originally brought this to my attention was the bow and arrow emoji.
This 🏹 should encode to \xed\xa0\xbc\xed\xbf\xb9 in mutf8, but your library encodes it to \xed\xa1\xbc\xed\xbf\xb9
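For what it's worth, one piece of arithmetic that reproduces both byte sequences for the high surrogate is whether 0x10000 is subtracted before shifting. The sketch below only demonstrates that arithmetic; it is an observation from the byte values, not something confirmed from the library's source, and `high_surrogate_bytes` is a made-up helper.

```python
def high_surrogate_bytes(cp: int, subtract_offset: bool) -> bytes:
    """Return the three bytes of the high surrogate for a supplementary code point.

    subtract_offset=False models a hypothetical encoder that forgets to
    subtract 0x10000 before splitting the code point.
    """
    value = cp - 0x10000 if subtract_offset else cp
    high = 0xD800 + (value >> 10)
    return bytes([
        0xE0 | (high >> 12),
        0x80 | ((high >> 6) & 0x3F),
        0x80 | (high & 0x3F),
    ])


bow = 0x1F3F9  # bow and arrow emoji
print(high_surrogate_bytes(bow, True))   # b'\xed\xa0\xbc'
print(high_surrogate_bytes(bow, False))  # b'\xed\xa1\xbc'
```

The low surrogate only depends on the bottom 10 bits, so it comes out the same either way, which fits the fact that only the second byte differs.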
I have been doing a deep dive into the code of this library and found something funky going on with the surrogate pair encoding.
The issue seems to be from the top 4 bits but I can't find a simple document explaining how they are supposed to work.
I will update this with more info when I have it but my current findings are below.
Column 1 is v
Column 2 is ord(decode_modified_utf8(encode_modified_utf8(chr(v)))) (note how the last 4 values do not match the input)
Column 3 is encode_modified_utf8(chr(v))
Column 4 is column 3 in binary
I think the issue is on encoding because using other decoding tools gives the same value when decoding the encoded value.
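A row with these four columns could be produced by a helper along these lines. This is only a sketch: `table_row` is a hypothetical name, and the encoder and decoder are passed in as arguments so that the library's encode_modified_utf8 / decode_modified_utf8 or any other implementation can be compared the same way.

```python
from typing import Callable, Tuple


def table_row(
    v: int,
    encode: Callable[[str], bytes],
    decode: Callable[[bytes], str],
) -> Tuple[int, int, bytes, str]:
    """Return (v, round-tripped code point, encoded bytes, encoded bytes in binary)."""
    encoded = encode(chr(v))
    round_trip = ord(decode(encoded))
    binary = " ".join(f"{b:08b}" for b in encoded)
    return v, round_trip, encoded, binary


# Stand-in codec for demonstration only: plain UTF-8 from the standard library.
print(table_row(128, lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8")))
# (128, 128, b'\xc2\x80', '11000010 10000000')
```

With the library installed, the intended call would be something like table_row(v, mutf8.encode_modified_utf8, mutf8.decode_modified_utf8) for each v of interest.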