TkTech / mutf8

Pure-python and optional C encoders/decoders for MUTF-8/CESU-8.
MIT License

Error in python code #5

Open VectorASD opened 1 year ago

VectorASD commented 1 year ago

When decoding a 6-byte sequence, the code uses "0x10000 |", which is not correct. Regular UTF-8 in the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx can encode 21 bits, whereas the 6-byte form in MUTF-8 has only 20 bits available, so you need to ADD 0x10000 rather than apply an OR operation. The encoder does not take this 0x10000 into account at all. Try encoding "📋" yourself, for example, and then decoding it. The result is 🥴.
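To illustrate the difference between ADD and OR, here is a minimal sketch of decoding a single 6-byte surrogate pair. The function name `decode_pair` and its `combine_with_or` flag are hypothetical and are not taken from the mutf8 package. The two variants agree while bit 16 of the 20-bit offset is clear (all of plane 1, including "📋") and diverge once it is set, for example at U+20000.

```python
def decode_pair(data: bytes, combine_with_or: bool = False) -> int:
    """Decode one 6-byte MUTF-8/CESU-8 surrogate pair into a code point.

    Hypothetical helper for illustration only.
    Layout: ED 1010vvvv 10wwwwww ED 1011xxxx 10yyyyyy (4 + 6 + 4 + 6 = 20 bits).
    """
    if len(data) != 6 or data[0] != 0xED or data[3] != 0xED:
        raise ValueError("expected a 6-byte surrogate pair")
    offset = (
        (data[1] & 0x0F) << 16
        | (data[2] & 0x3F) << 10
        | (data[4] & 0x0F) << 6
        | (data[5] & 0x3F)
    )
    if combine_with_or:
        # OR silently loses information whenever bit 16 of offset is set.
        return 0x10000 | offset
    # The 20-bit offset is relative to U+10000, so it has to be added.
    return 0x10000 + offset


clipboard = bytes.fromhex("eda0bdedb38b")   # U+1F4CB, offset 0x0F4CB
plane2 = bytes.fromhex("eda180edb080")      # U+20000, offset 0x10000

print(hex(decode_pair(clipboard)), hex(decode_pair(clipboard, True)))
print(hex(decode_pair(plane2)), hex(decode_pair(plane2, True)))
```

For the plane 2 input, the ADD variant yields 0x20000 while the OR variant collapses to 0x10000.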

armijnhemel commented 1 year ago

I tried to replicate it but couldn't:

>>> import mutf8
>>> emoji = '\U0001f4cb'
>>> encoded = mutf8.encode_modified_utf8(emoji)
>>> encoded
b'\xed\xa1\xbd\xed\xb3\x8b'
>>> mutf8.decode_modified_utf8(encoded) == emoji
True

TkTech commented 1 year ago

@VectorASD Would you be able to provide a minimal reproduction example?

VectorASD commented 1 year ago

If you ignore the fact that str_data_b is an instance of my own class (it lets me write the sections of a dex file independently of each other, then join them at the end and fill in the linking data), this is exactly what a fully working MUTF-8 encoder looks like:

import io

def MUTF8(Str):
    # sdb = self.str_data_b
    sdb = io.BytesIO()
    Str = [ord(let) for let in Str]
    L = len(Str) + sum(1 for c in Str if c >= 0x10000)
    #pos = sdb.tell()
    #  sdb.pos()
    #  sdb.uleb128(L)
    for c in Str:
      if c == 0: data = 192, 128
      elif c < 128: data = (c,) # 7 bits
      elif c < 0x800: data = 192 | c >> 6, 128 | (c & 63) # 5 + 6 = 11 bits
      elif c < 0x10000: data = 224 | c >> 12, 128 | (c >> 6 & 63), 128 | (c & 63) # 4 + 6 + 6 = 16 bits
      else:
        c -= 0x10000
        data = ( # 4 + 6 + 4 + 6 = 20 bits
          237, 160 | c >> 16, 128 | (c >> 10 & 63),
          237, 176 | (c >> 6 & 15), 128 | (c & 63))
      sdb.write(bytes(data))
    #  sdb.write(b"\0")
    #pos2 = sdb.tell()
    #sdb.seek(pos)
    #print("•", sdb.data.read(pos2 - pos).hex())
    #assert sdb.tell() == pos2
    print(sdb.getvalue().hex())
MUTF8("\U0001f4cb")

It prints eda0bdedb38b instead of the erroneous eda1bdedb38b.
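As an independent cross-check of the expected bytes, the sketch below derives the 6-byte form using only Python's built-in codecs: the character is split into UTF-16 code units, and each unit is then written as UTF-8 with the "surrogatepass" error handler. The name `cesu8_reference` is made up for this example, and the helper deliberately ignores the MUTF-8 special case for NUL (C0 80).

```python
def cesu8_reference(text: str) -> bytes:
    """Encode text as CESU-8 via the built-in codecs (illustration only).

    Does NOT apply the MUTF-8 rule that encodes U+0000 as C0 80.
    """
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" allows lone surrogates to be written as 3-byte UTF-8.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(cesu8_reference("\U0001f4cb").hex())
print(cesu8_reference("\U00010401").hex())
```

Both results start with ed a0 rather than ed a1, matching the output of the function above.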

armijnhemel commented 1 year ago

I tried one of the examples from https://docs.rs/residua-mutf8/latest/mutf8/ and I can see that the two implementations are giving different results. The Rust version converts \U00010401 to b'\xed\xa0\x81\xed\xb0\x81' whereas the mutf8 python package gives b'\xed\xa1\x81\xed\xb0\x81' as the result:

>>> bla = '\U00010401'
>>> mutf8.encode_modified_utf8(bla)
b'\xed\xa1\x81\xed\xb0\x81'

When testing with some data from Android ( https://android.googlesource.com/platform/development/+/63bf1087ebb06b59e3d82cbc5ccd4485704c6b91/vndk/tools/definition-tool/tests/test_dex_file.py#29 ) I see the same thing happen. So it seems that @VectorASD is correct that there is an error.

armijnhemel commented 1 year ago

It seems that the problem is in the encoding, not the decoding:

>>> bla = '\U00010401'
>>> encoded = mutf8.encode_modified_utf8(bla)
>>> encoded
b'\xed\xa1\x81\xed\xb0\x81'
>>> encoded2 = b'\xed\xa0\x81\xed\xb0\x81'
>>> encoded == encoded2
False
>>> bla == mutf8.decode_modified_utf8(encoded)
True
>>> bla == mutf8.decode_modified_utf8(encoded2)
True
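One explanation that would be consistent with these outputs is an encoder that skips the subtraction of 0x10000 before splitting the value into the two surrogate halves. The sketch below is an assumption made for illustration and has not been checked against the mutf8 source; `encode_supplementary` and its `subtract_offset` flag are hypothetical names.

```python
def encode_supplementary(cp: int, subtract_offset: bool = True) -> bytes:
    """Encode one supplementary code point as a 6-byte surrogate pair.

    Hypothetical helper: subtract_offset=False mimics an encoder that
    forgets to subtract 0x10000 first.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000 if subtract_offset else cp
    return bytes((
        0xED, 0xA0 | (v >> 16) & 0x0F, 0x80 | (v >> 10) & 0x3F,
        0xED, 0xB0 | (v >> 6) & 0x0F, 0x80 | v & 0x3F,
    ))


print(encode_supplementary(0x10401))          # with the subtraction
print(encode_supplementary(0x10401, False))   # without the subtraction
```

With the subtraction the sketch gives the bytes reported for the Rust crate, and without it the bytes reported for the Python package. A decoder that ORs in 0x10000 would map both sequences back to U+10401, which would also explain why both comparisons above return True.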