Open VectorASD opened 1 year ago
I tried to replicate it but couldn't:
>>> import mutf8
>>> emoji = '\U0001f4cb'
>>> encoded = mutf8.encode_modified_utf8(emoji)
>>> encoded
b'\xed\xa1\xbd\xed\xb3\x8b'
>>> mutf8.decode_modified_utf8(encoded) == emoji
True
@VectorASD Would you be able to provide a minimal reproduction example?
Setting aside the fact that str_data_b is an instance of my own class (it lets me write the sections of a dex file independently of one another, then glue them together at the end and fill in the linking data), this is exactly what a fully working MUTF-8 encoder looks like:
import io

def MUTF8(Str):
    # sdb = self.str_data_b
    sdb = io.BytesIO()
    Str = [ord(let) for let in Str]
    # Length in UTF-16 code units: supplementary characters count twice
    L = len(Str) + sum(1 for c in Str if c >= 0x10000)
    # pos = sdb.tell()
    # sdb.pos()
    # sdb.uleb128(L)
    for c in Str:
        if c == 0: data = 192, 128
        elif c < 128: data = (c,)  # 7 bits
        elif c < 0x800: data = 192 | c >> 6, 128 | (c & 63)  # 5 + 6 = 11 bits
        elif c < 0x10000: data = 224 | c >> 12, 128 | (c >> 6 & 63), 128 | (c & 63)  # 4 + 6 + 6 = 16 bits
        else:
            c -= 0x10000
            data = (  # 4 + 6 + 4 + 6 = 20 bits
                237, 160 | c >> 16, 128 | (c >> 10 & 63),
                237, 176 | (c >> 6 & 15), 128 | (c & 63))
        sdb.write(bytes(data))
    # sdb.write(b"\0")
    # pos2 = sdb.tell()
    # sdb.seek(pos)
    # print("•", sdb.data.read(pos2 - pos).hex())
    # assert sdb.tell() == pos2
    print(sdb.getvalue().hex())

MUTF8("\U0001f4cb")
It will print eda0bdedb38b instead of the erroneous eda1bdedb38b.
I tried one of the examples from https://docs.rs/residua-mutf8/latest/mutf8/ and I can see that the two implementations give different results. The Rust version converts \U00010401 to b'\xed\xa0\x81\xed\xb0\x81', whereas the mutf8 Python package gives b'\xed\xa1\x81\xed\xb0\x81' as the result:
>>> bla = '\U00010401'
>>> mutf8.encode_modified_utf8(bla)
b'\xed\xa1\x81\xed\xb0\x81'
When testing with some data from Android ( https://android.googlesource.com/platform/development/+/63bf1087ebb06b59e3d82cbc5ccd4485704c6b91/vndk/tools/definition-tool/tests/test_dex_file.py#29 ) I see the same thing happen. So it seems that @VectorASD is correct that there is an error.
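One rough way to check which of the two outputs is right, using only the standard library: split the code point into its UTF-16 surrogates and encode each surrogate as an ordinary 3-byte sequence with the `surrogatepass` error handler. This is only a sketch for supplementary characters; the helper name `cesu8_reference` is made up for this illustration and is not part of the mutf8 package.

```python
def cesu8_reference(ch):
    """Encode one supplementary code point (>= U+10000) as a 6-byte surrogate pair.

    Hypothetical helper for cross-checking; not part of any library.
    """
    offset = ord(ch) - 0x10000           # 20-bit value left after removing the bias
    high = 0xD800 | (offset >> 10)       # high surrogate carries the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # low surrogate carries the bottom 10 bits
    # 'surrogatepass' lets the UTF-8 codec emit lone surrogates as 3-byte sequences
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))


print(cesu8_reference("\U00010401"))        # b'\xed\xa0\x81\xed\xb0\x81'
print(cesu8_reference("\U0001f4cb").hex())  # eda0bdedb38b
```

Both results agree with the Rust crate and with the hand-written encoder earlier in the thread, not with the current output of the Python package.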
It seems that the problem is in the encoding, not the decoding:
>>> bla = '\U00010401'
>>> encoded = mutf8.encode_modified_utf8(bla)
>>> encoded
b'\xed\xa1\x81\xed\xb0\x81'
>>> encoded2 = b'\xed\xa0\x81\xed\xb0\x81'
>>> encoded == encoded2
False
>>> bla == mutf8.decode_modified_utf8(encoded)
True
>>> bla == mutf8.decode_modified_utf8(encoded2)
True
When decoding a 6-byte sequence, you have "0x10000 |". That is not right. Ordinary UTF-8 in the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx can carry 21 bits, but in MUTF-8 only 20 bits are available, so you need to ADD 0x10000 rather than OR it in. When encoding, this 0x10000 is not taken into account at all. Try encoding "📋" yourself, for example, and then decoding it. The result is 🥴.
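To make the ADD-versus-OR point concrete, here is a minimal sketch of decoding a single 6-byte surrogate pair. It is not the package's actual code; the function name `decode_surrogate_pair` is hypothetical and the input is assumed to be exactly one well-formed pair.

```python
def decode_surrogate_pair(six):
    """Decode one 6-byte MUTF-8 surrogate pair into a one-character str.

    Hypothetical helper for illustration only.
    """
    b1, b2, b3, b4, b5, b6 = six
    if b1 != 0xED or b4 != 0xED or b2 & 0xF0 != 0xA0 or b5 & 0xF0 != 0xB0:
        raise ValueError("not a 6-byte surrogate pair")
    # Reassemble the 20 payload bits: 4 + 6 from the high half, 4 + 6 from the low half
    offset = ((b2 & 0x0F) << 16) | ((b3 & 0x3F) << 10) | ((b5 & 0x0F) << 6) | (b6 & 0x3F)
    # The bias must be added; OR-ing would lose it whenever bit 16 of offset is set
    return chr(0x10000 + offset)


assert decode_surrogate_pair(bytes.fromhex("eda0bdedb38b")) == "\U0001f4cb"
# U+20000 has offset 0x10000, so OR would wrongly give U+10000 here
assert decode_surrogate_pair(bytes.fromhex("eda180edb080")) == "\U00020000"
```

With OR, a decoder maps both eda0... and eda1... style inputs onto the same code point for characters below U+20000, which would explain why the round trip in the earlier session still returned True.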