aviramha / ormsgpack

Msgpack serialization/deserialization library for Python, written in Rust using PyO3 and rust-msgpack. Reboot of orjson. msgpack.org[Python]
Apache License 2.0

Optimize deserialization of strings #298

Closed exg closed 3 weeks ago

exg commented 3 weeks ago

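PyUnicode_New needs to know the maximum code point to be placed in the new Unicode object. An upper bound on the maximum code point in a UTF-8 string is

$$
\begin{cases}
\mathtt{0x7f} & \text{if } m < \mathtt{0x80} \\
\mathtt{0xff} & \text{if } m < \mathtt{0xc4} \\
\mathtt{0xffff} & \text{if } m < \mathtt{0xf0} \\
\mathtt{0x10ffff} & \text{otherwise}
\end{cases}
$$

where m is the maximum byte in the string.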

This commit changes the string deserializer to find the upper bound based on the above formula, using str::bytes().max(). The Rust compiler vectorizes str::bytes().max() (PMAXUB on x86_64, UMAX and UMAXV on arm64), so there is no need to manually vectorize the operation using std::arch or std::simd.
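For illustration, here is a minimal sketch of how such a bound could be computed from the maximum byte alone. It is not the code from this commit: the function name `max_code_point_upper_bound` is hypothetical, and the thresholds are derived from the UTF-8 encoding rules (lead bytes 0xC2 and 0xC3 encode U+0080 to U+00FF, lead bytes below 0xF0 start sequences of at most three bytes).

```rust
/// Hypothetical helper (not part of ormsgpack): returns an upper bound on
/// the largest code point in `s`, looking only at the largest byte of its
/// UTF-8 encoding.
fn max_code_point_upper_bound(s: &str) -> u32 {
    // `str::bytes().max()` is a plain reduction over the bytes; an empty
    // string has no bytes, so fall back to 0 and treat it as ASCII.
    let m = s.bytes().max().unwrap_or(0);
    if m < 0x80 {
        // Only ASCII bytes are present.
        0x7f
    } else if m < 0xc4 {
        // Continuation bytes (0x80..=0xBF) and lead bytes 0xC2/0xC3 only:
        // every code point is at most U+00FF.
        0xff
    } else if m < 0xf0 {
        // Two- or three-byte sequences: every code point is at most U+FFFF.
        0xffff
    } else {
        // A four-byte sequence is present.
        0x10ffff
    }
}
```

In a deserializer along these lines, the returned value would presumably be passed as the maxchar argument of PyUnicode_New. The bound is not always tight: a string whose largest code point is U+0100 (lead byte 0xC4) is reported as 0xffff, which happens to be the storage width such a code point needs anyway.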