PyUnicode_New needs to know the maximum code point to be placed in the new Unicode object. An upper bound on the maximum code point in a UTF-8 string is:
1114111 if m >= 0xf0
65535 if 0xc4 <= m < 0xf0
255 if m < 0xc4
where m is the maximum byte in the string.
This commit changes the string deserializer to compute the upper bound from the above formulation, using str::bytes().max(). The Rust compiler vectorizes str::bytes().max() (PMAXUB on x86_64, UMAX and UMAXV on arm64), so there is no need to vectorize the operation manually using std::arch or std::simd.
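
As a rough illustration of the formulation above (not the deserializer code from this commit), the bound could be computed along these lines. The function name max_code_point_upper_bound is hypothetical, and treating an empty string as having a maximum byte of 0 is an assumption made for the sketch.

```rust
/// Returns an upper bound on the largest code point in `s`, derived only
/// from the largest byte of its UTF-8 encoding.
/// Hypothetical helper for illustration; not taken from the commit itself.
fn max_code_point_upper_bound(s: &str) -> u32 {
    // Assumption: an empty string has no bytes, so treat its maximum as 0.
    let m = s.bytes().max().unwrap_or(0);
    if m >= 0xf0 {
        // A lead byte of 0xf0 or above starts a four-byte sequence.
        1_114_111
    } else if m >= 0xc4 {
        // Lead bytes 0xc4..=0xef start sequences encoding U+0100..=U+FFFF.
        65_535
    } else {
        // ASCII, continuation bytes, and lead bytes 0xc2/0xc3 (U+0080..=U+00FF).
        255
    }
}
```

The result is only an upper bound, not the exact maximum: for example, "ā" (U+0101, encoded as 0xc4 0x81) yields 65535. Continuation bytes (0x80 to 0xbf) are always below 0xc4, so they never raise the result.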