Closed jviotti closed 3 years ago
Alternatively, if the specification is correct, could it be that pysmile
is doing things wrong?
Thank you for submitting this. I will need to re-read specification to remember exact details to answer the question.
No hurries. Let me know if there is anything I can do to help.
@jviotti I think pysmile
is encoding it incorrectly; implementor possibly not paying close attention to the specification: I agree that prefix should be 0x82
, not 0x83
. I'll double-check this with Jackson Smile format implementation to be sure.
If so, this is pretty unfortunate as that would make resulting content invalid and incompatible with other implementations.
I wish spec had an explicit node that actual length for ASCII Strings is 1 + length-indicator
, and for not-necessarily-ASCII 2 + length-indicator
, to make the discrepancy explicit and not just implicit from reading.
Yeah, that sounds quite bad. If PySmile interpreted the spec incorrectly, then chances are that other implementations could have made the same mistake.
I wish spec had an explicit node that actual length for ASCII Strings is 1 + length-indicator, and for not-necessarily-ASCII 2 + length-indicator, to make the discrepancy explicit and not just implicit from reading.
I'm happy to send a PR adding this later today.
Closing as per discussion: thank you for spec note updates!
Agree wrt PySmile: there really should be sample encoded content to test implementations against; that would help improve correctness, interoperability.
I'm running
pysmile
0.2 and I'm seeing Tiny Unicode strings being encoded slightly different to what the specification describes.Encoding a UTF-8 string consisting of 4 characters (i.e. the string
data
) is encoded as follows:The hexadecimal string
6461 7461
equalsdata
. The prefix is 0x83.The specification describes that 0x80 would encode a UTF-8 string with 1 character, however this is not allowed as a single-byte character string must be an ASCII string, and therefore the ASCII encoding is preferred. Therefore 0x81 describes a UTF-8 string with 2 bytes, 0x82 describes a UTF-8 string with 3 bytes, and of course 0x83 maps to a UTF-8 string with 4 bytes, which is what I get when encoding my string with
pysmile
.The problem is that following that logic, the LSB of the Tiny Unicode prefix equals the byte-length of the string - 1 (i.e. 0x83 =
10000011
, the 5-bit LSB is 3, so the byte length is 4). The upper bounds of the Tiny Unicode range is 0x9f =10011111
, where the 5-bit LSB is 31. This maps to a 32 byte UTF-8 string, so the largest UTF-8 string that the Tiny Unicode encoding can handle is NOT 33 bytes as described by the specification, but 32.As a consequence, the Short Unicode encoding starts from 33 and not from 34.
Signed-off-by: Juan Cruz Viotti jv@jviotti.com