Fix Tiny Unicode and Short Unicode byte length ranges

jviotti commented 3 years ago

I'm running pysmile 0.2 and I'm seeing Tiny Unicode strings being encoded slightly different to what the specification describes.

Encoding a UTF-8 string consisting of 4 characters (i.e. the string data) is encoded as follows:

83 6461 7461

The hexadecimal string 6461 7461 equals data. The prefix is 0x83.

The specification describes that 0x80 would encode a UTF-8 string with 1 character, however this is not allowed as a single-byte character string must be an ASCII string, and therefore the ASCII encoding is preferred. Therefore 0x81 describes a UTF-8 string with 2 bytes, 0x82 describes a UTF-8 string with 3 bytes, and of course 0x83 maps to a UTF-8 string with 4 bytes, which is what I get when encoding my string with pysmile.

The problem is that following that logic, the LSB of the Tiny Unicode prefix equals the byte-length of the string - 1 (i.e. 0x83 = 10000011, the 5-bit LSB is 3, so the byte length is 4). The upper bounds of the Tiny Unicode range is 0x9f = 10011111, where the 5-bit LSB is 31. This maps to a 32 byte UTF-8 string, so the largest UTF-8 string that the Tiny Unicode encoding can handle is NOT 33 bytes as described by the specification, but 32.

As a consequence, the Short Unicode encoding starts from 33 and not from 34.

Signed-off-by: Juan Cruz Viotti jv@jviotti.com

jviotti commented 3 years ago

Alternatively, if the specification is correct, could it be that pysmile is doing things wrong?

cowtowncoder commented 3 years ago

Thank you for submitting this. I will need to re-read specification to remember exact details to answer the question.

jviotti commented 3 years ago

No hurries. Let me know if there is anything I can do to help.

cowtowncoder commented 3 years ago

@jviotti I think pysmile is encoding it incorrectly; implementor possibly not paying close attention to the specification: I agree that prefix should be 0x82, not 0x83. I'll double-check this with Jackson Smile format implementation to be sure. If so, this is pretty unfortunate as that would make resulting content invalid and incompatible with other implementations.

I wish spec had an explicit node that actual length for ASCII Strings is 1 + length-indicator, and for not-necessarily-ASCII 2 + length-indicator, to make the discrepancy explicit and not just implicit from reading.

jviotti commented 3 years ago

Yeah, that sounds quite bad. If PySmile interpreted the spec incorrectly, then chances are that other implementations could have made the same mistake.

I wish spec had an explicit node that actual length for ASCII Strings is 1 + length-indicator, and for not-necessarily-ASCII 2 + length-indicator, to make the discrepancy explicit and not just implicit from reading.

I'm happy to send a PR adding this later today.

jviotti commented 3 years ago

See https://github.com/FasterXML/smile-format-specification/pull/11

cowtowncoder commented 3 years ago

Closing as per discussion: thank you for spec note updates!

Agree wrt PySmile: there really should be sample encoded content to test implementations against; that would help improve correctness, interoperability.

FasterXML / smile-format-specification

Fix Tiny Unicode and Short Unicode byte length ranges #9