Different Byte strings decodes into same UTF string, is that correct?

Sajjon commented 5 years ago

Hello! We at Radix are using CBOR. I just found strange behavior in encoding and decoding of this UTF string: "radix.particles.message"

The Cbor playground cbor.me encodes that into the byte string

77
72616469782e7061727469636c65732e6d657373616765

Which also both SwiftCBOR and node-cbor do.

But Java Jacksson Cbor encodes it to:

7817
72616469782e7061727469636c65732e6d657373616765

Notice the difference, the prefix 77 versus 7817.

The interesting part is that both Cbor.me and SwiftCBOR decodes 781772616469782e7061727469636c65732e6d657373616765 into the same UTF string: "radix.particles.message".

In other words, two different byte strings decode into the same UTF string.

Is that correct?

According to the RFC 7049, the major type is 3, i.e. 0b011 followed by the length of that string, which is 23 is decimal, which results in 0b10111 in binary, and 01110111 in hex is 77.

So where does 7817 come from? And why does it correctly decode into the same UTF string?

Sajjon commented 5 years ago

A colleague pointed out that this is a bug in Jackson and probably on this line

cabo commented 5 years ago

Interesting. I just returned from vacation and didn't have time to check the Jackson code, thanks for pointing this out.

But first let me point out that 0x7817 is indeed a well-formed way to start a 23-byte text string. It is just not what we would call the "preferred encoding", which is one byte shorter: 0x77. Encoders are free to choose non-preferred encodings, if that is expedient to them (unless disallowed by some other constraint such as the need for deterministic ["canonical"] encoding). So Jackson's behavior is innocuous, but may be confusing if Jackson otherwise tends to output preferred encoding. Fixing the behavior does not increase the cost of encoding or the code complexity, so that would be a good choice.

The complexity you see here in Jackson comes from needing to convert from Java's internal UTF-16-based text string encoding to UTF-8; Jackson attempts to minimize the number of copy operations needed. Jackson guesses that most text strings will be ASCII-only and allocates a single byte of head for text strings with 23 or less characters. I then checks whether the guess was right, but with the off-by-one error your colleague noted: It checks for less than 23 bytes (and not for 23 or less bytes). If the check fails, it goes to 0x78nn, which is what we therefore see for the 23-byte string as well.

Will you report the bug?

Sajjon commented 5 years ago

@cabo Thx for your long reply. Yes over the Easter weekend I realized that it is not an error on Jackson's part, but as you said, just not the canonical form.

I can report the bug in the Jackson repo and close this issue here.

Thx!

cowtowncoder commented 5 years ago

Jackson 2.9.9 will fix the issue (see https://github.com/FasterXML/jackson-dataformats-binary/issues/159) with lengths 23 and 255 bytes, respectively.

cbor / cbor.github.io

Different Byte strings decodes into same UTF string, is that correct? #49