Closed Sajjon closed 5 years ago
A colleague pointed out that this is a bug in Jackson and probably on this line
Interesting. I just returned from vacation and didn't have time to check the Jackson code, thanks for pointing this out.
But first let me point out that 0x7817 is indeed a well-formed way to start a 23-byte text string. It is just not what we would call the "preferred encoding", which is one byte shorter: 0x77. Encoders are free to choose non-preferred encodings, if that is expedient to them (unless disallowed by some other constraint such as the need for deterministic ["canonical"] encoding). So Jackson's behavior is innocuous, but may be confusing if Jackson otherwise tends to output preferred encoding. Fixing the behavior does not increase the cost of encoding or the code complexity, so that would be a good choice.
The complexity you see here in Jackson comes from needing to convert from Java's internal UTF-16-based text string encoding to UTF-8; Jackson attempts to minimize the number of copy operations needed. Jackson guesses that most text strings will be ASCII-only and allocates a single byte of head for text strings with 23 or less characters. I then checks whether the guess was right, but with the off-by-one error your colleague noted: It checks for less than 23 bytes (and not for 23 or less bytes). If the check fails, it goes to 0x78nn, which is what we therefore see for the 23-byte string as well.
Will you report the bug?
@cabo Thx for your long reply. Yes over the Easter weekend I realized that it is not an error on Jackson's part, but as you said, just not the canonical form.
I can report the bug in the Jackson repo and close this issue here.
Thx!
Jackson 2.9.9 will fix the issue (see https://github.com/FasterXML/jackson-dataformats-binary/issues/159) with lengths 23 and 255 bytes, respectively.
Hello! We at Radix are using CBOR. I just found strange behavior in encoding and decoding of this UTF string:
"radix.particles.message"
The Cbor playground cbor.me encodes that into the byte string
Which also both SwiftCBOR and node-cbor do.
But Java Jacksson Cbor encodes it to:
Notice the difference, the prefix
77
versus7817
.The interesting part is that both Cbor.me and SwiftCBOR decodes
781772616469782e7061727469636c65732e6d657373616765
into the same UTF string:"radix.particles.message"
.In other words, two different byte strings decode into the same UTF string.
Is that correct?
According to the RFC 7049, the major type is 3, i.e.
0b011
followed by the length of that string, which is 23 is decimal, which results in0b10111
in binary, and01110111
in hex is77
.So where does
7817
come from? And why does it correctly decode into the same UTF string?