Closed DaveCTurner closed 2 years ago
No idea how one might fix this but here's a test for it at least (against 45e87802207a490809a2729740e0583d75b81ead):
diff --git a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
index 26bfd377..a31feba8 100644
--- a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
+++ b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
@@ -478,6 +478,30 @@ public class BasicParserTest extends BaseTestForSmile
p.close();
}
+ public void testNULInFieldNames() throws IOException
+ {
+ ByteArrayOutputStream bytes = new ByteArrayOutputStream(1000);
+ JsonGenerator jgen = new SmileFactory().createGenerator(bytes);
+ jgen.writeStartObject();
+ for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+ jgen.writeStringField(fieldName, "");
+ }
+ jgen.writeEndObject();
+ jgen.close();
+
+ SmileParser p = _smileParser(bytes.toByteArray());
+
+ assertToken(JsonToken.START_OBJECT, p.nextToken());
+ for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+ assertToken(JsonToken.FIELD_NAME, p.nextToken());
+ assertEquals(fieldName, p.getCurrentName());
+ assertToken(JsonToken.VALUE_STRING, p.nextToken());
+ assertEquals("", p.getValueAsString());
+ }
+ assertToken(JsonToken.END_OBJECT, p.nextToken());
+ assertNull(p.nextToken());
+ p.close();
+ }
public void testBufferRelease() throws IOException
{
@DaveCTurner first of all, thank you for reporting this and especially contributing a test to reproduce!
I have a suspicion that the issue is with zero-padding of symbol table used; I vaguely recall something similar with JSON (with escaping, null bytes are legal in JSON names too). But since Smile and CBOR codecs are bit different (with more straight-forward name decoding due to lack of escaping) it is possible that handling has non working edge case.
But the gist of this is that symbol table lookups are keyed by "quads" (int
containing up to 4 bytes from name; last quad possibly being incomplete). Special padding is used to avoid conflicts for case of "unused" bits at the end; this only occurs in case of raw 0
bytes.
I forget the exact details of how this was handled in JSON but given time should be able to refresh my memory.
Now: I hope to look into this in near future, but I do have bit of a backlog. I just wanted to add a note saying above. :)
Ok, yes, I can reproduce this. Somehow looks like two-null-byte case gets decoded as single null byte, if I got that right.
Interestingly enough, JSON parser from jackson-core
does NOT suffer from this. Could be due to escaping required, maybe decoding goes through different route.
Actually, hmmh. Yes, it goes through different decoder but ends up with similar look up, with a call to _padLastQuad
; to fix:
https://github.com/FasterXML/jackson-core/issues/148
and looks like JSON backend has no mismatch. Need to see if something similar would help with CBOR, Smile.
Hmmmh. Ok... this is tricky. Code in Smile, CBOR codecs does not keep track of "unused" bytes in quads, unlike JSON codec. But we do need to distinguish between "unused" and 0
bytes since only latter is part of actual decoded value.
Value of 0xFF
(or -1
as signed) byte works because it is not a legal UTF-8 byte; that's what JSON codec uses.
Two ways to go about it, really:
JSON codec uses (2) for what that is worth.
:+1: that ties in with what I thought I was seeing yes. I'd completely forgotten that 0xFF
wasn't a legal byte here, but yes that's just what we need. Option 1 seems simpler:
diff --git a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
index ea4fe91c..72560434 100644
--- a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
+++ b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
@@ -1825,7 +1825,7 @@ versionBits));
if (len < 5) {
int inPtr = _inputPtr;
final byte[] inBuf = _inputBuffer;
- int q = inBuf[inPtr] & 0xFF;
+ int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
if (len > 1) {
q = (q << 8) + (inBuf[++inPtr] & 0xFF);
if (len > 2) {
@@ -1849,7 +1849,7 @@ versionBits));
q1 = (q1 << 8) | (inBuf[inPtr++] & 0xFF);
if (len < 9) {
- int q2 = (inBuf[inPtr++] & 0xFF);
+ int q2 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
int left = len - 5;
if (left > 0) {
q2 = (q2 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1871,7 +1871,7 @@ versionBits));
q2 = (q2 << 8) | (inBuf[inPtr++] & 0xFF);
if (len < 13) {
- int q3 = (inBuf[inPtr++] & 0xFF);
+ int q3 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
int left = len - 9;
if (left > 0) {
q3 = (q3 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1920,7 +1920,7 @@ versionBits));
} while ((len -= 4) > 3);
// and then leftovers
if (len > 0) {
- int q = inBuf[inPtr] & 0xFF;
+ int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
if (len > 1) {
q = (q << 8) + (inBuf[++inPtr] & 0xFF);
if (len > 2) {
That fixes it for me, but I see other similar code that might also have the same problem.
@DaveCTurner ah. I was looking at CBOR end of things. There are indeed a few other code paths to consider... partly for different lengths, partly for boundary conditions. So I'll need to improve the tests I think.
I'll see if I can fix this but just for sake of safety, for 2.14 branch. I have a feeling that there's non-trivial chance for regression for either correctness or performance here.
Ok, fixed; yes, pre-padding is probably bit simpler although both work (and had to use post-padding for one case in Smile).
Curious as to whether there's any measurable effect on decoding speed; probably worth checking 2.13 vs 2.14 decoding speed with jackson-benchmarks
(https://github.com/FasterXML/jackson-benchmarks/)
The following object apparently cannot be deserialised in CBOR and SMILE:
The following are base64-encoded copies of its representation in various formats:
As far as I can tell these all look correct, but on deserialization the field names are reported as duplicates:
Relates https://github.com/elastic/elasticsearch/issues/84146