FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
310 stars 133 forks source link

Short NUL-only keys incorrectly detected as duplicates in CBOR and SMILE #312

Closed DaveCTurner closed 2 years ago

DaveCTurner commented 2 years ago

The following object apparently cannot be deserialised in CBOR and SMILE:

{"\u0000":"","\u0000\u0000":""}

The following are base64-encoded copies of its representation in various formats:

JSON: eyJcdTAwMDAiOiIiLCJcdTAwMDBcdTAwMDAiOiIifQ==
YAML: LS0tCiJcMCI6ICIiCiJcMFwwIjogIiIK
SMILE: OikKBfqAACCBAAAg+w==
CBOR: v2EAYGIAAGD/

As far as I can tell these all look correct, but on deserialization the field names are reported as duplicates:

com.fasterxml.jackson.core.JsonParseException: Duplicate field ''
 at [Source: (byte[])":)
�� � �"; line: -1, column: 11]

    at __randomizedtesting.SeedInfo.seed([29F6747C61080477:49491DCB06A6B558]:0)
    at com.fasterxml.jackson.core.json.JsonReadContext._checkDup(JsonReadContext.java:204)
    at com.fasterxml.jackson.core.json.JsonReadContext.setCurrentName(JsonReadContext.java:198)
    at com.fasterxml.jackson.dataformat.smile.SmileParser._handleFieldName(SmileParser.java:1346)
    at com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:374)
    at com.fasterxml.jackson.core.JsonGenerator._copyCurrentContents(JsonGenerator.java:1935)
    at com.fasterxml.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:1914)
    at org.elasticsearch.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:396)

Relates https://github.com/elastic/elasticsearch/issues/84146

DaveCTurner commented 2 years ago

No idea how one might fix this but here's a test for it at least (against 45e87802207a490809a2729740e0583d75b81ead):

diff --git a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
index 26bfd377..a31feba8 100644
--- a/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
+++ b/smile/src/test/java/com/fasterxml/jackson/dataformat/smile/parse/BasicParserTest.java
@@ -478,6 +478,30 @@ public class BasicParserTest extends BaseTestForSmile
         p.close();
     }

+    public void testNULInFieldNames() throws IOException
+    {
+        ByteArrayOutputStream bytes = new ByteArrayOutputStream(1000);
+        JsonGenerator jgen = new SmileFactory().createGenerator(bytes);
+        jgen.writeStartObject();
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            jgen.writeStringField(fieldName, "");
+        }
+        jgen.writeEndObject();
+        jgen.close();
+
+        SmileParser p = _smileParser(bytes.toByteArray());
+
+        assertToken(JsonToken.START_OBJECT, p.nextToken());
+        for (String fieldName = ""; fieldName.length() < 20; fieldName += "\0") {
+            assertToken(JsonToken.FIELD_NAME, p.nextToken());
+            assertEquals(fieldName, p.getCurrentName());
+            assertToken(JsonToken.VALUE_STRING, p.nextToken());
+            assertEquals("", p.getValueAsString());
+        }
+        assertToken(JsonToken.END_OBJECT, p.nextToken());
+        assertNull(p.nextToken());
+        p.close();
+    }

     public void testBufferRelease() throws IOException
     {
cowtowncoder commented 2 years ago

@DaveCTurner first of all, thank you for reporting this and especially contributing a test to reproduce!

I have a suspicion that the issue is with zero-padding of symbol table used; I vaguely recall something similar with JSON (with escaping, null bytes are legal in JSON names too). But since Smile and CBOR codecs are bit different (with more straight-forward name decoding due to lack of escaping) it is possible that handling has non working edge case.

But the gist of this is that symbol table lookups are keyed by "quads" (int containing up to 4 bytes from name; last quad possibly being incomplete). Special padding is used to avoid conflicts for case of "unused" bits at the end; this only occurs in case of raw 0 bytes. I forget the exact details of how this was handled in JSON but given time should be able to refresh my memory.

Now: I hope to look into this in near future, but I do have bit of a backlog. I just wanted to add a note saying above. :)

cowtowncoder commented 2 years ago

Ok, yes, I can reproduce this. Somehow looks like two-null-byte case gets decoded as single null byte, if I got that right.

cowtowncoder commented 2 years ago

Interestingly enough, JSON parser from jackson-core does NOT suffer from this. Could be due to escaping required, maybe decoding goes through different route.

Actually, hmmh. Yes, it goes through different decoder but ends up with similar look up, with a call to _padLastQuad; to fix:

https://github.com/FasterXML/jackson-core/issues/148

and looks like JSON backend has no mismatch. Need to see if something similar would help with CBOR, Smile.

cowtowncoder commented 2 years ago

Hmmmh. Ok... this is tricky. Code in Smile, CBOR codecs does not keep track of "unused" bytes in quads, unlike JSON codec. But we do need to distinguish between "unused" and 0 bytes since only latter is part of actual decoded value. Value of 0xFF (or -1 as signed) byte works because it is not a legal UTF-8 byte; that's what JSON codec uses.

Two ways to go about it, really:

  1. Fill last possibly incomplete quad with -1s when starting (so anything unshifted remains so)
  2. Pad at the end

JSON codec uses (2) for what that is worth.

DaveCTurner commented 2 years ago

:+1: that ties in with what I thought I was seeing yes. I'd completely forgotten that 0xFF wasn't a legal byte here, but yes that's just what we need. Option 1 seems simpler:

diff --git a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
index ea4fe91c..72560434 100644
--- a/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
+++ b/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileParser.java
@@ -1825,7 +1825,7 @@ versionBits));
         if (len < 5) {
             int inPtr = _inputPtr;
             final byte[] inBuf = _inputBuffer;
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {
@@ -1849,7 +1849,7 @@ versionBits));
         q1 =  (q1 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 9) {
-            int q2 = (inBuf[inPtr++] & 0xFF);
+            int q2 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 5;
             if (left > 0) {
                 q2 = (q2 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1871,7 +1871,7 @@ versionBits));
         q2 =  (q2 << 8) | (inBuf[inPtr++] & 0xFF);

         if (len < 13) {
-            int q3 = (inBuf[inPtr++] & 0xFF);
+            int q3 = (inBuf[inPtr++] & 0xFF) | 0xFFFFFF00;
             int left = len - 9;
             if (left > 0) {
                 q3 = (q3 << 8) + (inBuf[inPtr++] & 0xFF);
@@ -1920,7 +1920,7 @@ versionBits));
         } while ((len -= 4) > 3);
         // and then leftovers
         if (len > 0) {
-            int q = inBuf[inPtr] & 0xFF;
+            int q = inBuf[inPtr] & 0xFF | 0xFFFFFF00;
             if (len > 1) {
                 q = (q << 8) + (inBuf[++inPtr] & 0xFF);
                 if (len > 2) {

That fixes it for me, but I see other similar code that might also have the same problem.

cowtowncoder commented 2 years ago

@DaveCTurner ah. I was looking at CBOR end of things. There are indeed a few other code paths to consider... partly for different lengths, partly for boundary conditions. So I'll need to improve the tests I think.

I'll see if I can fix this but just for sake of safety, for 2.14 branch. I have a feeling that there's non-trivial chance for regression for either correctness or performance here.

cowtowncoder commented 2 years ago

Ok, fixed; yes, pre-padding is probably bit simpler although both work (and had to use post-padding for one case in Smile). Curious as to whether there's any measurable effect on decoding speed; probably worth checking 2.13 vs 2.14 decoding speed with jackson-benchmarks (https://github.com/FasterXML/jackson-benchmarks/)