FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0

Avro does not respect default values defined in schema #416

Open basimons opened 11 months ago

basimons commented 11 months ago

Hello,

I encountered something strange while doing some tests with Avro decoding.

Here is an example, run with version 2.16.0:


String avroWithDefault = """
    {
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "name", "type": ["string", "null"], "default": "bram"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee", "null"]}
      ]}
    """;

// Notice: no "name" field
String employeeJson = """
{
    "age" : 26,
    "emails" : ["test@test.com"],
    "boss" : {
         "name" : "test",
         "age" : 33,
         "emails" : ["test@test.com"]
    }
}
""";

AvroMapper avroMapper = new AvroMapper();
AvroSchema schema = avroMapper.schemaFrom(avroWithDefault);
JsonNode jsonObject = new ObjectMapper().readTree(employeeJson);
byte[] objectAsBytes = avroMapper.writer().with(schema).writeValueAsBytes(jsonObject);

// Decode it again
JsonNode decodedObject = avroMapper.reader(schema).readTree(objectAsBytes);

System.out.println(decodedObject.toString());

If you look at the decoded object, you see that the default value is not filled in: it is just null, while all the other fields are filled as expected. I also tried schemas where the field was not a union with null but just had the default; that resulted in a JsonMappingException.

Am I doing something wrong here, or is this not supported? The documentation doesn't say that default values are unsupported, the way the Protobuf backend's documentation does.

Thanks in advance

EDIT: It makes sense that this does not work, since you cannot write an Avro record without a value for a field, even if that field has a default; I think it should have thrown an error on writing. But the main question is why it doesn't work with a reader schema that has a default for a field the writer schema does not have. See my follow-up below.

cowtowncoder commented 11 months ago

I think this is not supported, at least with Jackson's native Avro read implementation. The Apache Avro-lib-backed variant, while slower, might handle default values correctly.

As to how to enable the Apache Avro lib backend, I think there are unit tests that show how.

I agree, it'd be good to document this gap.

basimons commented 11 months ago

Thanks for your response.

I tried looking for a unit test, but I couldn't find one. I did, however, find ApacheAvroParserImpl. I used it like this:

try (AvroParser parser = new ApacheAvroFactory(new AvroMapper()).createParser(payload)) {
    parser.setSchema(schema);

    TreeNode treeNode = parser.readValueAsTree();
    System.out.println(treeNode);
}

Unfortunately it does not work (as in: no default values are filled in). Am I doing this correctly, or should I use a different codec?

basimons commented 11 months ago

I made some changes, since the code in my first message does not fully make sense: you cannot omit writing a value, even if the field has a default. So I changed it to this:

String writingSchema = """
    {
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee", "null"]}
      ]}
    """;

String readingSchema = """
    {
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "name", "type": ["string", "null"], "default": "bram"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
        {"name": "boss", "type": ["Employee", "null"]}
      ]}
    """;

String employeeJson = """
    {
      "age": 26,
      "emails": ["test@test.com", "test@test.com"],
      "boss": {
        "age": 33,
        "emails": ["test@test.blockbax.com"]
      }
    }
    """;

When I do this and read the values back, I get the following exception: java.io.IOException: Invalid Union index (26); union only has 2 types. (Presumably the decoder, using only the reader schema, expects the "name" union index as the first value and instead reads the encoded age, 26.) This is the same as reported in https://github.com/FasterXML/jackson-dataformats-binary/issues/164

cowtowncoder commented 11 months ago

The only other note I have is that this:

new ApacheAvroFactory(new AvroMapper())

is the wrong way around: it should be

new AvroMapper(new ApacheAvroFactory())

to get the correct linking; then you should be able to create an ObjectReader / ObjectWriter through which you can assign the schema.

But I suspect that won't change things much: either way you have an ApacheAvroFactory that uses the Apache Avro lib.
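
For what it's worth, a minimal sketch of that corrected wiring (assuming the readingSchema string and the Avro-encoded payload bytes from the snippets above):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import com.fasterxml.jackson.dataformat.avro.apacheimpl.ApacheAvroFactory;

// Mapper backed by the Apache Avro codec rather than Jackson's native decoder
AvroMapper mapper = new AvroMapper(new ApacheAvroFactory());
AvroSchema schema = mapper.schemaFrom(readingSchema);

// Assign the schema through an ObjectReader, as described above
JsonNode decoded = mapper.reader(schema).readTree(payload);
System.out.println(decoded);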

basimons commented 11 months ago

Ah, thanks, I didn't know that. I tried it, but as you said, it did indeed not work.

What's weird is that I even tried decoding it with the Apache Avro library directly. I just used GenericDatumReader (and everything that comes with it), but I got exactly the same error. That does not make sense, right? I'm sure that what I'm doing is allowed by Avro (adding a field with a default to a reader schema when it is not in the writer schema), as I have done it many times in my Kafka cluster.

Do you happen to know what the difference might be? Do my Kafka clients do anything special here?

basimons commented 11 months ago

I finally get it. In a Kafka cluster, the writer schema is stored along with the message. If you parse it like this:

// Parse the writer schema (the one the payload was actually encoded with)
Schema writerSchema = new Schema.Parser().parse(writingSchema);
Schema readerSchema = ((AvroSchema) schema).getAvroSchema();
GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(writerSchema, readerSchema);

BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(payload, null);
GenericRecord read = datumReader.read(null, binaryDecoder);

So, with the specific writer schema, it does work. Normally Kafka does this for you, but I don't think AvroMapper has a way to do it with 2 schemas.

cowtowncoder commented 11 months ago

@basimons The Avro module does indeed allow a 2-schema (reader/writer) configuration -- it's been a while, so I'll have to see how it was done. I think AvroMapper has methods to construct a Jackson AvroSchema from 2 separate schemas.

cowtowncoder commented 11 months ago

Ah. Close: AvroSchema has the method withReaderSchema(AvroSchema rs). You get both schema instances, then call the method on the "writer schema" (the one used for writing records). From ArrayEvolutionTest:

final AvroSchema srcSchema = MAPPER.schemaFrom(SCHEMA_XY_ARRAY_JSON);
final AvroSchema dstSchema = MAPPER.schemaFrom(SCHEMA_XYZ_ARRAY_JSON);
final AvroSchema xlate = srcSchema.withReaderSchema(dstSchema);

and then you construct ObjectReader as usual.
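
To tie this back to the schemas earlier in the thread, a hedged end-to-end sketch (assuming the writingSchema / readingSchema strings and the Avro-encoded payload bytes from the comments above):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;

AvroMapper mapper = new AvroMapper();
AvroSchema writerSchema = mapper.schemaFrom(writingSchema); // schema the bytes were written with
AvroSchema readerSchema = mapper.schemaFrom(readingSchema); // schema carrying the "name" default
AvroSchema xlate = writerSchema.withReaderSchema(readerSchema);

JsonNode decoded = mapper.reader(xlate).readTree(payload);
System.out.println(decoded); // "name" should now be filled with the default "bram"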