FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0

[avro] Invalid Union index (-40); union only has 2 types #123

Open vicenteg opened 6 years ago

vicenteg commented 6 years ago

Unsure if I'm doing something wrong here. I want to deserialize Avro to a Json string.

I've boiled my issue down to the following:

  public static void main(String[] args) {
    String inputFile = "test.avro";
    MappingIterator<JsonNode> it = null;

    try {
      Schema jsonSchema =
          new Schema.Parser().setValidate(true).parse(new File(inputFile + ".schema"));
      AvroSchema schema = new AvroSchema(jsonSchema);

      AvroMapper avroMapper = new AvroMapper();
      avroMapper.schemaFrom(new File(inputFile + ".schema"));
      it = avroMapper.readerFor(JsonNode.class).with(schema).readValues(new FileInputStream(inputFile));
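    } catch (IOException ex) {
      System.err.println("Could not open " + inputFile + " : " + ex.getMessage());
      return; // nothing to iterate if the input could not be opened
    }

    while (it.hasNext()) {
      JsonNode row = it.next();
      System.out.println(row);
    }
  }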

I get an exception:

Exception in thread "main" java.lang.RuntimeException: Invalid Union index (-40); union only has 2 types
        at test.AvroReadToJsonNode.main(
Caused by: Invalid Union index (-40); union only has 2 types
        at com.fasterxml.jackson.dataformat.avro.deser.ScalarDecoder$ScalarUnionDecoder$FR._checkIndex(
        at com.fasterxml.jackson.dataformat.avro.deser.ScalarDecoder$ScalarUnionDecoder$FR.readValue(
        at com.fasterxml.jackson.dataformat.avro.deser.RecordReader$Std.nextToken(
        at com.fasterxml.jackson.dataformat.avro.deser.AvroParserImpl.nextToken(
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(
        at com.fasterxml.jackson.databind.MappingIterator.nextValue(

The schema looks like this:

{
  "type" : "record",
  "name" : "test",
  "namespace" : "test.test.avro",
  "doc" : "",
  "fields" : [ {
    "name" : "some_string",
    "type" : [ "null", "string"]
  } ]
}

And I generated data from the schema using avrotools:

avrotools random --schema-file test.avro.schema --count 100 test.avro

cowtowncoder commented 6 years ago

And this is with which Jackson version?

vicenteg commented 6 years ago


cowtowncoder commented 6 years ago

OK. So the reproduction is almost complete; the one missing piece is the encoded input file. I think that is needed, as presumably this module would not write such content itself.

I am guessing this might be due to one unfortunate design decision by the Avro authors, however: the format is different when stored in a file compared to when encoded for transmission. If so, the file will start with a marker followed by the schema as JSON. Given the lack of any metadata in the raw encoding, the difference is not possible to reliably auto-detect, and it seems strange to require codecs to be aware of the input source. At the moment this module does not have special handling for this prefix, although I think there is an issue requesting an implementation.

It should be relatively easy to check whether the input might be of this form. The Avro specification outlines what the header looks like:

I think this is one of the badly designed parts of the specification, and I wonder what the authors were smoking. But it is what it is.
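
For illustration, the check mentioned above could be sketched roughly as follows. It relies on the Avro specification's statement that an Object Container File begins with the four bytes 'O', 'b', 'j', 1. The class and method names (AvroContainerSniffer, hasContainerMagic, looksLikeContainerFile) are hypothetical and not part of this module, and the check is only a heuristic: raw-encoded content could in principle begin with the same bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of jackson-dataformat-avro.
class AvroContainerSniffer {
    // Magic bytes that open an Avro Object Container File: 'O', 'b', 'j', 1
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the given bytes start with the container-file magic. */
    static boolean hasContainerMagic(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Peeks at the first four bytes of a mark-capable stream and rewinds it,
     * so the caller can still read the content from the beginning.
     */
    static boolean looksLikeContainerFile(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int n = in.readNBytes(head, 0, head.length); // Java 9+
            return n == MAGIC.length && hasContainerMagic(head);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // Start of a container file: magic followed by (truncated) metadata.
        InputStream container = new ByteArrayInputStream(new byte[] {'O', 'b', 'j', 1, 4});
        // Raw encoding of a ["null", "string"] union holding "foo":
        // union index 1, string length 3, then the UTF-8 bytes.
        InputStream raw = new ByteArrayInputStream(new byte[] {2, 6, 'f', 'o', 'o'});

        System.out.println("container: " + looksLikeContainerFile(container));
        System.out.println("raw:       " + looksLikeContainerFile(raw));
    }
}
```

To use this with a FileInputStream, it would need to be wrapped in a BufferedInputStream first, since mark/reset is required.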

vicenteg commented 6 years ago

For the encoded input file, you can use avrotools random to generate some data. I used a command line like the following:

avrotools random --schema-file test.avro.schema --count 100 test.avro

Here's a link to a sample file:

cowtowncoder commented 6 years ago

Yes, that does start with the Obj signature indicating an Object Container File; the signature is followed by the JSON-encoded embedded schema.
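
As a side note, the Obj signature appears to account for the exact number in the error message. Avro writes a union's branch index as a variable-length zig-zag encoded integer, and the first byte of the signature, 'O' (0x4F), decodes to -40 when read that way. Below is a minimal sketch of that arithmetic; the class name ZigZagDemo and its method are hypothetical, and this is not the decoder the module actually uses.

```java
// Hypothetical demo class, not the module's decoder.
class ZigZagDemo {
    /**
     * Decodes one variable-length zig-zag integer from the start of buf,
     * following the int/long encoding described in the Avro specification.
     */
    static long readZigZagLong(byte[] buf) {
        long raw = 0;
        int shift = 0;
        for (byte b : buf) {
            raw |= (long) (b & 0x7F) << shift; // low 7 bits carry data
            if ((b & 0x80) == 0) {             // high bit clear: last byte
                break;
            }
            shift += 7;
        }
        // Undo zig-zag: even values map to n/2, odd values to -(n+1)/2.
        return (raw >>> 1) ^ -(raw & 1);
    }

    public static void main(String[] args) {
        // First bytes of an Object Container File.
        byte[] fileStart = {'O', 'b', 'j', 1};
        // 'O' is 79; (79 >>> 1) ^ -1 == 39 ^ -1 == -40
        System.out.println(readZigZagLong(fileStart)); // prints -40
    }
}
```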

So as things are, Object Container files are not supported, only raw encoded content. Issue #8 is about adding support for handling this case (both reading and writing).