Open laymain opened 6 years ago
Hi,
Spark-avro fails to write a record that contains a map of array of record (attached: schema.json and event.json). The input Avro file was generated with avro-tools:

java -jar avro-tools-1.8.2.jar fromjson --schema-file schema.json event.json > event.avro
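A minimal sketch of the Spark write path involved (the file paths, class name, and app name here are assumptions, not the original code):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class WriteRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-avro-map-of-array")
                .master("local[*]")
                .getOrCreate();
        // Round-trip the avro-tools file through spark-avro; the write goes
        // through spark-avro's schema conversion, which turns every field
        // into a union with null (see #92).
        Dataset<Row> dataset = spark.read()
                .format("com.databricks.spark.avro")
                .load("event.avro");
        dataset.write()
                .mode(SaveMode.Overwrite)
                .format("com.databricks.spark.avro")
                .save("event-out");
        spark.stop();
    }
}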
Due to #92, the generated schema for the Spark Row has all of its fields nullable, which gives the following output schema:
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [{
    "name": "properties",
    "type": [{
      "type": "map",
      "values": [{
        "type": "array",
        "items": [{
          "type": "record",
          "name": "properties",
          "fields": [{
            "name": "string",
            "type": ["string", "null"]
          }]
        }, "null"]
      }, "null"]
    }, "null"]
  }]
}
When I try to use this generated schema with the JSON event to generate a new Avro file:

java -jar avro-tools-1.8.2.jar fromjson --schema-file schema-generated.json event.json > event-generated.avro

I get the following error:
Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch object
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:445)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
    at org.apache.avro.tool.Main.run(Main.java:87)
    at org.apache.avro.tool.Main.main(Main.java:76)
Is the generated schema invalid or is it a bug in Avro?
It is linked to the nullable issue (#92): Avro's JSON encoding wraps each non-null union value in a single-key object naming the chosen branch, so once every field is a union with null, the expected input data for the generated schema becomes:
{
  "properties": {
    "map": {
      "object": {
        "array": [
          {"properties": {"string": {"string": "one"}}},
          {"properties": {"string": {"string": "two"}}}
        ]
      }
    }
  }
}
instead of the initial input:
{
  "properties": {
    "object": [
      {"string": "one"},
      {"string": "two"}
    ]
  }
}
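The same behaviour can be checked outside of avro-tools. A minimal sketch (Avro 1.8.x API; "event-wrapped.json" is my name for the branch-wrapped event above saved to a file) that decodes it against the generated schema:

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.JsonDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("schema-generated.json"));
        // The event must use Avro's JSON encoding: every non-null union value
        // is wrapped as {"<branch>": value}, e.g. {"string": {"string": "one"}}.
        String json = new String(
                Files.readAllBytes(Paths.get("event-wrapped.json")),
                StandardCharsets.UTF_8);
        JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
        GenericRecord record =
                new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(record); // succeeds with the wrapped form, fails with the plain one
    }
}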
I found a workaround: use the original schema to generate the right StructType with SchemaConverters and create a new DataFrame with that StructType:

import java.net.URL;
import org.apache.avro.Schema;
import com.databricks.spark.avro.SchemaConverters;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

public class Main {
    public static void main(String[] args) throws Exception {
        // Build the StructType from the original (non-nullable) Avro schema.
        Schema schema = new Schema.Parser()
                .parse(Main.class.getClassLoader().getResourceAsStream("schema.json"));
        DataType dataType = SchemaConverters.toSqlType(schema).dataType();
        StructType structType = (StructType) dataType;

        URL avroResource = Main.class.getClassLoader().getResource("event.avro");
        if (avroResource == null) {
            throw new RuntimeException("Missing resource event.avro");
        }

        SparkSession sparkSession = SparkSession.builder()
                .appName("com.laymain.sandbox.avro")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> dataset = sparkSession
                .read()
                .format("com.databricks.spark.avro")
                .load(avroResource.getPath());

        // Re-create the DataFrame with the correct schema before writing:
        // the rows are unchanged, only the nullability metadata differs.
        dataset
                .sqlContext()
                .createDataFrame(dataset.rdd(), structType)
                .write()
                .mode(SaveMode.Overwrite)
                .format("com.databricks.spark.avro")
                .save("output");
    }
}
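As a quick sanity check (my addition, not part of the original workaround), the converted StructType can be printed at the end of main to confirm the fields are no longer forced nullable:

// Prints the schema tree; with the converted StructType the inner fields
// show "nullable = false" where the round-tripped schema had unions with null.
System.out.println(structType.treeString());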