FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0

Invalid Avro file produced using SequenceWriter #339

Open willsoto opened 2 years ago

willsoto commented 2 years ago

While documentation on writing Avro to a file is sparse, I have managed to piece some stuff together but I am still getting an error.

Here is some sample code:

final var avroFactory = AvroFactory.builderWithApacheDecoder().enable(AvroGenerator.Feature.AVRO_FILE_OUTPUT).build();

final var generator = new AvroSchemaGenerator().enableLogicalTypes();

final var mapper = AvroMapper.builder(avroFactory).addModule(new AvroJavaTimeModule()).build();
mapper.acceptJsonFormatVisitor(Thing.class, generator);

final var avroSchema = generator.getGeneratedSchema();

final var file = Files.createTempFile("something", ".avro").toFile();

final var out = new ByteArrayOutputStream();
final var writer = mapper.writer(avroSchema).writeValues(out);

// in a loop
writer.write(thing);

// after loop
writer.close();

try (FileOutputStream outputStream = new FileOutputStream(file)) {
  out.writeTo(outputStream);
}

When checking the resultant file using avro-tools, I get the following error:

avro-tools tojson something.avro

22/09/08 18:36:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:224)
    at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:97)
    at org.apache.avro.tool.Main.run(Main.java:67)
    at org.apache.avro.tool.Main.main(Main.java:56)
Caused by: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:319)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:213)
    ... 3 more

According to some searching, the Invalid sync! error occurs when the file hasn't been stitched together properly, but it's unclear to me what I need to do in code to get that to happen. I've looked through most of the Avro tests in this repo and I cannot find one that actually writes to a file and then de-serializes from that file.

I am not sure if I have stumbled into an actual bug here or not, but I am happy to try and write a test case if this code does seem correct since that would imply it's a bug?

Thanks in advance.

Edit:

I've also tried the following:

final var file = Files.createTempFile("something", ".avro").toFile();
final SequenceWriter writer = mapper.writer(avroSchema).writeValues(file);

In which case I get the following error at that line:

java.lang.UnsupportedOperationException: Generator of type com.fasterxml.jackson.core.json.UTF8JsonGenerator does not support schema of type 'avro'

    at com.fasterxml.jackson.core.JsonGenerator.setSchema(JsonGenerator.java:592)
    at com.fasterxml.jackson.databind.ObjectWriter$GeneratorSettings.initialize(ObjectWriter.java:1393)
    at com.fasterxml.jackson.databind.ObjectWriter._configureGenerator(ObjectWriter.java:1258)
    at com.fasterxml.jackson.databind.ObjectWriter.createGenerator(ObjectWriter.java:717)
    at com.fasterxml.jackson.databind.ObjectWriter.writeValues(ObjectWriter.java:753)

cowtowncoder commented 2 years ago

I think the problem may be an Avro oddity: encoding data as a File requires a header that is otherwise not used (or allowed) at all. It would be good to support the "File" variant, and there may already be an issue filed for it, but no work has been done. It's a bit tricky wrt API since Jackson has no concept of separating by input/output source type (the idea of a different encoding for File seems specifically peculiar and ... well, a bad idea, IMO).

willsoto commented 2 years ago

Ah okay...given everything I found I thought this was well supported - especially because of this particular bit AvroGenerator.Feature.AVRO_FILE_OUTPUT.

That particular feature is documented in JavaDoc and I found this as well: https://github.com/FasterXML/jackson-dataformats-binary/blob/169d2fbd4ec9f9f3d0aa155823e7c51de29237f6/avro/src/main/java/com/fasterxml/jackson/dataformat/avro/ser/RootContext.java#L107-L118

cowtowncoder commented 2 years ago

@willsoto Hmmh. I had actually forgotten about this being implemented. But had I read your example in detail, it would have been there.

I assume you have also tried disabling that to see what difference it makes? Is there a matching reader (deserialization side) setting to go with it? Apologies for asking questions I should know the answer to, but I figured you have been investigating this and have good context.

willsoto commented 2 years ago

No worries! Appreciate you taking the time to help me out 😄

I assume you have also tried disabling that to see what difference it makes?

If I understand the question, I initially just tried the examples pretty much copy+pasted from the documentation so I didn't even know there was this AvroGenerator.Feature.AVRO_FILE_OUTPUT setting. It took quite a bit of searching to stumble upon it. In terms of example code, if you just remove the AvroFactory stuff, that is what I was trying initially.

Is there matching reader (de-serialization side) setting to go with it?

Not sure honestly. The way I've been testing is writing the file and then attempting to open it with avro-tools to prove it's valid and de-serializable.

cowtowncoder commented 2 years ago

Ok that makes sense.

Adding example files into a (new) unit test would be nice too. One challenge wrt Avro though is that without the file header it has zero metadata to detect valid data. This is unlike almost every other format; even protobuf has type tags etc. for some level of self-descriptiveness.
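
For anyone who wants a quick sanity check before reaching for avro-tools: as far as the Avro specification describes it, an object container file starts with the four magic bytes 'O', 'b', 'j', 1, followed by the file metadata and a 16-byte sync marker. The sketch below only tests for that magic prefix using the JDK; the class and method names are hypothetical and not part of Jackson or Avro, and passing this check does not prove the rest of the file (blocks and sync markers) is valid.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Jackson or Apache Avro.
final class AvroHeaderCheck {
    // Magic bytes that open an Avro object container file: 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // True if the given bytes begin with the container-file magic.
    static boolean hasContainerMagic(byte[] data) {
        return data.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    // Reads only the first four bytes of the file and checks them.
    static boolean hasContainerMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasContainerMagic(in.readNBytes(MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AvroHeaderCheck <file>");
            return;
        }
        Path file = Path.of(args[0]);
        System.out.println(file + " starts with Avro container magic: "
                + hasContainerMagic(file));
    }
}
```

If the check returns false for a file written with AVRO_FILE_OUTPUT enabled, the header was never written. If it returns true and avro-tools still reports Invalid sync!, the problem is more likely in the blocks that follow the header.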

willsoto commented 2 years ago

I'll try and add a test case this weekend.

Does the code I provided at least seem like it should work? I am curious if we can minimize the reproduction even further.

cowtowncoder commented 2 years ago

Oh. The part that possibly (likely?) will not work is the use of writeValues() (and SequenceWriter it creates) -- I suspect you cannot simply append root-level values in Avro, unlike in some other formats. So you may need to instead create a container (List) with matching root-level Avro type to describe the full type. But then again... Avro is designed for data streams so I am not 100% sure (it has been a while since I worked actively on this format module).