elodina / go-avro

Apache Avro for Golang
http://elodina.github.io/go-avro/
Apache License 2.0

[Question] binary without schema embedded #87

Open MarcMagnin opened 7 years ago

MarcMagnin commented 7 years ago

Hi,

I was wondering if it is possible to encode to binary without embedding the schema in the message?

Many thanks, Marc

samv commented 7 years ago

FWIW, the Avro spec does not specify a format for this. Typical approaches are: write segments of ten or so rows with the embedded schema at the front (github.com/linkedin/goavro); write a "schema ID" at the start of each emitted row (Bottled Water does this, and presumably the schema ID is the 128-bit MD5 specified in the RPC handshake section of the spec); or just write out JSON rows (e.g. the Confluent Kafka REST proxy and at least one database-to-Kafka CDC tool I looked at).

This is because, unlike Thrift, Protocol Buffers, etc., Avro's binary format is not forward compatible. This "saves space" on larger files and "forces everyone to implement the schema protocol" or something like that.

crast commented 7 years ago

The Avro object container file format includes the schema, so that a reader can still parse the file in the future, even if the schema has changed since.

As a matter of course, though, the schema is not technically necessary as long as the receiving/reading end knows the schema of what it is getting. The schema could be hard-coded or agreed upon in advance by the communicating ends, or communicated some other way than an object container file, such as sending the MD5 hash of the schema before the record (which is what the Avro RPC protocol does, for example). How you implement that is not covered by a formal part of the Avro spec.

Important note: Avro binary serialization is not inherently forwards or backwards compatible unless the reader knows the exact schema the record was encoded with. Any change, including adding fields, adding defaults, adding new options to a type union, or even adding entries to an enum, produces a new and different schema, and a reader that does not know it is dealing with a different schema is likely to fail.

samv commented 7 years ago

I don't dispute any of that. However, I should issue a correction I've discovered: the Confluent platform has invented its own Avro binary format for an efficient binary representation of a single row. I thought it was writing JSON, but it appears I read the Java sources wrong. The row format consists of a null byte, a 32-bit schema ID, and then the binary data column by column. I'm not sure how the 32-bit schema ID is generated; it's nothing canonical (and might be an identifier allocated by the Kafka Schema Registry).
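That framing is simple enough to sketch with the standard library alone. The schema ID used below (42) is an arbitrary stand-in for whatever identifier the sender and receiver agree on, assumed here to be carried big-endian after the magic byte:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeConfluent frames an Avro binary payload in the wire format
// described above: one magic byte (0), a 4-byte big-endian schema ID,
// then the raw Avro data.
func encodeConfluent(schemaID uint32, payload []byte) []byte {
	out := make([]byte, 5+len(payload))
	out[0] = 0 // magic byte
	binary.BigEndian.PutUint32(out[1:5], schemaID)
	copy(out[5:], payload)
	return out
}

// decodeConfluent recovers the schema ID and payload from a framed message.
func decodeConfluent(msg []byte) (uint32, []byte, error) {
	if len(msg) < 5 || msg[0] != 0 {
		return 0, nil, errors.New("not a framed message")
	}
	return binary.BigEndian.Uint32(msg[1:5]), msg[5:], nil
}

func main() {
	// 0x08 is the zig-zag length prefix for the 4-byte string "Marc".
	msg := encodeConfluent(42, []byte{0x08, 'M', 'a', 'r', 'c'})
	id, payload, err := decodeConfluent(msg)
	fmt.Println(id, string(payload[1:]), err)
	// prints: 42 Marc <nil>
}
```

Five bytes of overhead per row, at the cost of requiring every reader to resolve the ID to a schema out of band.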

crast commented 7 years ago

Side note: I was trying to avoid advertising, but this project has not responded to PRs for a year now, and when I contacted the original maintainer last year he said he is no longer able to access the elodina project. So I am going to mention that I've forked this project here:

https://github.com/go-avro/avro#about-this-fork

The new Go import path is gopkg.in/avro.v0