FundingCircle / jackdaw

A Clojure library for the Apache Kafka distributed streaming platform.
https://fundingcircle.github.io/jackdaw/
BSD 3-Clause "New" or "Revised" License

Allow Avro records to be configured to be closed/open #212

Closed james-doolan closed 4 years ago

james-doolan commented 4 years ago

The Avro specification states: "if the writer's record contains a field with a name not present in the reader's record, the writer's value for that field is ignored." However, the Avro serde throws an exception in this case. This PR addresses that by allowing the Avro record type to be configured as open or closed (closed is the default behaviour).


cddr commented 4 years ago

Hey James,

Thanks for the PR. However, I think what that bit of the spec is actually talking about is what to do when the reader schema differs from the writer schema. So it is saying that when a writer schema was able to write some field that the reader schema doesn't know about, the written field is discarded by the reader.

The following section in the "Encoding and Evolution" chapter of Kleppmann's DDIA explains it pretty well...

With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about—for example, that schema may be compiled into the application. This is known as the writer’s schema.

When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader’s schema. That is the schema the application code is relying on—code may have been generated from that schema during the application’s build process.

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification [20] defines exactly how this resolution works, and it is illustrated in Figure 4-6.

For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.

Martin Kleppmann, Designing Data-Intensive Applications
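To make the resolution rules in that passage concrete, here is a toy model of them using plain Clojure maps. It is only a sketch, not the Avro library or jackdaw's serde: `reader-fields`, `resolve-record`, and the way the reader's schema is represented are all hypothetical names invented for illustration.

```clojure
;; Toy model of the record resolution rules quoted above.
;; NOT the Avro library: `reader-fields` and `resolve-record` are
;; hypothetical, and a real schema is far richer than this map.

(def reader-fields
  "Reader's schema, modelled as field name -> options. A field with a
  :default can be filled in when the writer did not supply it."
  {:id    {}
   :email {:default "unknown"}})

(defn resolve-record
  "Translates a record written with the writer's schema into the shape
  the reader expects. Only fields named in the reader's schema are kept."
  [reader-fields written]
  (reduce-kv
   (fn [acc field opts]
     (cond
       ;; The writer supplied the field: take the written value.
       (contains? written field) (assoc acc field (get written field))
       ;; The writer did not supply it: use the reader's default.
       (contains? opts :default) (assoc acc field (:default opts))
       ;; No value and no default: the schemas are not compatible.
       :else (throw (ex-info "No value and no default for field"
                             {:field field}))))
   {}
   reader-fields))

;; The writer knew about :nickname, which the reader does not, and did
;; not know about :email, which the reader does.
(def resolved
  (resolve-record reader-fields {:id 1 :nickname "jd"}))
```

Here `resolved` contains `:id` from the written data and `:email` from the reader's default, while `:nickname` is dropped. The point is that the discarding happens on the read side, between two schemas, rather than when a map is serialized.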

An alternative approach is to call select-keys on the map you want to write, so that it contains no unexpected fields when it reaches the serde.
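A minimal sketch of that suggestion, using only `clojure.core`. The field list and the `closed-record` helper are hypothetical; in practice the keys would have to match whatever fields your own Avro schema declares.

```clojure
;; Hypothetical list of the fields declared in the schema being written.
(def loan-fields [:id :amount])

(defn closed-record
  "Returns m restricted to the fields the schema knows about, so that
  stray keys never reach the serializer."
  [m]
  (select-keys m loan-fields))

;; :note is not part of the schema, so it is dropped before writing.
(def to-write
  (closed-record {:id 42 :amount 1000 :note "extra"}))
```

The value of `to-write` is the map with only `:id` and `:amount`, which can then be handed to the producer as usual. This keeps the serde's closed behaviour and makes the caller explicit about which fields are written.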