deercreeklabs / lancaster

Apache Avro library for Clojure and ClojureScript

Default values without nullability seems sufficient for schema evolution #21

Closed bartkl closed 2 years ago

bartkl commented 2 years ago

Hey Chad,

Thanks again for pointing me towards good practices around schema evolution. I now understand why you make all fields optional by default and heavily discourage required fields. Having read up a little on this in the context of Avro, though, I do wonder whether it might be sufficient to only encourage default values, and not (also) make the type nullable. I might not have given this enough thought yet, but I seem to be reading suggestions all over the place that that's sufficient:

"In Avro and Protobuf, you can define fields with default values. In that case, adding or removing a field with a default value is a fully compatible change." (https://docs.confluent.io/platform/current/schema-registry/avro.html)

The Schema Resolution section of the spec also mentions only default values, never nullable types.

Finally, in these Oracle Docs the same seems to be the case.

Before I rack my brain over this, I thought I'd discuss the idea with you :).

In short: can your goal of "making everything optional" be reached solely by providing/generating default values, without also making the type nullable?

If this indeed does suffice, it allows for more expressive schemas and a clearer API in my opinion (since it can seem a little weird at first that the field type l/string-schema yields a ["null", "string"] type).

Thanks! Bart

chadharrington commented 2 years ago

Hi Bart, Thanks for your question. Due to time constraints, I will reply on Wednesday.

chadharrington commented 2 years ago

Hi Bart, You are correct that defaults are sufficient for schema evolution. However, I don't recommend having record fields be required (not optional via union with the null schema) for two reasons:

  1. It requires that the field always be present and non-nil in the data to be encoded. It's true that most of the time the data will be there, but, in my experience, as an application grows, there will be a few cases where a field is not needed or present at a particular point in the system. Creating a separate, nearly-identical schema without the missing field is a pain and causes duplication and lots of extra schemas.
  2. Let's say that a writer decides to stop sending a field, and encodes data using a compatible schema that does not include the field. The reader then decodes the data using the prior schema, which has the missing field (with a default). Now, the reader gets data with the field filled in using the default. How does the reader know whether the writer sent that data and it should be treated as meaningful, or whether it omitted the data and it should be ignored? Carefully choosing default values can help with this, but it is error prone. Getting nil is much easier to understand and handle correctly. (See the sketch after this list.)
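Here is a rough REPL sketch of that ambiguity. The schema names are made up for illustration, the printed result is approximate, and it assumes l/deserialize takes the reader schema, the writer schema, and the encoded bytes:

user> (l/def-record-schema person-v2-schema ; writer stopped sending :nickname
        [:age l/int-schema])
#'user/person-v2-schema
user> (l/def-record-schema person-v1-schema ; reader's older schema; :nickname is required
        [:nickname :required l/string-schema]
        [:age l/int-schema])
#'user/person-v1-schema
user> (l/deserialize person-v1-schema person-v2-schema
                     (l/serialize person-v2-schema {:age 12}))
{:nickname "", :age 12} ; the default "" fills in; was "" really sent, or was the field absent?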

It sounds like maybe you want to use lancaster / Avro to validate the data. As I mentioned in our prior discussion, I recommend keeping validation separate from encoding:

I think you want to keep the concepts of data encoding and data validation separate. Your data encoding layer should be flexible and support backward and forward compatibility / evolution at any point in time. Your data validation layer should be strict and validate what is and isn't acceptable to the code at that point in time. The data encoding layer will outlast the code and must be able to handle future changes. The data validation is, by definition, tied to a particular version of the code. It can (and should) handle things like adapting old data to new schemas, informing the user of errors, etc. These validation concerns should not be complected with the data encoding, which needs to be flexible in order to support the changes that will inevitably happen.
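As a rough illustration of that separation (the spec names and the decode-person helper here are hypothetical, not part of lancaster; it assumes l/deserialize takes reader schema, writer schema, and encoded bytes), the strict validation layer can simply run after decoding:

(require '[clojure.spec.alpha :as s]
         '[deercreeklabs.lancaster :as l])

;; Encoding layer: flexible, all fields optional (lancaster's default)
(l/def-record-schema person-schema
  [:name l/string-schema]
  [:age l/int-schema])

;; Validation layer: strict, tied to what *this* version of the code needs
(s/def ::name string?)
(s/def ::age nat-int?)
(s/def ::person (s/keys :req-un [::name ::age]))

(defn decode-person
  "Decode with the flexible reader schema, then validate against the
   current code's requirements."
  [writer-schema encoded]
  (let [person (l/deserialize person-schema writer-schema encoded)]
    (if (s/valid? ::person person)
      person
      (throw (ex-info "Invalid person data"
                      {:explain (s/explain-data ::person person)})))))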

If you really want to, you can make :required fields, as illustrated in this REPL session:

user> (require '[deercreeklabs.lancaster :as l])
nil
user> (l/def-record-schema flexible-schema
        [:name l/string-schema]
        [:age l/int-schema])
#'user/flexible-schema
user> (l/edn flexible-schema)
{:name :user/flexible,
 :type :record,
 :fields
 [{:name :name, :type [:null :string], :default nil}
  {:name :age, :type [:null :int], :default nil}]}
user> (l/def-record-schema rigid-schema
        [:name :required l/string-schema]
        [:age :required l/int-schema])
#'user/rigid-schema
user> (l/edn rigid-schema)
{:name :user/rigid,
 :type :record,
 :fields
 [{:name :name, :type :string, :default ""}
  {:name :age, :type :int, :default -1}]}
user> (l/serialize flexible-schema {:age 12})
[0, 2, 24]
user> (l/serialize rigid-schema {:age 12})
Execution error (ExceptionInfo) at deercreeklabs.lancaster.utils/throw-missing-key-error (utils.cljc:997).
Record data is missing key `:name`. Path: []. Data: {:age 12}.
user> 

To summarize: You can support schema evolution using defaults, but optional fields are much better for almost all cases.

bartkl commented 2 years ago

Hi Chad,

Once again I'm learning so much from you :). The book you recommended is right up there on my reading list by the way, but in the meantime I had to wrap my head around some of this stuff quicker.

Just a little background so you know what I'm struggling with: our project should generate Avro schemas (and other formats to come) from a SHACL model. During this transformation I need to balance preserving the information modeled in the (more expressive, information-dense) SHACL model against generating a useful Avro schema that is suited for real-world use. But as you rightly point out, losing some information (such as requiredness) isn't a bad thing here, and the validation should simply be done elsewhere. I also like your comment (in our previous discussion) about validation logic being tightly coupled to a specific version of the code. You had already convinced me there to separate validation from evolution; I just wasn't sure whether the optionality you so heavily encourage was necessary purely from the schema-evolution point of view. But indeed, the points you raised make a lot of sense.

I was aware of the :required keyword, and now feel convinced I can remove it. Before actually getting there I should probably first discuss this with the team, but at least you've taught me something ;).

Thanks!

chadharrington commented 2 years ago

If you are generating Lancaster schemas from another data source, it may be easiest to use lancaster's edn->schema fn. You could transform your external data into EDN, then turn the EDN into lancaster schemas:

user> (require '[deercreeklabs.lancaster :as l])
nil
user> (def my-edn-schema
        {:name ::my-record,
         :type :record,
         :fields
         [{:name :name, :type [:null :string], :default nil}
          {:name :age, :type [:null :int], :default nil}]})
#'user/my-edn-schema
user> (def my-lancaster-schema (l/edn->schema my-edn-schema))
#'user/my-lancaster-schema
user> (l/serialize my-lancaster-schema {:name "Chad" :age 10})
[2, 8, 67, 104, 97, 100, 2, 20]

This is the most natural way to define schemas from data. However, there is little error checking in this path; the user is expected to know the proper EDN structures to define the various Lancaster schemas. That's not really well documented, though you can always define a schema using def-record-schema, record-schema, etc., then call l/edn on it to see how the EDN is structured. The EDN structure is intended to match the Avro specification, but in EDN instead of JSON. If you are defining schemas this way, you can leave off the unions with :null (making things :required), provide your own :default values, etc. Again, doing this should be very deliberate and I don't recommend either of those things in the general case.
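For example, a hypothetical strict variant of the EDN schema above (again, not something I recommend in the general case; output shown for illustration) could leave off the :null unions and supply its own defaults:

user> (def my-strict-edn-schema
        {:name ::my-strict-record,
         :type :record,
         :fields
         [{:name :name, :type :string, :default "unknown"}
          {:name :age, :type :int, :default 0}]})
#'user/my-strict-edn-schema
user> (l/serialize (l/edn->schema my-strict-edn-schema) {:name "Chad" :age 10})
[8, 67, 104, 97, 100, 20]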

Let me know if you have further questions. You can always reach me at chad@deercreeklabs.com. I will close this issue for now.