Closed — bartkl closed this issue 2 years ago
Hi Bart, Thanks for your question. Due to time constraints, I will reply on Wednesday.
Hi Bart, You are correct that defaults are sufficient for schema evolution. However, I don't recommend making record fields required (i.e., not optional via a union with the null schema) for two reasons: required fields make the encoding less flexible in the face of schema evolution, and a missing value that simply decodes to `nil` is much easier to understand and handle correctly. It sounds like maybe you want to use lancaster / Avro to validate the data. As I mentioned in our prior discussion, I recommend keeping validation separate from encoding:
I think you want to keep the concepts of data encoding and data validation separate. Your data encoding layer should be flexible and support backward and forward compatibility / evolution at any point in time. Your data validation layer should be strict and validate what is and isn't acceptable to the code at that point in time. The data encoding layer will outlast the code and must be able to handle future changes. The data validation is, by definition, tied to a particular version of the code. It can (and should) handle things like adapting old data to new schemas, informing the user of errors, etc. These validation concerns should not be complected with the data encoding, which needs to be flexible in order to support the changes that will inevitably happen.
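That separation can be sketched in a few lines. This is a pure-Python sketch with hypothetical field names, not lancaster code: the permissive `decode` stands in for the encoding layer, and `validate_v2` for a validation layer tied to one version of the code.

```python
# Encoding/decoding layer: permissive. Every field is optional and
# simply comes back as None (lancaster's nil) when absent.
def decode(record):
    return {"name": record.get("name"), "age": record.get("age")}

# Validation layer: strict, and tied to THIS version of the code.
# It can change freely without touching the wire format.
def validate_v2(record):
    errors = []
    if record["name"] is None:
        errors.append("name is required in v2")
    if record["age"] is not None and record["age"] < 0:
        errors.append("age must be non-negative")
    return errors

print(validate_v2(decode({"age": 12})))  # ['name is required in v2']
```

The same bytes stay decodable forever; only the version-specific validator decides what is acceptable today.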
If you really want to, you can make fields `:required`, as illustrated in this REPL session:
user> (require '[deercreeklabs.lancaster :as l])
nil
user> (l/def-record-schema flexible-schema
        [:name l/string-schema]
        [:age l/int-schema])
#'user/flexible-schema
user> (l/edn flexible-schema)
{:name :user/flexible,
:type :record,
:fields
[{:name :name, :type [:null :string], :default nil}
{:name :age, :type [:null :int], :default nil}]}
user> (l/def-record-schema rigid-schema
        [:name :required l/string-schema]
        [:age :required l/int-schema])
#'user/rigid-schema
user> (l/edn rigid-schema)
{:name :user/rigid,
:type :record,
:fields
[{:name :name, :type :string, :default ""}
{:name :age, :type :int, :default -1}]}
user> (l/serialize flexible-schema {:age 12})
[0, 2, 24]
user> (l/serialize rigid-schema {:age 12})
Execution error (ExceptionInfo) at deercreeklabs.lancaster.utils/throw-missing-key-error (utils.cljc:997).
Record data is missing key `:name`. Path: []. Data: {:age 12}.
user>
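As an aside, the `[0, 2, 24]` output above is easy to account for from the Avro spec: a union writes its branch index as a zigzag-encoded varint, followed by the branch's payload. A rough pure-Python sketch (not lancaster code) reproduces those bytes for `{:age 12}`:

```python
def zigzag(n):
    # Avro maps signed longs to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, ...
    return (n << 1) ^ (n >> 63)

def varint(u):
    # Little-endian base-128 varint: 7 bits per byte, high bit = "more".
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_long(n):
    return varint(zigzag(n))

# Record {:age 12} against flexible-schema:
#   :name -> union branch 0 (:null), which has no payload
#   :age  -> union branch 1 (:int),  then zigzag(12) = 24
encoded = encode_long(0) + encode_long(1) + encode_long(12)
print(list(encoded))  # [0, 2, 24]
```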
To summarize: You can support schema evolution using defaults, but optional fields are much better for almost all cases.
Hi Chad,
Once again I'm learning so much from you :). The book you recommended is right up there on my reading list by the way, but in the meantime I had to wrap my head around some of this stuff quicker.
Just a little background so you know what I'm struggling with: our project generates Avro schemas (and others to come) from a SHACL model. During this transformation I need to balance preserving the information modeled in the (more expressive, information-dense) SHACL model against generating a useful Avro schema suited for real-world use. But as you rightfully point out, losing some information (such as requiredness) isn't a bad thing here; the validation should simply be done elsewhere. I also like your comment (in our previous discussion) about validation logic being tightly coupled to a specific version of the code. You already convinced me there to separate validation from evolution; I just wasn't sure whether the optionality you so heavily encourage was necessary purely from the schema-evolution point of view. But indeed, the points you raised make a lot of sense.
I was aware of the `:required` keyword, and now feel convinced I can remove it. Before actually getting there I should probably first discuss this with the team, but at least you've taught me something ;).
Thanks!
If you are generating Lancaster schemas from another data source, it may be easiest to use lancaster's `edn->schema` fn. You could transform your external data into EDN, then turn the EDN into lancaster schemas:
user> (require '[deercreeklabs.lancaster :as l])
nil
user> (def my-edn-schema
{:name ::my-record,
:type :record,
:fields
[{:name :name, :type [:null :string], :default nil}
{:name :age, :type [:null :int], :default nil}]})
#'user/my-edn-schema
user> (def my-lancaster-schema (l/edn->schema my-edn-schema))
#'user/my-lancaster-schema
user> (l/serialize my-lancaster-schema {:name "Chad" :age 10})
[2, 8, 67, 104, 97, 100, 2, 20]
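For what it's worth, those bytes can be read back by hand following the Avro spec: for each field, a union branch index, then the branch's payload. A pure-Python sketch (not lancaster code):

```python
def read_long(buf, i):
    # Read one Avro long at offset i: little-endian base-128 varint,
    # then zigzag-decode. Returns (value, next_offset).
    shift = u = 0
    while True:
        b = buf[i]
        i += 1
        u |= (b & 0x7F) << shift
        if not (b & 0x80):
            return (u >> 1) ^ -(u & 1), i
        shift += 7

data = bytes([2, 8, 67, 104, 97, 100, 2, 20])
branch, i = read_long(data, 0)     # 1 -> the :string branch of [:null :string]
length, i = read_long(data, i)     # 4 -> byte length of the string
name = data[i:i + length].decode("utf-8")
i += length
branch, i = read_long(data, i)     # 1 -> the :int branch of [:null :int]
age, i = read_long(data, i)        # 10
print(name, age)  # Chad 10
```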
This is the most natural way to define schemas from data. However, there is little error checking in this path; the user is expected to know the proper EDN structures to define the various Lancaster schemas. That's not really well documented, though you can always define a schema using `def-record-schema`, `record-schema`, etc., then call `l/edn` on it to see how the EDN is structured. The EDN structure is intended to match the Avro specification, but in EDN instead of JSON.
If you are defining schemas this way, you can leave off the unions with `:null` (making fields `:required`), provide your own `:default` values, etc. Again, doing this should be very deliberate, and I don't recommend either of those things in the general case.
Let me know if you have further questions. You can always reach me at chad@deercreeklabs.com. I will close this issue for now.
Hey Chad,
Thanks again for pointing me towards good practices for schema evolution. I now understand why you make all fields optional by default and heavily discourage required fields. After having read up a little about this in the context of Avro, though, I do wonder whether it is sufficient to only encourage default values, and not (also) make the type nullable. I might not have given this enough thought yet, but I seem to be reading suggestions all over the place that defaults alone are sufficient:
The Schema Resolution section of the spec also mentions only default values, and nowhere a nullable type.
Finally, in these Oracle Docs the same seems to be the case.
Before I break my head over this, I thought I'd discuss the idea with you :).
In short: can your goal of "making everything optional" be reached solely by providing/generating default values, without also making the type nullable?
If this indeed does suffice, it allows for more expressive schemas and a clearer API in my opinion (since it can seem a little weird at first that the field type `l/string-schema` yields a `["null", "string"]` type). Thanks! Bart