deercreeklabs / lancaster

Apache Avro library for Clojure and ClojureScript

How to create required fields without default value #20

Closed bartkl closed 2 years ago

bartkl commented 2 years ago

Hello,

When I create a record schema and set a field as :required and provide no default values, upon serialization Lancaster creates (sensible) default values for me.

Is it possible to not have it do that? We'd like to use the generated JSON schema in applications and with these generated default values, we don't get the validation we need.

Lancaster is a breeze, thanks! Bart

chadharrington commented 2 years ago

Hi Bart,

Thanks for taking the time to open this issue. I think I need to understand your use case better. Once I do, I can recommend a course of action and possibly make changes to how Lancaster handles that situation.

In the normal case, when you don't add :required to a record field, Lancaster makes that field optional by making its type be a union of :null and the declared schema. This is almost always what you want, as it allows you to evolve the schema in the future.
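To illustrate the difference at the Avro level, here is a minimal sketch in Python that builds the two field shapes as plain Avro JSON. The record name `Person`, the `id` field, and the helper names are hypothetical, and the JSON that Lancaster actually generates may differ in detail; the sketch only assumes the standard Avro convention that an optional field is a union with `null` listed first and a `null` default.

```python
import json


def optional_field(name, avro_type):
    # Optional field: union of null and the declared type, defaulting to null.
    return {"name": name, "type": ["null", avro_type], "default": None}


def required_field(name, avro_type):
    # Required field: the bare declared type, with no union and no default.
    return {"name": name, "type": avro_type}


# Hypothetical record used only for illustration.
person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        optional_field("name", "string"),
        required_field("id", "string"),
    ],
}

print(json.dumps(person_schema, indent=2))
```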

For example, you start with a person-schema record with a :name field. Later, you decide to change the record to have separate :given-name and :family-name fields instead of a single :name field. If you have marked the :name field as required, you will have to continue to include it forever along with the new fields. You could remove the :name field from the schema, but that is not recommended, because removing the field means that if you read old data which has that field, Lancaster will skip that field and its contents will be lost. In general, you want all fields to be optional and you never want to remove any fields from your schema once they have been used. You should read in any old/deprecated fields and convert the old format to the new format in your application. Then you can store the data in the new format. If you want, you can mark the field as deprecated by adding a docstring to the field in the schema.
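The read-old-then-convert step described above could look roughly like the following Python sketch. It works on an already decoded record represented as a dict; the function name `upgrade_person` and the naive whitespace split of the old `name` value are assumptions made for illustration, not something Lancaster provides.

```python
def upgrade_person(record):
    """Convert a decoded person record to the new given/family-name shape.

    If the new fields are absent, fall back to the deprecated `name` field.
    """
    given = record.get("given-name")
    family = record.get("family-name")
    if given is None and family is None:
        # Assumption: the old name is "<given> <family>"; real data may need
        # a smarter rule than splitting on the first run of whitespace.
        parts = (record.get("name") or "").split(None, 1)
        if parts:
            given = parts[0]
            family = parts[1] if len(parts) > 1 else None
    return {"given-name": given, "family-name": family}


old_record = {"name": "Ada Lovelace", "given-name": None, "family-name": None}
print(upgrade_person(old_record))
```

Because all three fields stay in the schema as optional, old data still decodes, and the application decides how to map it into the new format before storing it again.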

So, in general, I recommend not using the :required attribute unless you really know why you are doing so. It does save one byte per field in the encoding, but that's generally not worth the tradeoff.
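The one-byte cost comes from Avro's binary encoding of unions: the value is prefixed with the index of the chosen branch, written as a zigzag varint. The following Python sketch hand-rolls just enough of that encoding to show the difference for a string field. The function names are made up for this example, and it is not how Lancaster itself is implemented.

```python
def encode_long(n):
    # Avro long: zigzag-encode, then emit 7 bits per byte, low bits first.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    # Avro string: byte length as a long, followed by the UTF-8 bytes.
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_optional_string(s):
    # Union ["null", "string"]: branch index as a long, then the value.
    if s is None:
        return encode_long(0)
    return encode_long(1) + encode_string(s)


print(len(encode_string("Ada")))           # required field
print(len(encode_optional_string("Ada")))  # optional field, one byte longer
```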

Thanks and best wishes,

Chad

bartkl commented 2 years ago

Hi Chad,

Questioning whether we really want to make these fields required is a great approach. I'm going to discuss this with my colleague who's responsible for this ASAP. For now, I'll try to provide more context myself.

We generate Avro schemas from incoming data models (expressed in RDF documents), which developer teams will be using to create Kafka messages in their applications. From what I understand, some fields cannot be left out for the message to be meaningful. To follow your example: a person-schema with no name of any kind might be said to lack essential information. In those cases, we'd like to generate a validation error. Would you still discourage requiring a field like that? I am interested in your take.

Thanks! Bart

chadharrington commented 2 years ago

Hi Bart, I appreciate you taking the time to explain your use case better. I think you want to keep the concepts of data encoding and data validation separate. Your data encoding layer should be flexible and support backward and forward compatibility / evolution at any point in time. Your data validation layer should be strict and validate what is and isn't acceptable to the code at that point in time. The data encoding layer will outlast the code and must be able to handle future changes. The data validation is, by definition, tied to a particular version of the code. It can (and should) handle things like adapting old data to new schemas, informing the user of errors, etc. These validation concerns should not be complected with the data encoding, which needs to be flexible in order to support the changes that will inevitably happen.

I suggest you not make any fields required in the data encoding layer (Lancaster / Avro). To continue the example, a person record with no name of any kind should be caught and handled appropriately by the data validation layer. Depending on the situation, the validation layer may want to signal an error, fill in a default, derive the data from other fields in the record, fetch the missing data from an external source, etc. None of that should be encoded in the data itself, especially since requiring fields in the encoding prevents you from ever changing either the name or type of that information.
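As a rough sketch of such a separation, the Python below keeps every field optional in the decoded record and puts the "a person must have some name" rule in a separate validation function. The function name, the field names, and the error message are illustrative assumptions; a real validation layer might instead fill in a default or fetch the missing data.

```python
def validate_person(record):
    """Return a list of validation errors for a decoded person record.

    An empty list means the record is acceptable to this version of the code.
    """
    errors = []
    name_fields = ("name", "given-name", "family-name")
    # The encoding allows every name field to be null; the rule that at
    # least one must be present lives here, not in the schema.
    if not any(record.get(field) for field in name_fields):
        errors.append("person record has no name of any kind")
    return errors


print(validate_person({"given-name": "Ada", "family-name": None}))
print(validate_person({"name": None, "given-name": None, "family-name": None}))
```

The rule can be tightened, relaxed, or replaced in a later release without touching any data that has already been written.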

Rich Hickey's talk "Maybe Not" explores these ideas well: https://www.youtube.com/watch?v=YR5WdGrpoug

Another Rich talk, "Spec-ulation", also teaches these concepts in the context of API/library naming, which is closely related: https://www.youtube.com/watch?v=oyLBGkS5ICk

Finally, Chapter 4 of Designing Data-Intensive Applications covers this topic as well: https://dataintensive.net/

I hope this response is useful to you. I'd be happy to help you think through your architecture or related design decisions. You can reach me directly at chad@deercreeklabs.com.

I am going to close this issue for now. Best regards, Chad

bartkl commented 2 years ago

I meant to write a response but apparently completely forgot. You've given me some deep stuff to think about and watch/read. Thanks for the offer of help too, I highly appreciate it. Soon it will be decided whether Clojure is the language of choice in our team, but the company is not going to make it easy. Let's hope for the best ;).

Thanks a lot!