Converting field name with namespace to json and back

danskarda commented 1 year ago

Hello, thank you for implementation of Avro library for Clojure.

Disclaimer: I am new to Avro and maybe have wrong understanding of how it should work. I played with Lancaster with intention to make small proof of concept and found some odd behavior, when working with namespaces.

Expected behavior: when I write schema with json and read it back using json->schema I get identical schema.

What I got instead: In following example I got different schema where the last namespace part of the field name was removed and converted to part of the name (:sandbox.avro.lancaster/id -> :sandbox.avro/lancaster-id).

(ns sandbox.avro.lancaster
  (:require [cheshire.core :as json]
            [deercreeklabs.lancaster :as avro :refer [def-record-schema]]
            [deercreeklabs.lancaster.schemas :as schema]))

(def-record-schema example
  [::id :required avro/long-schema])

;; schema has field ::id

(def example-derived (-> example
                         avro/json
                         avro/json->schema))

;; schema has field :sandbox.avro/lancaster-id

(->> (avro/serialize example {::id 1})
     (avro/deserialize example-derived example))

;; => #:sendbox.avro{:lancaster-id -1}

chadharrington commented 1 year ago

Hi Dan, Thank you for filing this issue. I appreciate the clear example. I will dig into this over the weekend and should have a fix by Monday.

chadharrington commented 1 year ago

Dan, The issue is that I missed (or forgot) an important detail in the Avro Spec (https://avro.apache.org/docs/1.11.1/specification/#names):

Record fields and enum symbols have names as well (but no namespace).

At some point, I added the ability to have namespaces on record fields and enums. This is in violation of the spec. I am in the process of removing that support and adding appropriate error-checking code. This is, unfortunately, a breaking change. However, Lancaster is intended to faithfully implement the spec, and this feels like the right choice.

I plan to have a release ready tomorrow.

danskarda commented 1 year ago

Hi Chad, thank for investigation of the issue and for fast feedback.

I also read the spec incorrectly, even examples :D I would call it Clojure induced blindness. At this moment I take Clojure namespaced keywords for granted, that I wanted AVRO to support them. And I missed the sentence in the spec and read the example bellow in an incorrect way.

There are namespaces in the example in AVRO spec (bellow text you have quoted). BUT these are namespaces of types of fields, not namespaces of fields :(

I fully agree that Lancaster should faithfully implement the spec. However the question I would like to ask is how to support the "Clojure way" - Clojure records with namespaced keys (eg as used clojure.spec, Datomic attributes, Pathom etc). By support I mean that I can serialize and deserialize Clojure records with fully qualified keywords without further pre-processing or post-processing. This could be achieved both in Lancaster or some thin layer on top of it

Would this proposal make sense for you?

allow to define Lancaster AVRO record fields with namespaces
export schema to json When exporting schema to json, remove namespaces from field names to stay compliant. Signal error when there are two fields with same name but different namespace.
serialize clojure data to json Would be the same, however it would allow to pass clojure records with fully qualified keywords
deserialize data to clojure records Deserialization requires write schema and read schema. Writer schema can be defined by Lancaster or read from json. When reader schema is defined using Lancaster, we can match names without namespaces and write records with namespace from reader schema. Again if there is ambiguity, you can raise an error.

This approach has both pros and cons:

Lancaster stays compatible with AVRO schema (when exporting to JSON). However the fact that it allows fully qualified keywords in namespace could be confusing.
Lancaster would support Clojure data structures with namespaced keywords
Deserialization name->namespace/name would be fast providing there is a deserialization cache in Lancaster schema

What do you think? Does this conversion and enhanced support for Clojure belong to Lancaster core or some helper namespace?

chadharrington commented 1 year ago

@danskarda I will get back to this over the weekend. Thanks.

chadharrington commented 1 year ago

Hi Dan, Thanks for taking the time to make a thoughtful proposal. I have considered it as well as other ways to map namespaced keywords onto and from non-namespaced Avro record keys. While such mappings could be made, ultimately, they feel very application dependent. There are many ways to perform this mapping (drop the namespace, encode the namespace into the name, etc.), but there are pros and cons to each and no single "best" way to do it.

Thus, I am going to keep Lancaster focused on its core mission: serializing and deserializing Clojure data that matches an Avro schema. Translating between Clojure data that doesn't match the Avro schema and the Avro-compliant form can be done in whatever way makes the most sense for the application.

It's important that Lancaster schemas map 1:1 to the Avro schemas, as we have systems in use that translate between them frequently. We need to be able to make a Lancaster schema, send the JSON version of that schema to a remote peer, then have the remote peer be able to create the same Lancaster schema from the JSON. Thus, I am not going to add things to the Lancaster schemas that can't be represented in the Avro-compliant JSON schema.

In the future, if a common use case arises for multiple users, perhaps we can add a helper namespace that includes some translation functions. At the moment, I feel that there is not a consensus on what those functions should do, so we'll wait on that.

The upcoming version of Lancaster will throw proper errors if one tries to create a schema with namespaced record keys or namespaced enum symbols, both of which are prohibited by the Avro spec.

Thanks again for bringing this issue to my attention; my misreading of the spec led to this issue. I will close this issue when the updated Lancaster version is released.

Best regards, Chad

chadharrington commented 1 year ago

@danskarda With the release of v0.10.0, I am closing this issue now. If you have additional ideas, etc., feel free to open a new issues. Best wishes, Chad

danskarda commented 1 year ago

Hi @chadharrington, thank you for considering the issue.

Maybe this can be resolved with a section in a documentation, wiki page or blog-post. How to use Lancaster to build Clojure systems and document some common patterns you have used in your applications and systems. For example

how do you use Lancaster to define a schema (using functions or reading json)

Json files can be shared with other applications (or in schema registry). On other hand Clojure data structures can be extended with some hints to improve serialization / deserialization.
in your applications do you use data as defined by Lancaster schema or you convert them to internal format?

Eg other application defined schema with record with UserGUID but you convert it to ::user/id because that's what other parts of your clojure system is using.
if you use internal format, do you use some library or use hand-written functions?

There are some schema coercion libraries, but afair they manipulate types, not keyword names (eg they convert string to number or number to java.util.Date, but do not convert :name to :user/name).
how do you coerce data to and from local types (eg java.util.Date to and back from logicalType time-milllis etc)
how do you check the schema? You use clojure.spec (requires namespaced keywords)?
how to integrate with the rest of the Clojure word (eg Datomic, spec etc)

Some choices for integration were already made (like conversion from camel case to kebab case). Many can be solved using pre-processing or post-processing (namespaces, type conversions). Some has to be solved with hooks in (de)serialization (eg serialization of union types in #27).

Maybe pre-processing / post-processing would benefit from walk-like function which would traverse data with schema in similar fashion to clojure.walk/prewalk or postwalk.

Best, Dan

deercreeklabs / lancaster

Converting field name with namespace to json and back #25