Closed: danskarda closed this issue 1 year ago.
I'm bumping into the same issue.
For simplicity's sake, here's a simpler example:
(def a (l/record-schema :A [[:x l/int-schema]
                            [:y l/string-schema]]))
(def b (l/record-schema :B [[:x l/int-schema]
                            [:z l/string-schema]]))
(def u (l/union-schema [a b])) ;; =>
; Execution error (ExceptionInfo) at deercreeklabs.lancaster.utils/throw-overlapping-type-key (utils.cljc:874).
; Ambiguous union. Type key `:x` is shared by two schemas.
(Renaming the shared key :x of course circumvents the issue.)
Like @danskarda, I'm very curious to learn why this isn't possible in Lancaster. The Avro spec doesn't seem to prohibit it, and from what I understand from @danskarda, other implementations (including the official one) don't seem to either.
Thanks.
Hi @danskarda and @bartkl, The ambiguous schema limitation is due to a decision I made several years ago. When mapping Avro to and from Clojure data, I chose to represent Avro records as Clojure maps. Avro records have names, but Clojure maps do not. This means that when serializing, Lancaster can't always know which record is intended. I chose to make such situations illegal (the aforementioned ambiguous union exception). While this avoids the problem, it doesn't meet the goal of being compatible with any valid Avro schema.
I'd like to change things to allow such schemas. My proposal is:
1. When serializing data for an ambiguous union, the caller must indicate the intended record by including a :lancaster/type key in the map.
2. When deserializing an ambiguous union, Lancaster will add the :lancaster/type key in the map. There will also be an option to suppress this, if desired.
Let me know your thoughts. If you agree, I will plan to implement this sometime in the next month.
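For illustration only, a rough sketch of how the proposed flow might look with the schemas from the example above (the :lancaster/type key is the name used in the proposal; the final API may differ):
(require '[deercreeklabs.lancaster :as l])

(def a (l/record-schema :A [[:x l/int-schema]
                            [:y l/string-schema]]))
(def b (l/record-schema :B [[:x l/int-schema]
                            [:z l/string-schema]]))
(def u (l/union-schema [a b]))  ; would no longer throw under the proposal

;; Serializing: the caller names the intended record.
(l/serialize u {:lancaster/type :A, :x 1, :y "hello"})

;; Deserializing: Lancaster would add the key back (unless suppressed), e.g.
;; => {:lancaster/type :A, :x 1, :y "hello"}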
That sounds good to me.
Thanks!
Hi Chad, I have several ideas from different (and somewhat conflicting) viewpoints.
Serialize and deserialize
Would you also extend deserialize to include ::lancaster/type in unions?
It would be a nice property if, after you deserialize and then serialize again, you get the same Avro message. I.e.
(->> bytes (lancaster/deserialize schema schema) (lancaster/serialize schema))
would return the same message.
If I take this one step further: why doesn't deserialize return ::lancaster/type for all maps, always? Sometimes this could be handy. Because ::lancaster/type is namespaced, consumers can safely ignore it.
Data or meta-data
Does ::lancaster/type belong in data or metadata? Using metadata instead of a map keyword puts more work on producers and consumers. On the other hand, you can also attach metadata to vectors and support unions of arrays, or attach metadata everywhere.
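For illustration, the two placements look roughly like this (a sketch only, not an existing Lancaster API; the tag values are made up):
(require '[deercreeklabs.lancaster :as lancaster])  ; so ::lancaster/type resolves

;; Tag in the data itself: visible to ordinary map operations and printing.
(def tagged-in-data
  {::lancaster/type :A, :x 1, :y "hello"})

;; Tag in metadata: the value itself stays untouched...
(def tagged-in-meta
  (with-meta {:x 1, :y "hello"} {::lancaster/type :A}))

;; ...and metadata can also be attached to vectors, so unions of arrays could be tagged too.
(def tagged-vector
  (with-meta [1 2 3] {::lancaster/type :ints}))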
Data vs callbacks / handlers
When you require producers to include ::lancaster/type, you put more implementation work on their side. For shallow data structures this can be fine; however, traversing complex nested data structures and placing ::lancaster/type where appropriate would be tedious. Some helping hand would be nice.
::lancaster/type is not only a keyword, it is also a function. So you can look at the type dispatcher as a function whose default value is ::lancaster/type (or (comp ::lancaster/type meta) if you decide to use metadata instead of a keyword).
The question is how you pass this function:
a. it can be a constant (i.e. ::lancaster/type)
b. you can use a dynamic var, in the spirit of Clojure's *data-readers* or *default-data-reader-fn*
c. it can be passed as an additional parameter / option to the data readers
I find the proposed solution a. too limiting (it forces me to traverse and prepare the data). Both b and c are fine; maybe I prefer an additional argument over a default calling context and a dynamic var.
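A sketch of the three options in code (the :union-type-fn option at the end is hypothetical, not an existing Lancaster parameter):
(require '[deercreeklabs.lancaster :as lancaster])

;; A user-supplied dispatcher: given a value of an ambiguous union, return
;; the record name it should be serialized as.
(defn decode-union-types [value]
  (cond
    (contains? value :y) :A
    (contains? value :z) :B))

;; a. a constant: the keyword itself is the lookup function, so the data
;;    must already carry the tag.
(def option-a ::lancaster/type)

;; b. a dynamic var in the spirit of *data-readers* / *default-data-reader-fn*.
(def ^:dynamic *union-type-fn* ::lancaster/type)

;; c. an explicit argument / option to serialize (hypothetical signature):
;; (lancaster/serialize schema data {:union-type-fn decode-union-types})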
High-level musing I. - complecting vs untangling
In the previous question I parametrized serialization with a callback / hook. Did I just mix two things together? The first step of data preprocessing / interpretation (i.e. which data is which union type) and the second step of serialization.
To express it in code:
Mixed variant: (->> (extract-data db) (serialize schema decode-union-types))
Separate variant: (->> (extract-data db) interpret-union-types add-union-types (serialize schema))
where decode-union-types is a function which returns the Avro type for ambivalent union values and add-union-types is a transformation which enhances the data with ::lancaster/type (or some other metadata).
When I decide for simplicity and keep things separate, I can provide some utilities to make add-union-types easy. To help users write add-union-types, I can provide a helper function in the spirit of clojure.walk/prewalk, something like (lancaster/prewalk f schema data), which would walk both the data and the schema and help me add ::lancaster/type where I need it deep in nested data structures.
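A rough sketch of such a helper built on clojure.walk; unlike the schema-aware (lancaster/prewalk f schema data) idea above, this version just takes a user-supplied tagging function:
(require '[clojure.walk :as walk]
         '[deercreeklabs.lancaster :as lancaster])

(defn add-union-types
  "Walk data and assoc ::lancaster/type onto every map for which tag-fn
  returns a record name; tag-fn returning nil leaves a map untouched."
  [tag-fn data]
  (walk/postwalk
   (fn [x]
     (if-let [t (and (map? x) (tag-fn x))]
       (assoc x ::lancaster/type t)
       x))
   data))

;; Usage sketch with hypothetical record names :A and :B:
(add-union-types
 (fn [m] (cond (contains? m :y) :A
               (contains? m :z) :B))
 {:payload [{:x 1 :y "a"} {:x 2 :z "b"}]})
;; => {:payload [{:x 1, :y "a", ::lancaster/type :A}
;;               {:x 2, :z "b", ::lancaster/type :B}]}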
High-level musing II. - clojure.spec unions
clojure.spec also defines a union type (s/or) and is able to validate it and conform it (transform it into a less ambivalent structure). Maybe if Lancaster were more closely tied to clojure.spec, it could take advantage of that implementation and serialize using conformed data, and union types would come for free. The question is at what price (a dependency on clojure.spec, speed, etc.).
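For reference, this is what a conformed union looks like with plain clojure.spec today (not a Lancaster feature; the specs below just mirror the :A / :B records from the example at the top of the thread):
(require '[clojure.spec.alpha :as s])

(s/def ::x int?)
(s/def ::y string?)
(s/def ::z string?)
(s/def ::record-a (s/keys :req-un [::x ::y]))
(s/def ::record-b (s/keys :req-un [::x ::z]))
(s/def ::a-or-b (s/or :A ::record-a :B ::record-b))

(s/conform ::a-or-b {:x 1 :y "hello"})
;; => [:A {:x 1, :y "hello"}]  ; the union branch tag comes back "for free"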
Sorry, no definitive answer :D
To summarize, the options I see are:
- the cheap solution: require ::lancaster/type in the data, and users will pay the price with additional preprocessing
- generalize ::lancaster/type to a function (and choose how to configure it)
- decide whether ::lancaster/type belongs in data or metadata
Best, Dan
@danskarda Thank you for the thoughtful commentary. I appreciate it. I have been busy wrapping up a product release for work. I will get back to Lancaster next week.
Hey @chadharrington,
Is it possible for you to give a status update?
Thanks
@bartkl My apologies. I have been focused on some work priorities and have not dealt with this issue. I plan to implement the "cheap solution" as Dan describes it above. I will complete it by April 1.
No problem Chad. Although I do hope that's not an April Fool's joke ;) haha.
@bartkl I have not forgotten about this :-) The implementation is mostly done and I will wrap it up in the next few days. Thanks for your patience.
Absolutely no problem, Chad. Thanks!
@bartkl @danskarda This has been added to the 0.11.0 release of Lancaster. You can read about the ambiguous schema support here: https://github.com/deercreeklabs/lancaster#the-lancasterrecord-name-attribute
Let me know if this fails to meet your needs for some reason. Best wishes, Chad
I have a simple schema with a union of records (pretty-printed with cheshire):
I can read the schema using the Apache Avro Java library. Using GenericData I can serialize and deserialize data.
When records have distinct field names I can read and write data with Lancaster. When field names overlap, I get the "Ambiguous union" exception and cannot even convert the schema from JSON. I get the same result when I try to build such a union schema using Lancaster's functions.
I understand this is a problem of how to dispatch serialization of a union of records. Lancaster uses the presence of unique fields to dispatch serialization, and thus avoids the ambiguity in the first place.
Is this kind of Avro schema considered an anti-pattern that you would rather not support? I have seen some Java people struggling with union-of-records schemas on Stack Overflow. However, the Java library supports it, so it is hard for me to judge whether these kinds of schemas are okay, or whether they cause problems down the road and are considered an anti-pattern.
In Clojure, dispatch can be done by different means (defrecord + protocols, a dispatch field, multimethods). The solution should also support both schemas defined in JSON files and schemas defined in Clojure using Lancaster's functions.
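For example, the multimethod route might look roughly like this (illustrative only, using the :x / :y / :z field names from the example earlier in the thread):
;; Dispatch on which distinguishing field the map carries; each method
;; returns the Avro record name the value should be serialized as.
(defmulti avro-record-name
  (fn [m] (cond (contains? m :y) :has-y
                (contains? m :z) :has-z)))

(defmethod avro-record-name :has-y [_] :A)
(defmethod avro-record-name :has-z [_] :B)

(avro-record-name {:x 1 :y "hello"})  ;; => :A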
Probably a good solution would be to use a namespaced type name together with the data to dispatch and serialize. Alternatively, the serialize and deserialize functions could take additional options (or optional schema attributes) that act as hooks for serialization or deserialization (I think abracad has a similar approach).
A similar problem is how to hint logicalTypes, e.g. converting time-millis to and from java.time.Instant.