deercreeklabs / lancaster

Apache Avro library for Clojure and ClojureScript

Ambiguous union #27

Closed danskarda closed 1 year ago

danskarda commented 1 year ago

I have a simple schema with a union of records (pretty-printed with Cheshire):

{
  "name" : "message",
  "type" : "record",
  "fields" : [ {
    "name" : "events",
    "type" : {
      "type" : "array",
      "items" : [ {
        "type" : "record",
        "name" : "EventInt",
        "fields" : [ {
          "name" : "subject",
          "type" : "string"
        }, {
          "name" : "value",
          "type" : "int"
        } ]
      }, {
        "type" : "record",
        "name" : "EventString",
        "fields" : [ {
          "name" : "subject",
          "type" : "string"
        }, {
          "name" : "value",
          "type" : "string"
        } ]
      } ],
      "default" : [ ]
    }
  } ]
}

I can read the schema using the Apache Avro Java library, and using GenericData I can serialize and deserialize the data.

When the records have distinct field names, I can read and write data with Lancaster. When field names overlap, I get an "Ambiguous union" exception and cannot even convert the schema from JSON. I get the same result when I try to build such a union schema using Lancaster's functions.

I understand this is a problem of how to dispatch serialization for a union of records. Lancaster uses the presence of unique fields to dispatch serialization, thus avoiding the ambiguity in the first place.

Is this kind of Avro schema considered an anti-pattern that you would rather not support? I have seen some Java users struggling with union-of-records schemas on Stack Overflow. However, the Java library supports it, so it is hard for me to judge whether these kinds of schemas are OK or whether they cause problems down the road and are considered an anti-pattern.

In Clojure, dispatch can be done by several means (defrecord + protocols, a dispatch field, multimethods). A solution should also support both schemas defined in JSON files and schemas defined in Clojure using Lancaster's functions.

A good solution would probably be to use a namespaced type name together with the data for dispatch and serialization. Alternatively, the serialize and deserialize functions could take additional options (or optional schema attributes) that act as hooks for serialization or deserialization (I think abracad takes a similar approach).

A similar problem is how to hint logicalTypes, e.g. converting time-millis to and from java.time.Instant.
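
For illustration, here is a rough sketch of what such an API might look like, using the EventInt / EventString records from the schema above. Everything here is hypothetical: message-schema stands for a Lancaster schema built from that JSON, data for the untagged message, and neither the ::lancaster/type key nor the :union-dispatch option exists in Lancaster today.

;; Hypothetical sketch only -- not current Lancaster API.
;; Variant 1: the producer tags each ambiguous union value with a namespaced key.
(lancaster/serialize message-schema
                     {:events [{:deercreeklabs.lancaster/type :EventInt
                                :subject "cpu" :value 42}
                               {:deercreeklabs.lancaster/type :EventString
                                :subject "host" :value "web-1"}]})

;; Variant 2: the caller passes a dispatch hook that maps a value to a record name.
(lancaster/serialize message-schema data
                     {:union-dispatch (fn [v]
                                        (if (int? (:value v)) :EventInt :EventString))})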

bartkl commented 1 year ago

I'm bumping into the same issue.

For simplicity's sake, here's a smaller example:

  (def a (l/record-schema :A [[:x l/int-schema]
                              [:y l/string-schema]]))
  (def b (l/record-schema :B [[:x l/int-schema]
                              [:z l/string-schema]]))
  (def u (l/union-schema [a b]))  ;; =>
; Execution error (ExceptionInfo) at deercreeklabs.lancaster.utils/throw-overlapping-type-key (utils.cljc:874).
; Ambiguous union. Type key `:x` is shared by two schemas.

(Renaming the shared key :x of course circumvents the issue.)

Like @danskarda, I'm very curious to learn why this isn't possible in Lancaster. The Avro spec doesn't seem to prohibit it, and from what I understand from @danskarda, other implementations don't seem to either (including the official one).

Thanks.

chadharrington commented 1 year ago

Hi @danskarda and @bartkl, The ambiguous schema limitation is due to a decision I made several years ago. When mapping Avro to and from Clojure data, I chose to represent Avro records as Clojure maps. Avro records have names, but Clojure maps do not. This means that when serializing, Lancaster can't always know which record is intended. I chose to make such situations illegal (the aforementioned ambiguous union exception). While this avoids the problem, it doesn't meet the goal of being compatible with any valid Avro schema.
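
As a concrete illustration of the name loss (a sketch only, reusing the EventInt / EventString records from the original schema): in Avro's own data model a GenericRecord carries its schema, and therefore its record name, but once the value becomes a plain Clojure map that name is gone.

;; Both of these are just maps with the keys :subject and :value.
;; Nothing in the values themselves says "EventInt" or "EventString",
;; so Lancaster would have to sniff runtime types (int vs. string) to guess the branch.
{:subject "cpu" :value 42}        ;; presumably EventInt
{:subject "host" :value "web-1"}  ;; presumably EventString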

I'd like to change things to allow such schemas. My proposal is:

Let me know your thoughts. If you agree, I will plan to implement this sometime in the next month.

bartkl commented 1 year ago

That sounds good to me.

Thanks!

danskarda commented 1 year ago

Hi Chad, I have several ideas from different viewpoints (somewhat conflicting):

  1. Serialize and deserialize

    Would you also extend deserialize to include ::lancaster/type in unions?

    It would be a nice property that if you deserialize and then serialize again, you get the same Avro message, i.e. (->> bytes (lancaster/deserialize schema schema) (lancaster/serialize schema)) would return the same message.

    To take this one step further: why doesn't deserialize always return ::lancaster/type for all maps? Sometimes this could be handy, and because ::lancaster/type is namespaced, consumers can safely ignore it.

  2. Data or meta-data

    Does ::lancaster/type belong in the data or in metadata? Using metadata instead of a map keyword adds more work for producers and consumers. On the other hand, metadata can also be attached to vectors, which would support unions of arrays, or it can be attached everywhere.

  3. Data vs callbacks / handlers

    When you require producers to include ::lancaster/type, you put more implementation work on their side. For shallow data structures this can be fine; however, traversing complex nested data structures and placing ::lancaster/type where appropriate would be tedious. Some helping hand would be nice.

    ::lancaster/type is not only a keyword, it is also a function. So you can view the type dispatcher as a function whose default is ::lancaster/type itself (or (comp ::lancaster/type meta) if you decide to use metadata instead of a keyword).

    The question is how you pass this function:

    a. it can be a constant (i.e. ::lancaster/type)
    b. you can use a dynamic var, in the spirit of Clojure's *data-readers* or *default-data-reader-fn*
    c. it can be passed as an additional parameter / option to serialize and deserialize

    I find proposed solution a. too limiting (it forces me to traverse and prepare the data). Both b and c are fine, though I would probably prefer an additional argument over a dynamic var and implicit calling context.

  4. High-level musing I. - complecting vs untangling

    In the previous point I parametrized serialization with a callback / hook. Did I just mix two things together? A first step of data preprocessing / interpretation (i.e. deciding which data is which union type) and a second step of serialization.

    To express it in code

    • Mixed variant (->> (extract-data db) (serialize schema decode-union-types))

    • Separate variant (->> (extract-data db) interpret-union-types add-union-types (serialize schema))

    Here decode-union-types is a function which returns the Avro type for ambiguous union values, and add-union-types is a transformation which enriches the data with ::lancaster/type (or some other metadata).

    If I decide for simplicity and keep things separate, I can provide some utilities to make add-union-types easy to write.

    To help users write add-union-types, I can provide a helper function in the spirit of clojure.walk/prewalk, something like (lancaster/prewalk f schema data), which would walk both the data and the schema and help add ::lancaster/type where needed, deep in nested data structures (see the sketch after this list).

  5. High-level musing II. - clojure.spec unions

    clojure.spec also defines a union type (s/or) and is able to validate it and conform it (transform it into a less ambiguous structure). If Lancaster were more closely tied to clojure.spec, it could take advantage of that implementation, serialize using conformed data, and get union types for free. The question is at what price (a dependency on clojure.spec, speed, etc.).
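
To make point 4 a bit more concrete, here is a dependency-free sketch of what an add-union-types step could look like for the events schema from the first post. The tagging key and the dispatch-on-runtime-type logic are illustrative only, not an existing Lancaster feature:

(require '[clojure.walk :as walk])

;; Walk the message and attach a (hypothetical) type hint to every event map
;; before handing the result to serialize.
(defn add-union-types
  [message]
  (walk/postwalk
   (fn [x]
     (if (and (map? x) (contains? x :subject) (contains? x :value))
       (assoc x :deercreeklabs.lancaster/type
              (if (string? (:value x)) :EventString :EventInt))
       x))
   message))

(add-union-types {:events [{:subject "cpu" :value 42}
                           {:subject "host" :value "web-1"}]})
;; => {:events [{:subject "cpu", :value 42, :deercreeklabs.lancaster/type :EventInt}
;;              {:subject "host", :value "web-1", :deercreeklabs.lancaster/type :EventString}]}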

Sorry, no definitive answer :D

To summarize

Best, Dan

chadharrington commented 1 year ago

@danskarda Thank you for the thoughtful commentary. I appreciate it. I have been busy wrapping up a product release for work. I will get back to Lancaster next week.

bartkl commented 1 year ago

Hey @chadharrington,

Is it possible for you to give a status update?

Thanks

chadharrington commented 1 year ago

@bartkl My apologies. I have been focused on some work priorities and have not dealt with this issue. I plan to implement the "cheap solution" as Dan describes it above. I will complete it by April 1.

bartkl commented 1 year ago

No problem Chad. Although I do hope that's not an April Fool's joke ;) haha.

chadharrington commented 1 year ago

@bartkl I have not forgotten about this :-) The implementation is mostly done and I will wrap it up in the next few days. Thanks for your patience.

bartkl commented 1 year ago

Absolutely no problem, Chad. Thanks!

chadharrington commented 1 year ago

@bartkl @danskarda This has been added to the 0.11.0 release of Lancaster. You can read about the ambiguous schema support here: https://github.com/deercreeklabs/lancaster#the-lancasterrecord-name-attribute
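
For anyone landing here later, a minimal sketch of how this might look with the A / B records from the example earlier in the thread. The attribute name comes from the README section linked above; the exact expected value and behavior are described there, not verified here:

(def a (l/record-schema :A [[:x l/int-schema]
                            [:y l/string-schema]]))
(def b (l/record-schema :B [[:x l/int-schema]
                            [:z l/string-schema]]))
(def u (l/union-schema [a b]))  ;; no longer throws as of 0.11.0

;; Name the intended record branch in the data itself
;; (see the linked README for the exact value format):
(l/serialize u {:lancaster/record-name :A
                :x 1
                :y "hello"})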

Let me know if this fails to meet your needs for some reason. Best wishes, Chad