avro-kotlin / avro4k

Avro format support for Kotlin
Apache License 2.0

Support for schema references #120

Closed: williamboxhall closed this issue 2 years ago

williamboxhall commented 3 years ago

Avro supports creating a union type out of schema references, where each element of the union points to a separate, independently evolving schema file. Is this something Avro4k might support in the future?

thake commented 2 years ago

@williamboxhall thanks for raising this question. As far as I can tell, schema references are a part of the Confluent schema registry spec. I don't think support for schema references should be included in Avro4k. But maybe I'm just lacking a little bit of imagination :) Can you specify how avro4k could help handle schema references for Confluent schema registry?

GreyTeardrop commented 2 years ago

Hi @thake! Sorry for possibly hijacking this issue. I have a request that might be related to what @williamboxhall originally asked for. Avro by itself supports referencing objects from other schemas loaded into the same context, e.g.:

Schema A:

{
  "type" : "record",
  "name" : "A",
  "namespace" : "test",
  "fields" : [ {
    "name" : "z",
    "type" : "string"
  }, {
    "name" : "v",
    "type" : "long"
  } ]
}

Schema B that references A:

{
  "type" : "record",
  "name" : "B",
  "namespace" : "test",
  "fields" : [ {
    "name" : "x",
    "type" : "int"
  }, {
    "name" : "y",
    "type" : "A"
  } ]
}

If schema A is loaded into org.apache.avro.Schema.Parser before schema B is parsed, Avro can successfully resolve the reference from B to A.
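A minimal Kotlin sketch of that parse order, assuming the Apache Avro Java library is on the classpath. The names `schemaAJson`, `schemaBJson`, and `parseInOrder` are made up for this illustration:

```kotlin
import org.apache.avro.Schema

// Schema A and schema B from this comment, as JSON strings.
val schemaAJson = """
{
  "type": "record", "name": "A", "namespace": "test",
  "fields": [
    {"name": "z", "type": "string"},
    {"name": "v", "type": "long"}
  ]
}
"""

val schemaBJson = """
{
  "type": "record", "name": "B", "namespace": "test",
  "fields": [
    {"name": "x", "type": "int"},
    {"name": "y", "type": "A"}
  ]
}
"""

fun parseInOrder(): Schema {
    // A single parser instance remembers every named type it has parsed.
    val parser = Schema.Parser()
    parser.parse(schemaAJson)        // registers test.A
    return parser.parse(schemaBJson) // "A" is looked up in B's namespace, giving test.A
}

fun main() {
    val b = parseInOrder()
    println(b.getField("y").schema().fullName) // test.A
}
```

The order matters: a fresh parser that has never seen A has nothing to resolve the name `A` against, so parsing B on its own should fail.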

It would be awesome if there was a way to perform the reverse operation with Avro4k: given classes A and B with B referencing A, be able to generate a schema for A and then a schema for B that references A instead of embedding it.

GreyTeardrop commented 2 years ago

I've looked more into the code, and it seems like Avro's native SchemaBuilder does not support referencing other schemas. The schema I've posted above can be parsed, but when dumped, it would embed the nested object instead of referencing it, so there's probably no easy way to dump schemas that refer to other schemas.
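The embedding behaviour can be checked with a short Kotlin sketch, again assuming the Apache Avro Java library is on the classpath. `aJson`, `bJson`, and `dumpB` are hypothetical names, and the check covers the plain `Schema.toString(pretty)` dump only, not `SchemaBuilder`:

```kotlin
import org.apache.avro.Schema

val aJson = """{"type":"record","name":"A","namespace":"test","fields":[{"name":"z","type":"string"},{"name":"v","type":"long"}]}"""
val bJson = """{"type":"record","name":"B","namespace":"test","fields":[{"name":"x","type":"int"},{"name":"y","type":"A"}]}"""

fun dumpB(): String {
    val parser = Schema.Parser()
    parser.parse(aJson)                       // make test.A known to the parser
    return parser.parse(bJson).toString(true) // pretty-printed dump of B
}

fun main() {
    val dumped = dumpB()
    println(dumped)
    // A fresh parser accepts the dump on its own, which only works
    // because A's full definition is embedded in it.
    val reparsed = Schema.Parser().parse(dumped)
    println(reparsed.getField("y").schema().fields.map { it.name() }) // [z, v]
}
```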

thake commented 2 years ago

@GreyTeardrop thanks for pointing out a case of native schema referencing in Avro! I wasn't aware of that.

Do you have a specific use case for avro4k in mind where this could be useful? Again, it seems I'm lacking some imagination :innocent:

GreyTeardrop commented 2 years ago

The thought was to keep Kotlin models as the source of Avro schemas, but generate JSON schemas for clients that can't directly consume Kotlin models. Ideally, to make those JSON schemas easier to read not only for software but also for developers, they would reference included sub-schemas instead of nesting them.

This isn't a very strong use case - it's perfectly possible to live with dumping schemas the way they are right now.

thake commented 2 years ago

@GreyTeardrop, thanks for your input on the use case. This seems to be a totally valid use case if you choose the code-first approach to managing schemas.

I did some research on this matter in the Avro bug archives. It seems like the Avro team consciously decided not to include schema referencing in Avro schemas. The Avro team favored a pre-processor approach instead (see https://issues.apache.org/jira/browse/AVRO-1188?focusedCommentId=13490809&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13490809).

As there is no native Avro support for explicitly referencing schemas, I wouldn't want to include native support for schema separation in avro4k. You can, however, write up some post-processing that splits the resulting schema into multiple files. It should be an easy task. I think one could develop a little open source library that only depends on the Avro library. Would be cool to see it.
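One possible shape for such post-processing, as a rough Kotlin sketch that depends only on the Avro Java library. It is not part of avro4k, and `collectNamedTypes` and `splitSchema` are hypothetical names. It assumes an Avro release that ships the deprecated `Schema.toString(Collection<Schema>, boolean)` overload, which renders the given schemas as name references instead of inlining them. That overload is marked "use at your own risk" upstream, so check it against the Avro version in use:

```kotlin
import org.apache.avro.Schema

/** Collects every named type (record, enum, fixed) reachable from [schema], keyed by full name. */
fun collectNamedTypes(
    schema: Schema,
    found: MutableMap<String, Schema> = linkedMapOf()
): Map<String, Schema> {
    when (schema.type) {
        Schema.Type.RECORD -> {
            // Only descend the first time a record is seen, so recursive records terminate.
            if (found.putIfAbsent(schema.fullName, schema) == null) {
                schema.fields.forEach { collectNamedTypes(it.schema(), found) }
            }
        }
        Schema.Type.ENUM, Schema.Type.FIXED -> found.putIfAbsent(schema.fullName, schema)
        Schema.Type.ARRAY -> collectNamedTypes(schema.elementType, found)
        Schema.Type.MAP -> collectNamedTypes(schema.valueType, found)
        Schema.Type.UNION -> schema.types.forEach { collectNamedTypes(it, found) }
        else -> Unit // primitives carry no named types
    }
    return found
}

/** Returns one JSON document per named type; other named types appear as name references. */
@Suppress("DEPRECATION")
fun splitSchema(root: Schema): Map<String, String> {
    val named = collectNamedTypes(root)
    return named.mapValues { (name, schema) ->
        // Every named type except the one being rendered is treated as "already defined".
        val others = named.filterKeys { it != name }.values
        schema.toString(others, true)
    }
}

fun main() {
    // Schema B from earlier in the thread, with A embedded the way a plain dump produces it.
    val embedded = """{"type":"record","name":"B","namespace":"test","fields":[{"name":"x","type":"int"},{"name":"y","type":{"type":"record","name":"A","fields":[{"name":"z","type":"string"},{"name":"v","type":"long"}]}}]}"""
    val root = Schema.Parser().parse(embedded)
    splitSchema(root).forEach { (name, json) -> println("// $name\n$json\n") }
}
```

Each map entry could then be written to its own file. Reading the files back requires parsing them into one `Schema.Parser` in dependency order (A before B here), which is the pre-processing step the Avro team's approach leaves to the user.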