confluentinc / schema-registry

Confluent Schema Registry for Kafka
https://docs.confluent.io/current/schema-registry/docs/index.html

Registering a schema by version #540

Open KeithWoods opened 7 years ago

KeithWoods commented 7 years ago

TL;DR: It'd be fantastic if the schema version could be owned externally (e.g. a build version), and the register calls took the version rather than returned it.

Background

We're building a system whereby some clients don't have access to the schema registry. For example, we can't have thousands of web GUIs hitting the schema registry. These clients need to serialize/deserialize Avro messages, and on the receiving side we need to retrieve the schema (via the schema registry) for deserialization.

On the receiving side, the schema registry lets us get a schema either by id or by subject. If by id, a schema registry client can implement a cache so you don't hit the REST API on every call to retrieve the schema. If by subject, you obviously have to hit the REST API, as a schema registry client would never know whether its copy is the latest (unless the client implemented a near cache, which the one we're using, io.confluent.kafka.schemaregistry.client, doesn't).
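
To illustrate why lookup by id caches well, here is a minimal sketch of a near cache. The `SchemaCache` class and the `fake_fetch` stand-in are hypothetical names for illustration only (they are not part of the Confluent client); in a real service the fetch function would call the registry's `GET /schemas/ids/{id}` endpoint.

```python
from typing import Callable, Dict


class SchemaCache:
    """Near cache keyed by schema id.

    An id always maps to the same schema, so a cached entry never goes
    stale. A "latest by subject" lookup has no such guarantee.
    """

    def __init__(self, fetch_by_id: Callable[[int], str]) -> None:
        self._fetch_by_id = fetch_by_id      # hypothetical remote lookup
        self._by_id: Dict[int, str] = {}
        self.remote_calls = 0                # counts registry round trips

    def get(self, schema_id: int) -> str:
        if schema_id not in self._by_id:
            self.remote_calls += 1
            self._by_id[schema_id] = self._fetch_by_id(schema_id)
        return self._by_id[schema_id]


def fake_fetch(schema_id: int) -> str:
    # Stand-in for a REST call to GET /schemas/ids/{schema_id}.
    return '{"type": "string"}'


cache = SchemaCache(fake_fetch)
cache.get(7)
cache.get(7)  # served from the cache, no second remote call
```

Only the first `get(7)` reaches the registry; every later message carrying id 7 is resolved locally.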

Given the above I see 2 ways external clients, which can't access the schema registry, can serialise and have internal services deserialize. Both of which have a problem:

1) Have the external clients serialise the schema id with the serialised avro bytes

This is the traditional approach whereby the external client somehow has to get the schema id and encode it so the receiving side can deserialize it in an efficient manner (i.e. using a near cache). The problem with this is that the schema id is a runtime concept. At build and deploy time you don't know the version of the schema, so you can't build external clients that encode the correct schema id. As already mentioned, external clients can't hit the REST API.

2) Have the external clients serialise the schema subject with the serialised avro bytes

This method allows the clients to be decoupled from the schema id and thus from the schema registry. The subject doesn't change over time (else it's a breaking change). On the receiving side the only option is to get the latest schema id via a REST call (which isn't an option for performance reasons). You also throw many aspects of schema evolution out the window, because receiving services must be able to process the latest version of the schema.
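
For reference, option 1 can be sketched as below. This assumes the framing used by the Confluent serializers as I understand it (one magic byte `0`, then the schema id as a 4-byte big-endian integer, then the Avro payload); the thread itself does not spell the format out, and the `frame`/`unframe` names are hypothetical.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # assumed framing marker, see the note above


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix serialized Avro bytes with the magic byte and schema id."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < 5:
        raise ValueError("message too short to contain a schema id")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]
```

The producer has to know `schema_id` before it can call `frame`, which is exactly the build-time problem described in option 1.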

Both approaches are non-optimal.

Possible Solution

I think if the schema registry allowed you to put a specific version of a schema, rather than having the registry own the version, the above problems would go away. You could implement the first option in an efficient manner, as versions are known at compile time. A deployment step can register all new schemas by specific version, or services can register schemas by version as they come online.

Am I missing something? Is this supported or has it been requested?

Thanks in advance

raybooysen commented 7 years ago

There are some solutions to this, where the client KNOWS the id of the schema in its code. However, that would mean coupling the build of the client to the production instance of the schema registry.

Having the schema registry own the id is, to me, the problem; the id should just be the concern of whoever is pushing to the registry.

mageshn commented 7 years ago

I'm trying to understand the concern about the clients accessing the Schema Registry. I can help better if I understand the concern, because Schema Registry is designed to work end to end, where both producer and consumer are hooked up to it.

raybooysen commented 7 years ago

Hi @mageshn

The first comment has a good breakdown. What additional information would you like to discuss?

ferozed commented 5 years ago

I second this request. It would be great to have the schema version be specified by the registering client, as opposed to being generated by the registry. Also, it would be doubly nice if the schema version is a string rather than an integer.

nick-zh commented 5 years ago

@mageshn I think the main problem that could be solved by being able to set a specific version for your schema is that you give the power back to the application. Right now, neither the schema id nor the version can be known (unless I am mistaken) until you register the schema. Say I deploy my application and schema(s) in one go: I wouldn't easily be able to tie my producer / consumer to that version up front. I would actually need to register my schema(s) first and then feed back the version / schema id information. As @ferozed already mentioned, I think it would be a great addition if the version were a string and not an integer.
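
The "register first, then feed the id back" flow described above can be sketched as follows. It uses the existing `POST /subjects/{subject}/versions` endpoint, which returns the id assigned by the registry; the helper names, the base URL, and the subject are hypothetical, and the request is only built here, not sent. The version-carrying register call requested in this issue is not shown.

```python
import json
import urllib.parse
import urllib.request


def build_register_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build the request that registers a schema under a subject."""
    # The registry expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/subjects/{urllib.parse.quote(subject, safe='')}/versions",
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


def parse_register_response(raw: bytes) -> int:
    """Extract the registry-assigned schema id from the response body."""
    return json.loads(raw)["id"]


# Hypothetical deployment step: the id is only known after this round trip.
request = build_register_request(
    "http://localhost:8081", "orders-value", {"type": "string"}
)
```

A deployment script would send `request` with `urllib.request.urlopen`, pass the response body to `parse_register_response`, and then inject the resulting id into the producer's configuration, which is the extra feedback step this issue is trying to avoid.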