confluentinc / schema-registry

Confluent Schema Registry for Kafka
https://docs.confluent.io/current/schema-registry/docs/index.html

Schema ID strategy complicates deployment topologies #250

Open bcotton opened 9 years ago

bcotton commented 9 years ago

Reviewing the architecture of the schema-registry, specifically in a multi datacenter deployment where there can only be a single master node, leads to overly complex deployment architectures.

Schema IDs are global, monotonically increasing integers that can only be handed out by a single master instance of the schema-registry. This complicates deployment topologies in both hot-cold and hot-hot datacenter configurations. The hot-hot scenario can only have a single master schema registry to which the other datacenter(s) must communicate over the WAN. In hot-cold scenarios, additional configuration must be orchestrated when moving datacenters from hot -> cold and back. This gets even more complicated as the number of datacenters increases.

Currently there is no implicit order to the schema IDs. The only order needed in the system is the order of the schemas as applied to Subjects. Message ordering in the _schemas topic seems to enforce this order already.

It would seem that generating global schema IDs in a stateless way would simplify the deployment topologies by allowing any process to register new schemas and publish them to Kafka, as long as the schema IDs were in fact globally unique. This would all but eliminate the need for a separate schema-registry process, and the schema-registry REST endpoints could be folded into the kafka-rest process.

Schema ID generation could be based on the Avro fingerprint of the canonical parsing form of the schema. If schema equality is done simply by string equality, then a UUID could be returned from the schema-registry.
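A minimal sketch of the stateless-ID idea, in Python for illustration. Note the canonicalization here (JSON with sorted keys) is only a stand-in for Avro's actual parsing canonical form, which has its own normalization rules:

```python
import hashlib
import json

def stateless_schema_id(schema_json: str) -> str:
    """Derive a globally unique schema ID without a master node.

    json.dumps with sorted keys stands in for Avro's parsing
    canonical form; a real implementation would also strip docs,
    defaults, and aliases per the Avro spec before hashing.
    """
    canonical = json.dumps(json.loads(schema_json), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Two textually different but structurally identical schemas yield
# the same ID, regardless of which datacenter computed it.
a = '{"type": "record", "name": "User", "fields": []}'
b = '{"name": "User", "fields": [], "type": "record"}'
assert stateless_schema_id(a) == stateless_schema_id(b)
```

Any process in any datacenter can compute this ID without coordinating with a master, which is the property the proposal is after.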

This has two drawbacks: 1) the schema ID would be larger than an encoded integer, and 2) there could be race conditions for subject ordering based on mirror-maker lag between datacenters.

I would like to get your thoughts on this.

Thanks

junrao commented 9 years ago

Bob,

Thanks for reporting this. Just a few clarifications on the current design.

  1. In our API, each schema does get a unique id based on the MD5 of the canonicalized schema string. So, if the same schema is registered under different subjects, you will get back the same schema id.
  2. When registering a schema under a subject, each schema also gets a version id within that subject. Since we need to check the compatibility of the new schema with the existing ones, the ordering in which those schemas are registered is important. The easiest way to enforce this is to have a single master design---the master decides the ordering of the schemas registered in a subject and the ordering in which the registered schemas are written to the Kafka log. If we allow new schemas to be registered on more than one schema registry instance, non-compatible schemas can leak into the system.
  3. Note that schema registry server is used rarely since the schema registry client does caching. So, with a single master in a multi-DC setup, the WAN should be used rarely.
  4. I agree that the current multi-DC strategy is a bit complicated, but guaranteeing the ordering within a subject is important for the compatibility check. In certain cases, the deployment strategy can be made simpler. For example, if one wants to push new schema changes manually, potentially we can have a file-based storage engine in schema registry. Users can then deploy a separate schema registry per DC, but they are responsible for copying the latest schema files into each DC and making them consistent.
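Point 3 above can be sketched as follows. `CachingRegistryClient` is an illustrative stand-in, not the real client API, but it shows why a remote master is tolerable: each schema ID crosses the WAN at most once per process lifetime:

```python
class CachingRegistryClient:
    """Sketch of client-side caching: a schema is fetched from the
    (possibly remote) master at most once, then served from a local
    dict for the life of the process."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # e.g. an HTTP call over the WAN
        self._cache = {}
        self.remote_calls = 0

    def get_schema(self, schema_id):
        if schema_id not in self._cache:
            self.remote_calls += 1
            self._cache[schema_id] = self._fetch_remote(schema_id)
        return self._cache[schema_id]

client = CachingRegistryClient(lambda sid: f"schema-{sid}")
for _ in range(1000):
    client.get_schema(42)        # 1000 lookups by the application...
assert client.remote_calls == 1  # ...one round trip to the master
```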

Let us know if you have more thoughts on this.

Thanks,

ppearcy commented 8 years ago

This issue is a little old, but I wanted to chime in.

1) Would it be possible to include the MD5 / fingerprint as a first class attribute and the ability to lookup a schema via the fingerprint?

2) The version id could still be generated by a master and always incrementing. But at this point, it would be too painful to get rid of the schema ID identity, although a new serialization format with a different magic byte could potentially phase it out.

This could effectively remove the schema registry dependency from the client side, if you can guarantee all schemas are registered server side and are OK without the pre-send validation.

ewencp commented 8 years ago

Hey @ppearcy.

Part 1 definitely seems doable, although I'm sure there could be plenty of bikeshedding over what to choose as the hash. If something was included here I'd a) suggest we actually make it explicit in various requests (e.g. if we used md5, it's not just called "fingerprint", but "md5" so we have room for extension/alternatives) and b) we try to make sure it is as well supported across languages as possible (I'm guessing that means md5/sha functions, but maybe things like murmur could work too).

But overall, I think part 1 only requires computing these fingerprints, including them in lookups, and adding an API endpoint for lookup via those fingerprints. Given the generally small data sizes involved in the schema registry, I'm assuming the index for that lookup wouldn't be an issue.
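The index being described could be as simple as a secondary map alongside the existing id-to-schema store. A sketch, where the route name `/schemas/md5/{hash}` is illustrative only and not an actual Schema Registry endpoint:

```python
import hashlib

# Hypothetical secondary index for "lookup via fingerprint": next to
# the existing id -> schema store, maintain md5(schema) -> id.
schemas_by_id = {1: '{"type":"string"}', 2: '{"type":"long"}'}
ids_by_md5 = {
    hashlib.md5(s.encode()).hexdigest(): sid
    for sid, s in schemas_by_id.items()
}

def lookup_by_md5(md5_hex: str) -> int:
    """What a GET /schemas/md5/{hash} endpoint (name is made up for
    this sketch) might return."""
    return ids_by_md5[md5_hex]

fp = hashlib.md5('{"type":"string"}'.encode()).hexdigest()
assert lookup_by_md5(fp) == 1
```

Given the small data sizes mentioned above, keeping this second map in memory costs little.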

Part 2 is definitely a lot more complicated. We haven't discussed evolving the format -- the magic byte is there so we can support it, but I think we'd want really thorough thought put into a new format before committing to it. The existing format was proven to work well in a large organization with gating on clients based on successfully registering/validating schemas.

I get that requiring all clients to validate against the schema registry introduces a runtime requirement and potential source of failure, but is there a reason you're really motivated to remove that dependency? In general, schema registry is not a highly-loaded system, has a good HA story, and isn't operationally complex. And you still want the service to exist so you still pay the cost of running it -- it just seems you are trying to take it out of the critical path of app startup. Can you help me understand why the impact of removing that dependency would be so big for you?

yushuyao commented 8 years ago

Strongly agree with @ppearcy about 1). One of our use cases has message-level schema for binary messages. We look up the schema (and cache it) when we see each message.

Of course we can include the "schema registry generated ID" in the message, but it becomes complicated when we also want an R&D schema registry in a separate environment.

ewencp commented 8 years ago

@yushuyao I'm not sure how the scenario you are describing differs from how we already handle schemas? Or I guess what scenario you are trying to handle by not including the schema ID?

You mention "message-level schema for binary messages" and that you "look up the schema (and cache it)". This sounds like normal schema registry/serializer behavior.

I think what you're getting at re: R&D schema is that you want to create some messages that aren't tied to schema IDs in your production environment? I'm trying to figure out how this data ultimately gets processed. When you mention R&D schemas, do you just mean schemas that are defined in your code, you don't care about compatibility for them, and the data generated with them is considered disposable for the time being?

If that's the case, I think maybe you could just omit the schema registry entirely -- at the point it seems like you are already assuming a shared knowledge of the schema between producer and consumer. In that case it sounds like you want a variant of the serializer which doesn't include any ID/fingerprint/schema identification, but rather assumes that the application knows the exact schema. That doesn't leave any room for schema evolution, but might be handy just for quick testing where you don't even care about data compatibility or longevity. Is that a correct assessment of what you are trying to accomplish?

bcotton commented 8 years ago

I think what @yushuyao is saying about R&D, at least in the case we are looking at, is this: schemas are born in R&D and evolve/change faster than in production. They initially start with no forward/backward compatibility checking, to help with initial development.

We don't want those schemas polluting the production registry.

bcotton commented 8 years ago

@ppearcy, @ewencp Keep in mind that what Avro considers the "fingerprint" of a schema is based on the canonical parsing form of the schema, that is, the schema with everything "extra" stripped away and then normalized. If you encode other metadata into the schema or use logical types, that data is not included in Avro's fingerprint. This becomes an issue because there may be "changes" to the schema, but the Avro fingerprint will not change.

Confluent's schema registry generated IDs are based on this fingerprint.
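The pitfall above can be demonstrated with a rough sketch of canonicalization. The `to_canonical` helper here is illustrative and much cruder than Avro's real parsing canonical form, but it shows the failure mode: a `doc` change and a logical-type annotation both vanish from the fingerprint:

```python
import json

def to_canonical(schema: dict) -> str:
    """Rough sketch of Avro's parsing canonical form: attributes like
    doc, aliases, and logicalType are stripped before fingerprinting.
    (The real rules in the Avro spec do more, e.g. collapsing
    {"type": "long"} to "long" and resolving full names.)"""
    KEEP = {"type", "name", "fields", "symbols", "items", "values", "size"}
    def strip(node):
        if isinstance(node, dict):
            return {k: strip(v) for k, v in node.items() if k in KEEP}
        if isinstance(node, list):
            return [strip(x) for x in node]
        return node
    return json.dumps(strip(schema), sort_keys=True, separators=(",", ":"))

with_extras = {"type": "record", "name": "T", "doc": "v2 semantics!",
               "fields": [{"name": "ts",
                           "type": {"type": "long",
                                    "logicalType": "timestamp-millis"}}]}
plain = {"type": "record", "name": "T",
         "fields": [{"name": "ts", "type": {"type": "long"}}]}

# The "change" is invisible to the canonical form, hence to the fingerprint:
assert to_canonical(with_extras) == to_canonical(plain)
```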

ppearcy commented 8 years ago

@ewencp Agree would want to call it md5 vs fingerprint, no opinion on hash mechanism other than a +1 for md5. And you can pretty much ignore point 2). Was mostly commenting that it was a possibility.

@bcotton Yeah, no surprises there.

My use case is different from others. I need to support clients producing data from remote locations from client servers that are not under my control and do not have access to the schema registry. The schemas that are available to be produced are locked down and known ahead of time (and already validated for compatibility).

The Kafka level security is too cumbersome operationally for me to use and these servers already have the ability to authenticate to a central service that I control.

My plan is to use Amazon Kinesis to bridge events to my Kafka instances and to allow STS keys to be provisioned from my central service. Right now, I have to send messages via the Avro object container file format, which embeds the schema and bloats the messages.

Instead, I'd prefer to just include the fingerprint as part of a header. Then I can consume out of Kinesis and map the fingerprint to the schema registry id and produce back into kafka.
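A sketch of that bridge pattern, with the header name and the fingerprint-to-id map being assumptions of this example (the map would be populated via whatever fingerprint lookup the registry exposed). The 5-byte prefix is the documented Confluent wire format (magic byte 0 plus a 4-byte big-endian schema id):

```python
import hashlib

# Producer side (no registry access): attach the schema fingerprint
# as a message header instead of shipping the whole schema per
# message, avoiding the object-container-file bloat.
def wrap(payload: bytes, schema_text: str):
    fp = hashlib.md5(schema_text.encode()).hexdigest()
    return {"headers": {"schema-md5": fp}, "value": payload}

# Bridge side (has registry access): resolve fingerprint -> registry
# id, then produce into Kafka in the normal wire format.
fingerprint_to_id = {
    hashlib.md5(b'{"type":"string"}').hexdigest(): 7,
}

def bridge(msg) -> bytes:
    schema_id = fingerprint_to_id[msg["headers"]["schema-md5"]]
    # Confluent wire format: magic byte 0, 4-byte big-endian id, payload
    return bytes([0]) + schema_id.to_bytes(4, "big") + msg["value"]

m = wrap(b"\x06foo", '{"type":"string"}')
out = bridge(m)
assert out[:5] == b"\x00\x00\x00\x00\x07"
```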

I have another alternative which could be to proxy schema registry information via the central service, but this then forces me to run the unified multi-region master schema registry, which is an operational pain.

cbsmith commented 8 years ago

Seems like we've got a lot of pain here just to avoid duplicative schemas being created.

Wouldn't it be better to resolve that in an eventually consistent fashion? Have the schema IDs be UUIDs, ignoring content hashing. As long as the schema is registered in a canonical form, topic consumers can identify duplicates to avoid paying a storage penalty for them. The comparison is likely faster (no need to compute an MD5 hash) and collisions can't be induced (unless you use a type 5 UUID, but then... why?!?! ;-).

Any given schema master can dedupe before allowing updates to avoid excessive schema revisions (the existing models already store the schema in memory, so a hashtable with the actual schema object as the key is likely lower memory overhead than one keyed by the MD5 hash), but with multiple masters you just end up with two different UUIDs that actually point to the same schema object. If you want to get really clever, you can go through the schema history and rewrite version entries to point to only one of the two duplicate UUIDs (you'd still need to keep the mapping from the other UUID to a schema, but at least clients could easily tell they are looking at the same UUID).

The worst-case scenario is a race condition between two masters, and all that costs is a bit more bandwidth. In exchange, you can easily have multiple masters, and even independent schema stores (for example, independent Kafka clusters) as long as they have some kind of asynchronous replication between them. That seems like a very reasonable trade-off for a way, way simpler and more reliable model for HA.
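The multi-master idea can be sketched as follows. `RegionalRegistry` is an illustrative toy, not any real registry code; the point is that minting IDs needs no coordination, and a duplicate registration costs only a second UUID pointing at the same schema:

```python
import uuid

class RegionalRegistry:
    """Toy multi-master registry: every master mints random UUIDs
    independently, and duplicate registrations of the same schema
    text are reconciled lazily via replication."""

    def __init__(self):
        self.id_to_schema = {}
        self.schema_to_id = {}   # local dedupe, keyed by schema text

    def register(self, schema_text: str) -> str:
        if schema_text in self.schema_to_id:   # local dedupe before minting
            return self.schema_to_id[schema_text]
        new_id = str(uuid.uuid4())             # stateless, no master needed
        self.id_to_schema[new_id] = schema_text
        self.schema_to_id[schema_text] = new_id
        return new_id

    def replicate_from(self, other):
        """Async replication: both UUIDs survive, but both resolve to
        the same schema, so consumers in either region can read."""
        for sid, schema in other.id_to_schema.items():
            self.id_to_schema.setdefault(sid, schema)

us, eu = RegionalRegistry(), RegionalRegistry()
a = us.register('{"type":"string"}')
b = eu.register('{"type":"string"}')   # the race: two IDs, one schema
us.replicate_from(eu)
assert a != b and us.id_to_schema[a] == us.id_to_schema[b]
```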

It'd also allow for easy use of AP (in CAP terms) backing stores that span data centers and handle the whole HA problem for you (Cassandra comes to mind immediately).

ki11roy commented 7 years ago

Reviving this issue a bit. Would it be possible to pass the ID as an optional parameter of the registration query, to have some control over the generation process, or just use the internal MD5 as the ID, which is already there to check for duplicates? In my case the latest schemas are always on the producer side (shipped with it) and are registered on every launch of the application. There are also multiple regional schema registries. On first launch the schemas register and get IDs which are different in every region, so consumers can't merge data from the different regions. It is a bit problematic to have a common registry because of the distributed nature of the service. How about this improvement?

cbsmith commented 7 years ago

I think consumers could merge data from different regions, so long as they resolved with the registry from the appropriate region, no?

Freds72AtWork commented 6 years ago

Same here - had to fork the Schema Registry to use topic+fingerprint as the key (instead of topic+version). The version as a 32-bit integer is a major weak point for consistent deployment (not to mention the ability to read messages from production in a different environment...).

cbsmith commented 5 years ago

I think 32 bits works fine with a decent hash, given that you're already scoped by topic/subject. 64 bits would be better, but frankly the odds of a collision are ridiculously low even if you change the schema 10000 times... and even then, some basic reasoning based on timestamps can resolve any ambiguity.
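The odds can be checked with the standard birthday bound, shown here as a small Python calculation (the 10000-version figure is taken from the comment above):

```python
import math

# Birthday bound for random k-bit IDs within one subject: the chance
# of any collision among n registered versions is approximately
# 1 - exp(-n*(n-1) / (2 * 2**k)).
def collision_probability(n: int, bits: int) -> float:
    return 1 - math.exp(-n * (n - 1) / (2 * 2**bits))

# Even at 10000 schema versions in a single subject, a 32-bit hash
# collides with probability on the order of one percent; with 64-bit
# IDs the probability is negligible.
p32 = collision_probability(10_000, 32)
p64 = collision_probability(10_000, 64)
assert 0.005 < p32 < 0.02
assert p64 < 1e-10
```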