confluentinc / schema-registry

Confluent Schema Registry for Kafka
https://docs.confluent.io/current/schema-registry/docs/index.html

Issue with compatibility for multi-schema topic with Schema References #1867

Closed yermolovich1987 closed 1 year ago

yermolovich1987 commented 3 years ago

Hello everyone. I didn't find a forum where I could raise this question, so I decided to submit it as an issue. I have found strange behaviour in the Java serializer for multi-schema topics that use schema references, and it seems to break backward compatibility. The property use.latest.version=true is required for multi-schema topics with schema references to work (this is also mentioned in the documentation: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-avro.html#multiple-event-types-in-the-same-topic). When this property is set, the serializer does not take the schema from the generated class but instead looks up the latest version in the external Schema Registry. As a result, when we try to produce an old version of the class (even one that is compatible with the new version, e.g. differing only by an added optional field), we get an IndexOutOfBoundsException, because the number of fields in the new schema differs from the number of fields in the payload class generated from the older schema version. Has anybody faced this issue? Is there any way to overcome it?

I have tried the latest versions of the serializer and Avro dependencies, but the issue is still reproducible. I have prepared a public sample project that demonstrates it: https://github.com/yermolovich1987/kafka-avro-mst-compatibility-demo.
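
For reference, here is a minimal sketch of the producer configuration involved (the bootstrap address, registry URL, topic, and class names below are placeholders, not taken from the demo project):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class MstProducerDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");          // placeholder
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", KafkaAvroSerializer.class.getName());
    props.put("schema.registry.url", "http://localhost:8081"); // placeholder
    // Required for multi-schema topics with schema references: the serializer
    // looks up the latest registered schema instead of registering the one
    // derived from the generated class.
    props.put("auto.register.schemas", false);
    props.put("use.latest.version", true);

    try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
      // value would be a generated SpecificRecord, e.g. a TaskCreated instance:
      // producer.send(new ProducerRecord<>("task-events", key, value));
    }
  }
}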

rayokota commented 3 years ago

@yermolovich1987 , thanks for the detailed reproduction.

You've identified a scenario that is not well-supported with use.latest.version=true. Typically in a backward compatible upgrade, you would first upgrade consumers, and then upgrade producers. In a forward compatible upgrade, you would first upgrade producers, and then upgrade consumers. In both cases, when using use.latest.version=true, since the serializers cache the latest schema version, the old producers will continue to operate correctly when a new schema version is registered (they will use the previous latest version). Your scenario is similar to a downgrade, where an old producer incorrectly tries to use the latest schema version (as opposed to the previous version). If you are going to be performing downgrades, then you won't be able to use use.latest.version=true.

dylanmei commented 3 years ago

There is a race condition between registering the new schema version and updating the producer software. In an organization where the producer (or, more precisely, the producing team) is not the one registering schemas with references, what are the safest and most efficient steps for coordinating the evolution of the schema?

yermolovich1987 commented 3 years ago

Hi Robert. Thanks for the response and for pointing out the caching in the serializer (I missed it). The scenario in the demo project is just a sample to demonstrate the failure. Regarding the caching: I agree that it helps with a rolling deployment, where a new node uses the new schema while an old node keeps using the old cached schema. But this seems more like a workaround than a robust solution, since we need to ensure that the producer has actually cached the old version and that the corresponding node has not been restarted right before the deployment. Is there a standard way to guarantee that the schema is cached in the producer? Also, what should we do if we need to roll back a failed deployment in that case?

I'd also like to second Dylan's question: it would be great to know the safest and most efficient steps to coordinate schema evolution in the case he mentioned.

My understanding was that in all such cases the FULL_TRANSITIVE mode would give us assurance that if a schema migration passed the compatibility check, everything would work correctly. But as I can see, this is not true for multi-schema topics with schema references, since they require use.latest.version=true. That's why I said in the original question that compatibility seems to be broken.
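
For reference, a minimal sketch of pinning a subject to FULL_TRANSITIVE with the Java SchemaRegistryClient (the registry URL and subject name are placeholders):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class CompatibilityModeDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder registry URL and cache size, adjust to your setup.
    SchemaRegistryClient client =
        new CachedSchemaRegistryClient("http://localhost:8081", 100);

    // Pin the subject to FULL_TRANSITIVE so every new version must be both
    // backward- and forward-compatible with all previously registered versions.
    client.updateCompatibility("task-events-value", "FULL_TRANSITIVE");

    System.out.println(client.getCompatibility("task-events-value"));
  }
}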

P.S. If possible, it would be great to throw a more meaningful exception in such cases (since this is expected behaviour with use.latest.version=true); the IndexOutOfBoundsException took quite a lot of time to debug before I figured out what was going on. For example, the serializer could throw an exception saying the schema is stale when the schema in the generated class differs from the cached/looked-up schema.

rayokota commented 3 years ago

One of the reasons for using use.latest.version=true is because Avro does not generate a class for an anonymous union. There is a proposal to add named unions to Avro, but until the proposal is implemented (see https://issues.apache.org/jira/browse/AVRO-248), use.latest.version=true is a workaround which has limitations, as you've pointed out.

The two alternatives to using an anonymous union with TopicNameStrategy are to:

  1. switch from TopicNameStrategy to RecordNameStrategy
  2. simulate a named union by wrapping the union with an Avro record

I mention both alternatives in the blog post https://www.confluent.io/blog/multiple-event-types-in-the-same-kafka-topic/.
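
For alternative 1, a minimal configuration sketch (the strategy class is Confluent's RecordNameStrategy; the broker and registry addresses are placeholders):

import java.util.Properties;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import io.confluent.kafka.serializers.subject.RecordNameStrategy;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecordNameStrategyConfig {
  public static Properties producerProps() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");          // placeholder
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", KafkaAvroSerializer.class.getName());
    props.put("schema.registry.url", "http://localhost:8081"); // placeholder
    // Derive the subject from the record's fully qualified name rather than
    // from the topic name, so each event type evolves under its own subject.
    props.put("value.subject.name.strategy", RecordNameStrategy.class.getName());
    return props;
  }
}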

yermolovich1987 commented 3 years ago

OK, got it. Thanks a lot! I will definitely check out your blog post.

dylanmei commented 3 years ago

Regarding the second approach:

simulate a named union by wrapping the union with an Avro record

I believe you are suggesting that, until AVRO-248 is addressed, instead of registering references and registering the schema subject as:

[
  "com.demo.avro.TaskCreated",
  "com.demo.avro.TaskCancelled"
]

I should simply register the schema subject as its own Avro record:

{
  "type": "record",
  "name": "TaskEvent",
  "fields": [
    {
      "name": "event",
      "type": [
        "com.demo.avro.TaskCreated",
        "com.demo.avro.TaskCancelled",
        "null"
      ]
    }
  ]
}

I added the "null" at the end, assuming this is the only way to ensure I can safely add another event type in the future. Kindly correct me if I'm wrong.

rayokota commented 3 years ago

@dylanmei , yes that's correct

rayokota commented 3 years ago

@dylanmei , sorry, to clarify: yes, that's what I meant about wrapping the union in a record. But the null should not be necessary if you evolve the schema in a backward-compatible manner by adding new types to the union.
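
To make the evolution concrete, here is a hypothetical sketch of the v1 and v2 wrapper schemas built with Avro's SchemaBuilder (TaskUpdated is an invented stand-in for a new event type; in the real project the branch schemas would come from the generated classes, e.g. TaskCreated.getClassSchema()):

import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class TaskEventEvolutionDemo {
  public static void main(String[] args) {
    // Empty stand-ins for the event schemas, for illustration only.
    Schema created = eventType("TaskCreated");
    Schema cancelled = eventType("TaskCancelled");
    Schema updated = eventType("TaskUpdated"); // invented new event type

    // v1 wrapper: the original two-branch union, no null branch needed.
    Schema v1 = wrapper(created, cancelled);
    // v2 wrapper: backward-compatible evolution by appending a new branch.
    Schema v2 = wrapper(created, cancelled, updated);

    System.out.println(v1.toString(true));
    System.out.println(v2.toString(true));
  }

  private static Schema eventType(String name) {
    return SchemaBuilder.record(name).namespace("com.demo.avro")
        .fields().endRecord();
  }

  private static Schema wrapper(Schema... branches) {
    return SchemaBuilder.record("TaskEvent").namespace("com.demo.avro")
        .fields()
        .name("event").type(Schema.createUnion(Arrays.asList(branches))).noDefault()
        .endRecord();
  }
}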

dylanmei commented 3 years ago

Thanks, I appreciate the extra clarity, @rayokota !

yermolovich1987 commented 3 years ago

Thanks for the detailed responses.