Apicurio / apicurio-registry

An API/Schema registry - stores APIs and Schemas.
https://www.apicur.io/registry/
Apache License 2.0

Apicurio registry 2.4.3 and 2.4.4 causes endless rebalance loop on Kafka #3662

Open obabec opened 1 year ago

obabec commented 1 year ago

Apicurio version: 2.4.3 and 2.4.4
OpenShift version: 4.12
Kafka version: 3.4.0, 3.5.0
Persistence type: kafkasql
Converter: SerdeAvro

The issue is reproducible when running the Debezium test suite against OpenShift 4.12. The test suite installs via OLM, so to hit the issue you have to change it so that the operator is installed from an operator bundle.

Once the tests connect a consumer to the plain listener on Kafka (the same one that Apicurio uses), the consumer falls into an endless rebalance loop. We have tried versions 2.3.0, 2.4.0, 2.4.1, 2.4.2, 2.4.3 and 2.4.4. The issue occurs in versions >= 2.4.3.

The issue is 100% reproducible in this environment.

This issue is just a follow-up to discussions that started on Monday.

Edited as a follow-up to Jiri's comment.

novotnyJiri commented 1 year ago

Additional info! Due to my mistake in the test runs, we wrongly assumed that TLS works on version 2.4.3. That is not true. It only works on version 2.4.2 and probably all earlier versions.

carlesarnal commented 8 months ago

@jsenko I think this has been addressed already, no?

jsenko commented 8 months ago

Not yet

jcechace commented 2 months ago

Any progress on this? Downstream-wise, this is actually a pretty significant issue.

carlesarnal commented 2 months ago

2.4.3.Final introduced headers support in the converters. The converters are based on our serdes, and a bug there mistakenly set the default value for enabling headers to true. As a result, the converter started using headers instead of sending the magic byte and the schemaId in the message payload. To revert that behaviour, apicurio.registry.headers.enabled has to be set to false in the connector configuration. That is a workaround, not a fix. In 3.0 this has been addressed and the default value is set correctly. We cannot change it back in 2.6.x because there are existing consumers relying on it, so our best option for older versions is very likely to document this.
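
For anyone applying the workaround, here is a minimal sketch of how the property could be added to a connector configuration before it is submitted to Kafka Connect. It assumes the usual Kafka Connect convention that converter settings are passed with a `key.converter.` or `value.converter.` prefix. The helper name `disable_converter_headers`, the sample connector class and the registry URL are illustrative only and do not come from this issue.

```python
import json

# Property named in the comment above; it is applied per converter role.
HEADERS_PROPERTY = "apicurio.registry.headers.enabled"


def disable_converter_headers(connector_config):
    """Return a copy of the config with headers disabled on Apicurio converters.

    Hypothetical helper: it only patches roles whose converter class lives in
    an io.apicurio package (an assumption made for this sketch).
    """
    patched = dict(connector_config)
    for role in ("key.converter", "value.converter"):
        converter_class = patched.get(role, "")
        if converter_class.startswith("io.apicurio."):
            # Kafka Connect hands "<role>.<property>" to the converter as "<property>".
            patched[f"{role}.{HEADERS_PROPERTY}"] = "false"
    return patched


# Example configuration; the connector class and URL are placeholders.
example_config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "value.converter": "io.apicurio.registry.utils.converter.AvroConverter",
    "value.converter.apicurio.registry.url": "http://registry.example:8080/apis/registry/v2",
}

patched_config = disable_converter_headers(example_config)
print(json.dumps(patched_config, indent=2))
```

The helper leaves the input untouched and skips any role that does not use an Apicurio converter, so a connector that only uses the converter for values gets only the `value.converter.` entry.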

That said, not using headers just hides the problem since the underlying issue with the headers usage would still be there, so we have to figure out what is being done with the headers that is causing the problem.

FYI @jsenko

jcechace commented 2 months ago

@obabec ^^