confluentinc / librdkafka

The Apache Kafka C/C++ library

LeaveGroup bug #4402

Closed wolfchimneyrock closed 12 months ago

wolfchimneyrock commented 1 year ago

Description

After upgrading our Kafka brokers to 3.4.1, we started seeing UnsupportedVersionException errors in the broker logs:

[2023-08-21 21:50:47,887] ERROR [KafkaApi-50478] Unexpected error handling request RequestHeader(apiKey=LEAVE_GROUP, apiVersion=1, clientId=rdkafka, correlationId=50, headerVersion=1) -- LeaveGroupRequestData(groupId='<REDACTED>', memberId='rdkafka-72bc6db8-0909-4851-bf7e-514e3cdef376', members=[]) with context RequestContext(header=RequestHeader(apiKey=LEAVE_GROUP, apiVersion=1, clientId=rdkafka, correlationId=50, headerVersion=1), connectionId='<REDACTED>', clientAddress=<REDACTED>, principal=<REDACTED>, listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, clientInformation=ClientInformation(softwareName=confluent-kafka-python, softwareVersion=2.2.0-rdkafka-2.2.0), fromPrivilegedListener=false, principalSerde=Optional[<REDACTED>]) (kafka.server.KafkaApis)
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.UnsupportedVersionException: LeaveGroup response version 1 can only contain one member, got 0 members.
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:950)
    at java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2340)
    at kafka.server.KafkaApis.handleLeaveGroupRequest(KafkaApis.scala:1796)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:196)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.kafka.common.errors.UnsupportedVersionException: LeaveGroup response version 1 can only contain one member, got 0 members.

This happens relatively infrequently. From what I can tell, there is nothing special about the consumer configuration:

config = {
    "bootstrap.servers": BROKER_ENDPOINT,
    "group.id": CONSUMER_GROUP_NAME,
    "enable.partition.eof": False,
}

A similar issue was raised and fixed on Sarama:

https://github.com/IBM/sarama/issues/2486

There, the fix was to implement version 3 of the LeaveGroup protocol. I suspect that the Kafka broker no longer handles LeaveGroup v0 - v1 requests correctly in all cases.
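
To make the version difference concrete, here is a minimal sketch of how the LeaveGroup request body differs between v0 - v2 and v3, based on my reading of the Kafka protocol guide. The function and helper names are hypothetical and this is not librdkafka or Sarama code; it only illustrates that older versions carry a single MemberId while v3 carries a Members array.

```python
import struct


def _string(value):
    # Kafka STRING: int16 big-endian length followed by UTF-8 bytes.
    raw = value.encode("utf-8")
    return struct.pack(">h", len(raw)) + raw


def _nullable_string(value):
    # Kafka NULLABLE_STRING: a length of -1 encodes null.
    if value is None:
        return struct.pack(">h", -1)
    return _string(value)


def encode_leave_group_request(group_id, member_id, api_version,
                               group_instance_id=None):
    """Encode a LeaveGroup request body (without the request header).

    Hypothetical helper for illustration; covers versions 0 to 3 only.
    """
    if api_version < 0 or api_version > 3:
        raise ValueError("only versions 0 to 3 are sketched here")
    if api_version <= 2:
        # v0 - v2: GroupId, then a single MemberId. No Members array.
        return _string(group_id) + _string(member_id)
    # v3: GroupId, then an array of (MemberId, GroupInstanceId).
    member = _string(member_id) + _nullable_string(group_instance_id)
    return _string(group_id) + struct.pack(">i", 1) + member


v1_body = encode_leave_group_request("g", "m", api_version=1)
v3_body = encode_leave_group_request("g", "m", api_version=3)
print(v1_body.hex())  # 00016700016d
print(v3_body.hex())  # 0001670000000100016dffff
```

This would also be consistent with the broker log above showing memberId set and members=[] for the v1 request.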

Also, according to the Kafka protocol definition of LeaveGroup, librdkafka appears to parse the LeaveGroup response for ApiVersion 1 incorrectly: there should be a ThrottleTime int32 before the ErrorCode int16. This is likely the cause of some flaky test results we've experienced.
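
As a sketch of the layout I mean (again based on the protocol guide, with a hypothetical function name, not librdkafka's actual parser): for v1 and v2 the response body is a ThrottleTime int32 followed by an ErrorCode int16, while v0 has only the ErrorCode.

```python
import struct


def parse_leave_group_response(body, api_version):
    """Decode a LeaveGroup response body (header already stripped).

    Hypothetical helper for illustration; covers versions 0 to 2 only.
    """
    if api_version < 0 or api_version > 2:
        raise ValueError("only versions 0 to 2 are sketched here")
    offset = 0
    result = {}
    if api_version >= 1:
        # v1+: ThrottleTime (int32, big-endian) comes first.
        result["throttle_time_ms"] = struct.unpack_from(">i", body, offset)[0]
        offset += 4
    # ErrorCode (int16, big-endian).
    result["error_code"] = struct.unpack_from(">h", body, offset)[0]
    return result


# A v1 body with ThrottleTime=0 and ErrorCode=25 (UNKNOWN_MEMBER_ID).
v1_body = struct.pack(">ih", 0, 25)

print(parse_leave_group_response(v1_body, api_version=1))
# {'throttle_time_ms': 0, 'error_code': 25}

# Parsing the same bytes with the v0 layout reads the first two bytes of
# ThrottleTime as the error code, so the real error is silently lost.
print(parse_leave_group_response(v1_body, api_version=0))
# {'error_code': 0}
```

If a parser skipped the ThrottleTime field, an error response would look like success whenever the throttle time is small, which is the kind of symptom I would expect to show up as intermittent test failures.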

emasab commented 1 year ago

Hello @wolfchimneyrock, librdkafka is parsing ThrottleTime correctly for version 1, here.

I think you have checked rd_kafka_handle_LeaveGroup but that code is never used and needs to be removed.

That Java exception corresponds to this code. But the check doesn't seem correct: in version 1 there are no members to write at all, rather than exactly one.

It should be faster to fix this on the broker side. In librdkafka we're currently focused on the new consumer group protocol, which doesn't have LeaveGroup.