Description
I observed a pathological behavior that is absent from other clients (Java's kafka-clients and franz-go).
With transactions disabled, when any client (including confluent_kafka) is about to send init_producer_id it doesn't use the txn coordinator; it picks a broker at random and sends the request there. If the request fails with a retry-able error from the broker, kafka-clients and franz-go restart the procedure from scratch, picking a random broker again to retry the request, whereas confluent_kafka keeps reusing the original broker. If the problem with that broker is permanent, kafka-clients and franz-go have a chance to land on a healthy node, while confluent_kafka gets stuck.
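The following sketch is purely illustrative of the two retry strategies described above; it is not librdkafka internals, and the broker names, the `send_init_producer_id()` stub, and `RetriableError` are made up for the example:

```python
import random

BROKERS = ["docker-rp-2:9092", "docker-rp-3:9092", "docker-rp-4:9092"]
ISOLATED = {"docker-rp-4:9092"}  # a permanently unreachable broker


class RetriableError(Exception):
    pass


def send_init_producer_id(broker):
    # Stub: the isolated broker always answers with a retry-able error.
    if broker in ISOLATED:
        raise RetriableError("NOT_COORDINATOR")
    return {"producer_id": 42, "epoch": 0}


def acquire_pid_repick(max_attempts=10):
    # kafka-clients / franz-go style: every retry restarts the procedure and
    # picks a broker again, so a healthy node is eventually chosen.
    for _ in range(max_attempts):
        try:
            return send_init_producer_id(random.choice(BROKERS))
        except RetriableError:
            continue
    raise TimeoutError("could not acquire PID")


def acquire_pid_sticky(max_attempts=10):
    # confluent_kafka / librdkafka style: the broker chosen first is reused
    # for every retry, so a permanently broken broker blocks progress.
    broker = random.choice(BROKERS)
    for _ in range(max_attempts):
        try:
            return send_init_producer_id(broker)
        except RetriableError:
            continue
    raise TimeoutError("could not acquire PID")
```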
How to reproduce
I've used Redpanda, but based on the source code the same problem affects Kafka too (I'm not sure Kafka returns a retry-able error for init_producer_id, but it returns unknown_server_error, which confluent_kafka treats the same way and retries). The repro steps (a minimal producer sketch follows the list):
have a 3 node cluster
isolate a node
start producing records
if the producer sends init_producer_id to the isolated node it gets stuck
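A minimal reproduction sketch, assuming a local 3-node cluster reachable at docker-rp-2/3/4:9092 and a pre-created topic named "test" (both are assumptions; adjust to your setup). It uses the same client configuration as in the checklist below:

```python
from confluent_kafka import Producer

# Same configuration as in the checklist below; bootstrap servers and topic
# name are assumptions for this sketch.
producer = Producer({
    "bootstrap.servers": "docker-rp-2:9092,docker-rp-3:9092,docker-rp-4:9092",
    "request.required.acks": -1,
    "retries": 5,
    "enable.idempotence": True,
})

# With idempotence enabled the first produce triggers init_producer_id; if
# that request happens to go to the isolated node, the producer keeps
# retrying it against the same node and never makes progress.
producer.produce("test", value=b"hello")
producer.flush(30)
```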
Client logs (docker-rp-4 was isolated):
%4|1651850857.547|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850860.049|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850862.550|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850863.050|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850873.155|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850883.259|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850886.100|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
Checklist
Please provide the following information:
{"bootstrap.servers": ..., "request.required.acks": -1, "retries": 5, "enable.idempotence": True }