Description
I observed a pathological behavior that is absent from other clients (Java's kafka-clients and franz-go).
With transactions disabled, when any client (including confluent_kafka) is about to send init_producer_id it doesn't use the txn coordinator; it picks a broker at random and sends the request there. If the request fails with a retry-able error from the broker, kafka-clients and franz-go restart the procedure from scratch, picking a random broker again to retry the request, whereas confluent_kafka keeps reusing the original broker. If the problem with that broker is permanent, kafka-clients and franz-go have a chance to land on a healthy node, while confluent_kafka gets stuck.
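The following sketch is purely illustrative of the two retry strategies described above; it is not librdkafka internals, and the broker names, the `send_init_producer_id()` stub, and `RetriableError` are made up for the example:

```python
import random

BROKERS = ["docker-rp-2:9092", "docker-rp-3:9092", "docker-rp-4:9092"]
ISOLATED = {"docker-rp-4:9092"}  # a permanently unreachable broker


class RetriableError(Exception):
    pass


def send_init_producer_id(broker):
    # Stub: the isolated broker always answers with a retry-able error.
    if broker in ISOLATED:
        raise RetriableError("NOT_COORDINATOR")
    return {"producer_id": 42, "epoch": 0}


def acquire_pid_repick(max_attempts=10):
    # kafka-clients / franz-go style: every retry restarts the procedure and
    # picks a broker again, so a healthy node is eventually chosen.
    for _ in range(max_attempts):
        try:
            return send_init_producer_id(random.choice(BROKERS))
        except RetriableError:
            continue
    raise TimeoutError("could not acquire PID")


def acquire_pid_sticky(max_attempts=10):
    # confluent_kafka / librdkafka style: the broker chosen first is reused
    # for every retry, so a permanently broken broker blocks progress.
    broker = random.choice(BROKERS)
    for _ in range(max_attempts):
        try:
            return send_init_producer_id(broker)
        except RetriableError:
            continue
    raise TimeoutError("could not acquire PID")
```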
How to reproduce
I've used Redpanda, but based on the source code the same problem affects Kafka too (I'm not sure Kafka returns a retry-able error for init_producer_id, but it returns unknown_server_error, which confluent_kafka treats the same way and retries). The repro steps (a minimal producer sketch follows the list):
have a 3 node cluster
isolate a node
start producing records
if the producer sends init_producer_id to the isolated node it gets stuck
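A minimal reproduction sketch, assuming a local 3-node cluster reachable at docker-rp-2/3/4:9092 and a pre-created topic named "test" (both are assumptions; adjust to your setup). It uses the same client configuration as in the checklist below:

```python
from confluent_kafka import Producer

# Same configuration as in the checklist below; bootstrap servers and topic
# name are assumptions for this sketch.
producer = Producer({
    "bootstrap.servers": "docker-rp-2:9092,docker-rp-3:9092,docker-rp-4:9092",
    "request.required.acks": -1,
    "retries": 5,
    "enable.idempotence": True,
})

# With idempotence enabled the first produce triggers init_producer_id; if
# that request happens to go to the isolated node, the producer keeps
# retrying it against the same node and never makes progress.
producer.produce("test", value=b"hello")
producer.flush(30)
```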
Client logs (docker-rp-4 was isolated):
%4|1651850857.547|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850860.049|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850862.550|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850863.050|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850873.155|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850883.259|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850886.100|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
Checklist
Please provide the following information:
{"bootstrap.servers": ..., "request.required.acks": -1, "retries": 5, "enable.idempotence": True }