Azure / azure-event-hubs-for-kafka

Azure Event Hubs for Apache Kafka Ecosystems
https://docs.microsoft.com/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview

EventHub timeouts lead to Kafka producer clients getting stuck and always failing with `InvalidPidMappingException` (with default `enable.idempotence=true`) #261

Open lgo opened 3 days ago

lgo commented 3 days ago

I'm reporting this as I've hit this a number of times and while I've worked around it, I'm filing this for two reasons:

I have a relatively low set of timeouts configured, given specific requirements on some topics, with the following Kafka producer client configuration:

retries=1
linger.ms=2
request.timeout.ms=5000
delivery.timeout.ms=10011 # (request timeout * (retry + 1) + linger + 1)
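
For anyone reproducing this, the settings above can be sketched as a plain `java.util.Properties` object. The class and method names here are hypothetical and only for illustration. The keys are the standard producer config names, written as strings so the snippet compiles without kafka-clients on the classpath. The check reflects the producer's rule that `delivery.timeout.ms` must be at least `linger.ms + request.timeout.ms`.

```java
import java.util.Properties;

// Hypothetical helper class; not part of the original report.
class ProducerTimeoutConfig {

    // Builds the producer settings quoted above, with the values as reported.
    // enable.idempotence is left unset, so the client default applies.
    static Properties reportedSettings() {
        Properties props = new Properties();
        props.setProperty("retries", "1");
        props.setProperty("linger.ms", "2");
        props.setProperty("request.timeout.ms", "5000");
        props.setProperty("delivery.timeout.ms", "10011");
        return props;
    }

    // Returns true when delivery.timeout.ms >= linger.ms + request.timeout.ms,
    // the constraint the Kafka producer enforces at construction time.
    static boolean deliveryTimeoutIsValid(Properties props) {
        long linger = Long.parseLong(props.getProperty("linger.ms"));
        long request = Long.parseLong(props.getProperty("request.timeout.ms"));
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms"));
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        Properties props = reportedSettings();
        System.out.println("delivery timeout valid: " + deliveryTimeoutIsValid(props));
    }
}
```

In a real application the same `Properties` object, plus `bootstrap.servers`, the serializers, and the Event Hubs SASL settings, would be passed to the `KafkaProducer` constructor.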

In several situations (e.g. EventHub server restarts due to upgrades, excess consumer load hammering EventHub), we've observed that after we have any timeouts the Kafka producer client will get stuck and always fail with the following error:

org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.

The Kafka client in this situation will not self-recover, even if EventHub has recovered. Recovery is manual, by re-initializing the Kafka producer client. Of course, this only occurs with the default Kafka setting of enable.idempotence=true, which makes the client use producer IDs (PIDs). I've found this easy to reproduce by inducing a high load on EventHub, such as an amplified Kafka consumption rate: say, a consumer deployed 100s of times or a Spark streaming job with many tasks.
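
The manual recovery described above (throw away the stuck producer and build a new one) could be automated roughly as follows. This is a minimal sketch, not the code used in this report: `Sender` and `RecreatingSender` are hypothetical stand-ins so the example compiles without kafka-clients. With the real client, the factory would construct a `KafkaProducer` and the predicate would check for `InvalidPidMappingException`. Note that the real `send` is asynchronous, so the error would normally surface through the returned future or the callback.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for the small part of a producer this sketch needs.
interface Sender extends AutoCloseable {
    void send(String topic, String value) throws Exception;

    @Override
    void close();
}

// Wraps a Sender and replaces it with a fresh instance after a fatal error.
final class RecreatingSender {
    private final Supplier<Sender> factory;
    private final Predicate<Exception> isFatal;
    private Sender current;
    private int recreations = 0;

    RecreatingSender(Supplier<Sender> factory, Predicate<Exception> isFatal) {
        this.factory = factory;
        this.isFatal = isFatal;
        this.current = factory.get();
    }

    synchronized void send(String topic, String value) throws Exception {
        try {
            current.send(topic, value);
        } catch (Exception e) {
            if (isFatal.test(e)) {
                // Discard the stuck client so that later sends use a new one.
                current.close();
                current = factory.get();
                recreations++;
            }
            // The failed record is not retried here; the caller decides.
            throw e;
        }
    }

    synchronized int recreations() {
        return recreations;
    }
}
```

Rethrowing instead of retrying inside the wrapper keeps delivery semantics in the caller's hands, since resending after a producer reset may duplicate records.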

lgo commented 2 days ago

Ah, seems like KIP-588 is relevant. It doesn't seem to be resolved, but does have a couple of changes related to it. It's still a mystery to me why we're only seeing this for EventHub when it's likely we're also seeing timeouts on other clients but without the lasting impact.