I've hit this a number of times and, while I've worked around it, I'm filing this report for two reasons:
1. This feels like a bug in the interaction between EventHub and the Kafka client, given that I have not encountered it in similar situations with Kafka itself; although I'm not even sure I've seen timeouts on actual Kafka deployments with our setup, since we have transparent retries.
2. For anyone else who runs into this mysterious problem, hopefully you find this report and can resolve the issue.
I have a relatively low set of timeouts configured, due to specific requirements on some topics, with the following Kafka producer client configuration:
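The exact values aren't the point here, so the sketch below uses placeholders; the class name LowTimeoutProducer, the bootstrap server, and the specific timeout numbers are illustrative assumptions, not the actual settings:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowTimeoutProducer {
    // Placeholder values only: the real configuration uses topic-specific timeouts
    // and the EventHub SASL_SSL/connection-string settings, which are omitted here.
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "my-namespace.servicebus.windows.net:9093"); // hypothetical EventHub Kafka endpoint
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");    // deliberately low (placeholder)
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "15000");  // deliberately low (placeholder)
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");    // the Kafka default; relevant below
        return new KafkaProducer<>(props);
    }
}
```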
In several situations (e.g. EventHub server restarts due to upgrades, or excess consumer load hammering EventHub), we've observed that after any timeouts occur, the Kafka producer client gets stuck and always fails with the following error:
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
The Kafka client in this situation will not self-recover, even if EventHub has recovered. Recovery is manual, by re-initializing the Kafka producer client. Of course, this only occurs with the default Kafka setting of enable.idempotence=true, which introduces client transaction IDs. I've found this easy to reproduce by inducing high load on EventHub, such as an amplified Kafka consumption rate: say, a consumer deployed hundreds of times, or a Spark streaming job with many tasks.
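A minimal sketch of that manual recovery, assuming the InvalidPidMappingException surfaces through the producer's send callback; RecoveringSender is a hypothetical wrapper, and LowTimeoutProducer.build() is the illustrative factory from the sketch above:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.InvalidPidMappingException;

public class RecoveringSender {
    private final AtomicReference<KafkaProducer<String, String>> producer =
            new AtomicReference<>(LowTimeoutProducer.build());

    public void send(String topic, String key, String value) {
        producer.get().send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception instanceof InvalidPidMappingException) {
                // The producer is fatally stuck at this point and will keep failing,
                // so swap in a fresh instance and force-close the stuck one.
                // close(Duration.ZERO) because this callback runs on the producer's I/O thread.
                KafkaProducer<String, String> stuck = producer.getAndSet(LowTimeoutProducer.build());
                stuck.close(Duration.ZERO);
            }
        });
    }
}
```

Whether the rebuild happens in the callback or from a supervising thread is a design choice; the essential part, per the behavior described above, is that nothing short of a new producer instance recovers.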
Ah, it seems KIP-588 is relevant. It doesn't appear to be resolved, but it does have a couple of changes related to it. It's still a mystery to me why we're only seeing this for EventHub, when it's likely we're also seeing timeouts on other clients but without the lasting impact.