EventHub timeouts lead to Kafka producer clients getting stuck and always failing with `InvalidPidMappingException` (with default `enable.idempotence=true`)

I've hit this a number of times. I've worked around it, but I'm filing this issue for two reasons:

1. This feels like a bug in the interaction between EventHub and the Kafka client, given that I have not encountered it in similar situations with Kafka. That said, I'm not sure I've ever seen timeouts on actual Kafka deployments with our setup, as we have transparent retries.
2. For anyone else who runs into this mysterious problem: hopefully you find this and can resolve the issue.

I have relatively low timeouts configured, given specific requirements on some topics, with the following Kafka producer client configuration:

retries=1
linger.ms=2
request.timeout.ms=5000
delivery.timeout.ms=10011 # (request timeout * (retry + 1) + linger + 1)
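
The formula in the comment can be checked against the values listed above. The following is a minimal sketch, not part of the original setup; the class and method names are hypothetical. With the listed values the formula evaluates to 10003, so the configured 10011 sits slightly above it. The Kafka client itself only requires `delivery.timeout.ms` to be at least `linger.ms + request.timeout.ms`.

```java
/** Hypothetical helper that evaluates the timeout formula from the comment above. */
final class DeliveryTimeoutCheck {

    /** request timeout * (retries + 1) + linger + 1, as written in the config comment. */
    static long deliveryTimeoutMs(long requestTimeoutMs, int retries, long lingerMs) {
        return requestTimeoutMs * (retries + 1) + lingerMs + 1;
    }

    /** The producer rejects configs where delivery.timeout.ms < linger.ms + request.timeout.ms. */
    static boolean meetsKafkaMinimum(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    public static void main(String[] args) {
        long computed = deliveryTimeoutMs(5000, 1, 2);
        System.out.println(computed);                            // prints 10003
        System.out.println(meetsKafkaMinimum(10011, 5000, 2));   // prints true
    }
}
```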

In several situations (e.g. EventHub server restarts due to upgrades, excess consumer load hammering EventHub), we've observed that after any timeout occurs, the Kafka producer client gets stuck and always fails with the following error:

org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.

In this state the Kafka client does not recover on its own, even after EventHub has recovered. Recovery is manual: the Kafka producer client has to be re-initialized. Of course, this only occurs with the default Kafka setting `enable.idempotence=true`, under which the client uses a broker-assigned producer ID (PID). I've found this easy to reproduce by putting EventHub under high load, for example with an amplified Kafka consumption rate: a consumer deployed hundreds of times, or a Spark streaming job with many tasks.
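
The manual recovery described above (re-initializing the producer) could be automated along these lines. This is only a sketch under stated assumptions, not the workaround actually used here: `Sender` and `RecreatingSender` are hypothetical names, and `Sender` is a stand-in so the example runs without the Kafka client on the classpath. In a real application the factory would presumably build a new `KafkaProducer` from the same properties and the predicate would test for `InvalidPidMappingException`. Note also that the real producer reports send failures asynchronously (via the returned `Future` or a callback), whereas this sketch is synchronous for brevity.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical stand-in for the small part of a producer this sketch needs. */
interface Sender {
    void send(String record) throws Exception;
    void close();
}

/** Replaces the underlying sender once it hits an error it cannot recover from. */
final class RecreatingSender {
    private final Supplier<Sender> factory;      // builds a fresh sender (assumption: cheap enough to call on failure)
    private final Predicate<Throwable> isFatal;  // decides which errors mean "this instance is stuck"
    private Sender current;

    RecreatingSender(Supplier<Sender> factory, Predicate<Throwable> isFatal) {
        this.factory = factory;
        this.isFatal = isFatal;
        this.current = factory.get();
    }

    synchronized void send(String record) throws Exception {
        try {
            current.send(record);
        } catch (Exception e) {
            if (isFatal.test(e)) {
                // Discard the stuck instance so the next send starts from a clean one.
                current.close();
                current = factory.get();
            }
            // The failed record is not resent here; the caller decides whether to retry.
            throw e;
        }
    }
}
```

Rethrowing instead of silently resending keeps delivery semantics in the caller's hands, which matters because a fresh producer instance no longer shares idempotence state with the one it replaced.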

Azure/azure-event-hubs-for-kafka, issue #261