Description
We have observed a series of pod restarts accompanied by a large number of exceptions in the Logback error logs. Latency during these occurrences is also significant. During peak hours we see many timeout exceptions in the following format:
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for eventhub-partitionId:120000 ms has passed since batch creation
We see this exception tens of thousands of times during peak hours. As a consequence, our Kubernetes pods restart because of these exceptions, which in turn causes the latency. We are using the Microsoft-recommended configurations on a dedicated Event Hubs cluster with 12 CUs.
Azure Log Analytics (ALA) logs show exceptions thrown during this time window. All of the logs contain the same TimeoutException, across various Event Hub partitions.
We have the following configured in our system now:
spring.cloud.stream.kafka.binder.configuration.metadata.max.age.ms=180000
spring.cloud.stream.kafka.binder.configuration.connections.max.idle.ms=180000
spring.cloud.stream.kafka.binder.configuration.max.request.size=1000000
spring.cloud.stream.kafka.binder.configuration.retries=0
spring.cloud.stream.kafka.binder.configuration.request.timeout.ms=60000
spring.cloud.stream.kafka.binder.configuration.linger.ms=150
spring.cloud.stream.kafka.binder.configuration.delivery.timeout.ms=120000
spring.cloud.stream.kafka.binder.configuration.max.poll.records=500
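In case a standalone repro helps, the binder settings above should correspond to plain Kafka producer properties roughly as in the sketch below. This is only an illustration: the namespace, connection string, topic, and serializers are placeholders and assumptions, not our real values, and max.poll.records is a consumer-side setting, so it is omitted from the producer properties.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerRepro {
    public static void main(String[] args) {
        Properties props = new Properties();

        // Placeholder Event Hubs Kafka endpoint and credentials; real values are not shared in this issue.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "NAMESPACE.servicebus.windows.net:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"$ConnectionString\" password=\"<connection-string>\";");

        // Producer settings mirroring the Spring Cloud Stream binder configuration above.
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 180000);
        props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 180000);
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1000000);
        props.put(ProducerConfig.RETRIES_CONFIG, 0);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 150);
        // Matches the "120000 ms has passed since batch creation" text in the exception.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);

        // Placeholder serializers for the repro.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(new ProducerRecord<>("<topic>", "key", "value"));
        }
    }
}

With retries=0 and delivery.timeout.ms=120000, any batch that cannot be delivered within two minutes of creation is expired rather than retried, which is consistent with the error text we are seeing.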
How to reproduce
These errors have been happening for the past two weeks. We see them during peak hours, followed by multiple pod restarts and increased latency.
We went through the recommendation in the following Stack Overflow post, but we are already using the recommended configurations:
https://stackoverflow.com/questions/58010247/azure-eventhub-kafka-org-apache-kafka-common-errors-timeoutexception-for-some-of
Has it worked previously?
Our system was working without any errors until 5 April 2023. After that we started seeing this error.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed or where adequate information has not been provided.
Please provide the relevant information for the following items:
[x] SDK (include version info): 17
[x] Sample you're having trouble with: NA
[x] If using Apache Kafka Java clients or a framework that uses Apache Kafka Java clients, version: <REPLACE with e.g., 1.1.0>
[x] Kafka client configuration: provided above
[x] Namespace and EventHub/topic name: Dedicated tier, Cannot share the details in the open forum
[x] Consumer or producer failure: producer failure
[x] Timestamps in UTC: since 05 April, during peak EMEA hours
[x] group.id or client.id: <REPLACE with e.g., group.id=cg-name>
[x] Logs provided (with debug-level logging enabled if possible, e.g. log4j.rootLogger=DEBUG) or exception call stack: org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for <>-83:120000 ms has passed since batch creation
[x] Standalone repro: NA
[x] Operating system: Linux (version unknown)
[x] Critical issue
If this is a question on basic functionality, please verify the following:
[x] Port 9093 should not be blocked by firewall ("broker cannot be found" errors)
[x] Pinging FQDN should return cluster DNS resolution (e.g. $ ping namespace.servicebus.windows.net returns ~ ns-eh2-prod-am3-516.cloudapp.net [13.69.64.0])
[x] Namespace should be either Standard or Dedicated tier, not Basic (TopicAuthorization errors)