Azure / azure-event-hubs-for-kafka

Azure Event Hubs for Apache Kafka Ecosystems
https://docs.microsoft.com/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview

Unable to produce to EventHub, failing with `Error: NETWORK_EXCEPTION. Error Message: Disconnected from node `, caused by `max.request.size` #255

Open lgo opened 6 months ago

lgo commented 6 months ago

:wave: I'm filing this one mostly as feedback as to whether the failure mode could be a little more obvious or graceful for users. Also, I hope others may find this useful if they go searching for the same errors. Recently, I found myself setting up KafkaMirrorMaker2 for EventHub-to-EventHub mirroring.

The same setup had already been in use elsewhere, and happened to have max.request.size set to 20971520 (20 MiB) for the producer. When I pointed that setup at EventHub, I ran into errors on the Kafka producer that I was unable to pin down. They were along the lines of:

Got error produce response with correlation id 6397 on topic-partition <MYTOPIC>-7, retrying (2147481516 attempts left). Error: NETWORK_EXCEPTION. Error Message: Disconnected from node 0 (org.apache.kafka.clients.producer.internals.Sender)

Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)

Cancelled in-flight PRODUCE request with correlation id 6391 due to node 0 being disconnected (elapsed time since creation: 45ms, elapsed time since send: 45ms, request timeout: 30000ms) (org.apache.kafka.clients.NetworkClient)

Now, I eventually combed through plenty of resources on getting things set up, like:

Eventually I figured it out: after applying every configuration from the recommendations guide, things got unwedged once max.request.size was set. The oversized requests came up because the mirror source topic has plenty of data (for testing). In hindsight, the recommendations guide does indicate this will happen:

The service will close connections if requests larger than 1,046,528 bytes are sent. This value must be changed and will cause issues in high-throughput produce scenarios.
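For anyone who lands here, a minimal producer config that stays under that limit looks roughly like the sketch below. The namespace and connection string are placeholders, and the SASL/PLAIN settings are the standard ones from this repo's quickstarts rather than anything specific to my setup:

# client.properties (placeholders in angle brackets)
bootstrap.servers=<NAMESPACE>.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<EVENT HUBS CONNECTION STRING>";
# keep requests below the documented 1,046,528-byte limit
max.request.size=1046527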

There are maybe a few things that could be improved here:

Thanks!

ianarsenault commented 1 month ago

+1 on this. This is the error I am also hitting, and it has taken a lot of time to sort out the root cause. It would be helpful to have more details around this and what to check!

@lgo Could you share what your producerConf ended up being to resolve this issue?

lgo commented 2 days ago

@ianarsenault sorry I didn't get back to you earlier! We use the value from the EventHub recommended Kafka configuration doc I linked above. Specifically, this:

# 1046528 - 1 (the documented range is <1046528)
max.request.size=1046527

(And, with MirrorMaker2 / Strimzi's KMM2, a bit of fiddling is required to get this to apply correctly, via producer.max.request.size / producer.override.max.request.size.)
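Roughly, for a Connect-based deployment, that fiddling amounts to something like the sketch below (the connector name is illustrative, and the exact wiring depends on how MirrorMaker2 is run). The worker has to allow per-connector client overrides before producer.override.* keys in the connector config take effect:

# Connect worker config: allow per-connector producer/consumer overrides
connector.client.config.override.policy=All

# MirrorSourceConnector config (illustrative name)
name=eventhub-mirror-source
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
producer.override.max.request.size=1046527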

We also use some other adjusted values with specific recommended ranges from the doc, like:

metadata.max.idle.ms=180000
metadata.max.age.ms=180000
connections.max.idle.ms=180000
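For the Connect-based route sketched above, these would presumably need the same producer.override. prefixing, e.g.:

producer.override.metadata.max.idle.ms=180000
producer.override.metadata.max.age.ms=180000
producer.override.connections.max.idle.ms=180000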