Open amotl opened 4 years ago
Hi again,
based on the findings outlined above, I compiled a list of suggestions for our team the other day, which I also wanted to share here.
With kind regards, Andreas.
Based on findings from others when running Kafka on Azure and especially when using Azure Event Hubs with librdkafka, Magnus Edenhill, the author of librdkafka, implemented some fixes between version 1.3.0 and 1.5.0.
Magnus Edenhill:
produce/consume hang after partition goes away and comes back which is fixed in v1.4.2
-- https://github.com/confluentinc/confluent-kafka-python/issues/899#issuecomment-654777590
Magnus Edenhill:
As for the idle connection closing, I'm suspecting that ~their~ [the Azure] load balancers are silently dropping the connection without sending an RST or FIN to the client, so there is no way for the client TCP stack to know the connection was closed until a request (or TCP keepalive) is sent. If the connection was properly TCP-closed the client would get notified instantly of the close.
The fix in https://github.com/edenhill/librdkafka/commit/0527457 will go into v1.5.0 which is scheduled for early June. It will prefer to use the least idle connection when selecting "any" broker connection for metadata refresh, etc. This in combination with a `topic.metadata.refresh.interval.ms` < the LB/Kafka idle connection timeout should hopefully minimize the number of these timeout-due-to-killed-connection errors.
Prefer least-idle connection when randomly selecting broker for metadata queries, etc. (#2845)
Reasons:
- this connection is most likely to function properly.
- allows truly idle connections to be killed by the broker's/LB's
idle connection reaper.
This will fix issues like "Metadata request timed out" when the periodic
metadata refresh picks an idle (typically bootstrap) connection which
has exceeded the load-balancer's idle time (which silently kills the
connection without sending FIN/RST to the client).
-- via: https://github.com/edenhill/librdkafka/issues/2845
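The selection idea described in the commit message above can be illustrated with a small Python sketch. This is not librdkafka's actual implementation; the `BrokerConnection` type and the `pick_least_idle` helper are hypothetical names used only to show what "prefer the least-idle connection" means.

```python
import time
from dataclasses import dataclass


@dataclass
class BrokerConnection:
    """Hypothetical stand-in for a client-side broker connection."""
    name: str
    last_activity: float  # monotonic timestamp of the last send/receive


def pick_least_idle(connections, now=None):
    """Return the connection that was used most recently.

    A recently used connection is the least likely to have been silently
    dropped by a load balancer's idle reaper.
    """
    now = time.monotonic() if now is None else now
    # Idle time is the time elapsed since the last activity; pick the minimum.
    return min(connections, key=lambda conn: now - conn.last_activity)


# Example: the bootstrap connection has been idle for 300 s, broker-1 for 10 s.
connections = [
    BrokerConnection("bootstrap", last_activity=0.0),
    BrokerConnection("broker-1", last_activity=290.0),
]
chosen = pick_least_idle(connections, now=300.0)
```

In this example the bootstrap connection, which has exceeded a 240-second idle window, is never picked for the metadata request, so it can time out on its own without affecting the client.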
Charles Culver:
After finding recommended configuration settings for Azure Event Hubs (Microsoft's own Kafka cloud implementation) here, we added the following settings to consumers and producers and are seeing success.
-- https://github.com/edenhill/librdkafka/issues/2845#issuecomment-626027296
Here are the recommended configurations for using Azure Event Hubs from Apache Kafka client applications:
In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" on the host level.
These settings can probably be applied to the nodes (VMs) as well as to the Kubernetes pods/containers.
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8
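To see how these three values relate to the 4-minute idle timeout discussed in this thread, the arithmetic can be sketched in Python. The helper names are hypothetical, and the sketch assumes that keepalive probes count as traffic for the load balancer, so that a probe sent before the idle timeout keeps the connection alive.

```python
# Illustrative arithmetic only; the helper names are hypothetical.
AZURE_LB_IDLE_TIMEOUT_S = 240  # idle timeout cited in this thread (4 minutes)

KEEPALIVE_TIME_S = 120   # net.ipv4.tcp_keepalive_time
KEEPALIVE_INTVL_S = 30   # net.ipv4.tcp_keepalive_intvl
KEEPALIVE_PROBES = 8     # net.ipv4.tcp_keepalive_probes


def first_probe_beats_idle_timeout(keepalive_time_s, lb_idle_timeout_s):
    """True if the first keepalive probe is sent before the LB idle timeout."""
    return keepalive_time_s < lb_idle_timeout_s


def dead_peer_detection_s(keepalive_time_s, intvl_s, probes):
    """Approximate seconds of idleness until the kernel gives up on a dead peer."""
    return keepalive_time_s + intvl_s * probes


probe_in_time = first_probe_beats_idle_timeout(
    KEEPALIVE_TIME_S, AZURE_LB_IDLE_TIMEOUT_S
)
detection_s = dead_peer_detection_s(
    KEEPALIVE_TIME_S, KEEPALIVE_INTVL_S, KEEPALIVE_PROBES
)
```

With the values above, the first probe goes out after 120 seconds of idleness, well inside the 240-second window, and a peer that has really gone away is detected after roughly 360 seconds.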
Following all of these observations, we planned to upgrade to librdkafka 1.5.0 back then. However, ...
There has been a regression in 1.5.0 regarding roundrobin crashes reported through https://github.com/edenhill/librdkafka/issues/3024 and https://github.com/confluentinc/confluent-kafka-dotnet/issues/1366 as well as https://github.com/confluentinc/confluent-kafka-dotnet/issues/1369, which was designated as a critical issue.
@mhowlett reported on behalf of @wwarby (https://github.com/confluentinc/confluent-kafka-dotnet/issues/1366#issuecomment-668669487):
Not sure if mine is the same issue or related in any way, but I've just run into massive instability after upgrading some high throughput C# apps from v 1.4.4 to 1.5.0. Rolled back to 1.4.4 and it solved the problem.
In my case, I'm consuming data from topics that already exist and I don't have topic auto-creation enabled - I'm running multiple instances of the C# app, each instantiating a single consumer to join the consumer group.
As we were not sure whether we might run into this on our production systems, we considered ourselves better off including the fix from https://github.com/edenhill/librdkafka/pull/3049. However, back then, the next official release was said to be 1.6.0, planned for October 15, 2020, so we decided to go for librdkafka-1.5.0-e320a2d, which has kept our systems happy so far.
Thank you so much!
Now, we are happy to see the original milestone 1.6.0 was repurposed to milestone 1.5.2 with a corresponding pre-release of v1.5.2-RC1 and will consider going for that within the next iteration.
P.S.: While going through this story, we also discovered other open issues about stalled/stuck consumers observed with 1.4.2 at https://github.com/edenhill/librdkafka/issues/2933, https://github.com/edenhill/librdkafka/issues/2944 and https://github.com/edenhill/librdkafka/issues/3082. As far as we have been able to see, these might be related to changing IP addresses while performing rolling updates on Kafka instances within a Kubernetes cluster. Our thoughts on this: Just don't do that ;]!
JFYI; I also just shared this at https://github.com/Azure/azure-functions-kafka-extension/issues/185.
Wow, that's an amazing write-up and RCA, @amotl ! Thank you for sharing your findings with the community.
Dear Magnus,
thank you for your comment, appreciate it. Thanks also for just releasing librdkafka 1.5.2 GA incorporating all the recent improvements in this regard and beyond.
Keep up the spirit and with kind regards, Andreas.
Hi again,
as I realized this might not have been entirely clear from my explanations above, I wanted to add some more details here regarding the recommended configuration settings when running on Azure, as also outlined at https://github.com/Azure/azure-functions-kafka-extension/issues/187.
Microsoft published them at [1] in general, but I wanted to specifically point out here that the configuration properties `socket.keepalive.enable` and `metadata.max.age.ms` are apparently crucial to get right within Azure environments in order to save users from running into the networking issues reported within this discussion.
Property | Recommended Values | Permitted Range | Notes |
---|---|---|---|
socket.keepalive.enable | true | | Necessary if connection is expected to idle. Azure will close inbound TCP idle > 240,000 ms. |
metadata.max.age.ms | ~ 180000 | < 240000 | Can be lowered to pick up metadata changes sooner. |
While these settings [1] are primarily dedicated to Azure Event Hubs, I believe they will also apply to all communications with vanilla Kafka server components, as the underlying networking infrastructure problem will be the same.
We are now running with these settings and are happy so far:
socket.keepalive.enable = true
metadata.max.age.ms = 180000
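As a rough illustration, the two settings can be written as a plain Python dictionary of librdkafka properties, together with a small sanity check against the 240,000 ms idle limit from the table above. The `check_azure_settings` helper is hypothetical and not part of any library; a dictionary like this would typically be extended with `bootstrap.servers` and credentials and then passed to a client such as `confluent_kafka.Consumer`.

```python
AZURE_IDLE_TIMEOUT_MS = 240_000  # idle limit quoted in the table above

# The two properties discussed in this comment, as librdkafka config keys.
conf = {
    "socket.keepalive.enable": True,
    "metadata.max.age.ms": 180_000,
}


def check_azure_settings(conf, idle_timeout_ms=AZURE_IDLE_TIMEOUT_MS):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: it only checks the two properties from this comment.
    """
    problems = []
    if not conf.get("socket.keepalive.enable", False):
        problems.append("socket.keepalive.enable should be true")
    max_age_ms = conf.get("metadata.max.age.ms")
    if max_age_ms is None or max_age_ms >= idle_timeout_ms:
        problems.append(
            f"metadata.max.age.ms should be set below {idle_timeout_ms}"
        )
    return problems


problems = check_azure_settings(conf)
```

The check deliberately treats an unset `metadata.max.age.ms` as a problem, on the assumption that relying on the library default is not what is wanted in this environment.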
Thanks for listening and with kind regards, Andreas.
Hi again,
I quickly wanted to share some more observations on this topic. On our Azure environment, we still saw some partitions occasionally stalling on the consumer side, even after applying all of the mitigations outlined above. However, after just moving on to `v1.6.0-PRE3`, things appear to be really smooth now.
Kudos to @edenhill, @mhowlett and all people involved who added some recent improvements to librdkafka!
With kind regards, Andreas.
Added `connections.max.idle.ms` to allow the client to close idle connections before the LB does.
@edenhill What is the probable date for the 1.8 release? We are waiting for the `connections.max.idle.ms` change.
It is soak testing now, if all goes well we'll release within a week.
@edenhill Is v1.8.0-RC2 the release candidate that is in soak testing?
@koushikchitta Yep, we'll be releasing it this week.
@edenhill, in some systems it is very important to keep latency low. Is there an option to renew the connection after `connections.max.idle.ms`, instead of only closing it?
@amotl, your investigation was amazing, thanks!
Just to be clear, I am interpreting these notes as the following client configuration. I would love to know if there are any other recommended settings to help connect to Azure Event Hubs.
var config = new ClientConfig
{
    SaslMechanism = SaslMechanism.Plain,
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslUsername = credentials.Username,
    SaslPassword = credentials.Password,
    BootstrapServers = $"{broker}:9093",

    // According to this documentation, these two settings are also critical for
    // Azure environments:
    // https://github.com/confluentinc/librdkafka/issues/3109#issuecomment-714471123
    SocketKeepaliveEnable = true,
    MetadataMaxAgeMs = 30000,

    // Note: this needs to time out sooner than the Azure Event Hubs load balancer timeout
    // to avoid unexpected disconnects from the server after long periods of time:
    // https://stackoverflow.com/questions/58010247/azure-eventhub-kafka-org-apache-kafka-common-errors-timeoutexception-for-some-of/58385324#58385324
    ConnectionsMaxIdleMs = ((60 * 4) - 30) * 1000 // 3m30s
};
Hi there,
first things first: Thanks for the tremendous amount of work you are putting into librdkafka, @edenhill. You know who you are.
Introduction
This is not meant to be a specific bug report as we believe the issues we have been experiencing when using librdkafka for connecting to Azure Event Hubs have already been mitigated within librdkafka 1.5.2 and newer. In fact, they might not have been specific to Azure Event Hubs anyway but also might have tripped others when just running Apache Kafka or the Confluent Stack on Azure in general.
Instead, we wanted to share our findings as a wrap-up and future reference for others looking into similar issues. In this spirit, apologies for not completing the checklist. The issue can well be closed right away.
The topics span Azure networking (problems) in general, as well as things related to Kafka and Kubernetes.
So, here we go.
General research
Azure LB closing idle network connections
The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.
The LB does not even bother to send RST packets to either side of the connection, so client and server sockets will most probably try to reuse these dead connections.
In turn, services will either hit their individual socket timeouts, or the kernel will keep retransmitting with backoff for another 15+ minutes until it considers the connection to be dead.
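One general way for an application to notice such silently dropped connections earlier is to enable TCP keepalive on its own sockets. The following Python sketch shows the idea; the `enable_keepalive` helper is hypothetical, the default values mirror the sysctl settings quoted elsewhere in this thread, and the per-socket tuning options are only applied where the platform exposes them (they are available on Linux).

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=8):
    """Enable TCP keepalive on a socket and tune it where supported."""
    # Portable part: switch keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-specific part: per-socket overrides of the kernel defaults.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Number of unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Usage: configure a socket before connecting it.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Clients built on librdkafka do not manage sockets this way themselves; for them, the corresponding switch is the `socket.keepalive.enable` property discussed in this thread.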
Quotes
Resources
Using Kafka and Event Hubs on Azure
Quotes
Resources
With kind regards, Andreas.