Robustness and resiliency on Azure

amotl commented 4 years ago

Hi there,

first things first: Thanks for the tremendous amount of work you are putting into librdkafka, @edenhill. You know who you are.

Introduction

This is not meant to be a specific bug report as we believe the issues we have been experiencing when using librdkafka for connecting to Azure Event Hubs have already been mitigated within librdkafka 1.5.2 and newer. In fact, they might not have been specific to Azure Event Hubs anyway but also might have tripped others when just running Apache Kafka or the Confluent Stack on Azure in general.

Instead, we wanted to share our findings as a wrap-up and future reference for others looking into similar issues. In that spirit, apologies for not completing the checklist. The issue can well be closed right away.

The topics span Azure networking problems in general, as well as matters related to Kafka and Kubernetes.

So, here we go.

General research

Azure LB closing idle network connections

The problem here is that Azure's network load-balancing components silently drop idle network connections after 4 minutes.

The LB does not even bother to send RST packets to either communication partner, so client and server sockets will most probably try to reuse these dead connections.

In turn, services will either hit their individual socket timeouts, or the kernel will keep retransmitting with exponential backoff for another 15+ minutes before it considers the connection dead.
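
To illustrate where the "15+ minutes" can come from, here is a rough back-of-the-envelope sketch. It assumes Linux defaults that are not stated in this thread: net.ipv4.tcp_retries2 = 15, a retransmission timeout that starts at 200 ms, doubles after every unanswered attempt, and is capped at 120 s. Real connections derive the timeout from the measured round-trip time, so treat the result as an estimate only.

```python
def retransmission_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long the kernel keeps retransmitting on a dead connection.

    Assumed values (Linux defaults, not taken from this thread):
    tcp_retries2=15, RTO starting at 200 ms, doubling per attempt, capped at 120 s.
    """
    total = 0.0
    rto = rto_min
    # One wait for the original segment plus one wait per retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = retransmission_give_up_seconds()
    print(f"{seconds:.1f} s, about {seconds / 60:.1f} minutes")
```

Under these assumptions the estimate comes out at a bit over 15 minutes, which matches the order of magnitude observed above.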

Quotes

TL;DR: Azure has a nasty artificial limitation that results in being unable to use long-lived TCP connections that have >= 4 minutes of radio silence at any given point.

They screwed it up so hard that when connection does timeout, they acknowledge the following TCP packets with an ok flag that makes the sender think “everything is okay - the data I sent was received successfully”, which is 100 % unacceptable way to handle error conditions.

This caused me so much pain and loss of productive work time.

-- https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

Using Kafka and Event Hubs on Azure

Quotes

"The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in."

-- https://stackoverflow.com/a/58385324

Magnus Edenhill:

Joakim, we've seen a couple of similar reports for users on Azure and we can't really provide an explanation, something stalls the request/response until it times out, and we believe this to be outside the client, so most likely something in Azure. I recommend opening an issue with the Azure folks.

Joakim Blach Andersen:

I got an answer from Azure: The service closes idle connections (idle here means no request received for 10 minutes). It happens even when tcp keep-alive is enabled because that config only keeps the connection open but does not generate protocol requests. In your case, you have only observed the error on the idle event hub, it may be related to the idle connection being closed and the client SDK does not handle that correctly (Kafka).

-- https://github.com/edenhill/librdkafka/issues/2845

Magnus Edenhill:

We're trying to work around these Azure weak idle disconnects in the upcoming v1.5.0 release by reusing the least idle connection for things like metadata requests, which should keep that connection alive and not cause these idle disconnect request timeouts.

-- https://github.com/confluentinc/confluent-kafka-dotnet/issues/1305#issuecomment-641092856

With kind regards, Andreas.

amotl commented 4 years ago

Hi again,

based on the findings outlined above, I compiled a list of suggestions for our team the other day, which I also wanted to share here.

With kind regards, Andreas.

Proposals

Upgrade to librdkafka 1.5.0

Based on findings from others when running Kafka on Azure and especially when using Azure Event Hubs with librdkafka, Magnus Edenhill, the author of librdkafka, implemented some fixes between version 1.3.0 and 1.5.0.

Magnus Edenhill:

produce/consume hang after partition goes away and comes back which is fixed in v1.4.2

-- https://github.com/confluentinc/confluent-kafka-python/issues/899#issuecomment-654777590

Magnus Edenhill:

As for the idle connection closing, I'm suspecting that [the Azure] load balancers are silently dropping the connection without sending an RST or FIN to the client, so there is no way for the client TCP stack to know the connection was closed until a request (or TCP keepalive) is sent. If the connection was properly TCP-closed the client would get notified instantly of the close.

The fix in https://github.com/edenhill/librdkafka/commit/0527457 will go into v1.5.0 which is scheduled for early June. It will prefer to use the least idle connection when selecting "any" broker connection for metadata refresh, etc. This in combination with a topic.metadata.refresh.interval.ms < the LB/Kafka idle connection timeout should hopefully minimize the number of these timeout-due-to-killed-connection errors.

Prefer least-idle connection when randomly selecting broker for metadata queries, etc. (#2845)

Reasons:
- this connection is most likely to function properly.
- allows truly idle connections to be killed by the broker's/LB's
  idle connection reaper.

This will fix issues like "Metadata request timed out" when the periodic
metadata refresh picks an idle (typically bootstrap) connection which
has exceeded the load-balancer's idle time (which silently kills the
connection without sending FIN/RST to the client).

-- via: https://github.com/edenhill/librdkafka/issues/2845
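
The selection idea from the commit message can be sketched roughly as follows. This is only an illustration of "prefer the least idle connection", not librdkafka's actual implementation; the BrokerConnection class, its fields, and the function name are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class BrokerConnection:
    # Hypothetical stand-in for a client-side broker connection.
    name: str
    last_activity: float  # time.monotonic() value of the last request or response


def pick_least_idle(connections):
    """Return the most recently used (that is, least idle) connection."""
    if not connections:
        raise ValueError("no broker connections available")
    return max(connections, key=lambda c: c.last_activity)


if __name__ == "__main__":
    now = time.monotonic()
    conns = [
        BrokerConnection("bootstrap", last_activity=now - 300),  # idle for 5 minutes
        BrokerConnection("broker-1", last_activity=now - 2),
        BrokerConnection("broker-2", last_activity=now - 45),
    ]
    # The long-idle bootstrap connection is never chosen for the metadata refresh.
    print(pick_least_idle(conns).name)
```

Picking the most recently used connection means a periodic metadata refresh rides on a connection that is known to be alive, while truly idle connections are left to be reaped.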

Apply recommended settings for Kafka on Azure

Charles Culver:

After finding recommended configuration settings for Azure Event Hubs (Microsoft's own Kafka cloud implementation) here, we added the following settings to consumers and producers and are seeing success.

-- https://github.com/edenhill/librdkafka/issues/2845#issuecomment-626027296

Here are the recommended configurations for using Azure Event Hubs from Apache Kafka client applications:

amotl commented 4 years ago

Kernel TCP settings for Azure VMs

In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" problem on the host level.

These settings can probably be applied to the nodes (VMs) as well as to the Kubernetes pods/containers.

net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8
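
With these values the first keepalive probe goes out after 120 s of idleness, well before the 4-minute mark, and an unresponsive peer is declared dead after 120 + 30 * 8 = 360 s. Where changing sysctls is not possible, the same values can in principle be set per socket. The following is a minimal sketch, assuming a platform whose Python socket module exposes the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants (typically Linux); the helper names are made up for illustration.

```python
import socket


def dead_peer_detection_seconds(idle=120, interval=30, probes=8):
    """Worst-case time until the kernel declares a silent peer dead."""
    return idle + interval * probes


def enable_keepalive(sock, idle=120, interval=30, probes=8):
    """Apply the keepalive values above to a single socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are platform-specific, so guard each one.
    options = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    )
    for option, value in options:
        if hasattr(socket, option):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, option), value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print("dead peer detected after", dead_peer_detection_seconds(), "seconds")
```

Note that keepalive probes only keep the TCP connection busy; as quoted earlier in this thread, they do not count as protocol requests for services that apply their own idle timeout.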

Resources

Azure closing idle network connections
TCP settings for Azure VMs by @wbuchwalter and @telmosampaio
TCP settings for Azure VMs from the MSDN forums

amotl commented 4 years ago

Following all of these observations, we planned to upgrade to librdkafka 1.5.0 back then. However, ...

There has been a regression in 1.5.0 regarding roundrobin crashes reported through https://github.com/edenhill/librdkafka/issues/3024 and https://github.com/confluentinc/confluent-kafka-dotnet/issues/1366 as well as https://github.com/confluentinc/confluent-kafka-dotnet/issues/1369, which was designated as a critical issue.

@mhowlett reported on behalf of @wwarby (https://github.com/confluentinc/confluent-kafka-dotnet/issues/1366#issuecomment-668669487):

Not sure if mine is the same issue or related in any way, but I've just run into massive instability after upgrading some high throughput C# apps from v 1.4.4 to 1.5.0. Rolled back to 1.4.4 and it solved the problem.

In my case, I'm consuming data from topics that already exist and I don't have topic auto-creation enabled - I'm running multiple instances of the C# app, each instantiating a single consumer to join the consumer group.

As we were not sure whether we might run into this on our production systems, we considered ourselves better off including the fix from https://github.com/edenhill/librdkafka/pull/3049. However, back then, the next official release was said to be 1.6.0, scheduled for October 15, 2020, so we decided to go for librdkafka-1.5.0-e320a2d, which has kept our systems happy so far.

Thank you so much!

Now, we are happy to see the original milestone 1.6.0 was repurposed to milestone 1.5.2 with a corresponding pre-release of v1.5.2-RC1 and will consider going for that within the next iteration.

P.S.: While going through this story, we also discovered other open issues about stalled/stuck consumers observed with 1.4.2 at https://github.com/edenhill/librdkafka/issues/2933, https://github.com/edenhill/librdkafka/issues/2944 and https://github.com/edenhill/librdkafka/issues/3082. As far as we have been able to see, these might be related to changing IP addresses while performing rolling updates on Kafka instances within a Kubernetes cluster. Our thoughts on this: Just don't do that ;]!

amotl commented 4 years ago

JFYI; I also just shared this at https://github.com/Azure/azure-functions-kafka-extension/issues/185.

edenhill commented 4 years ago

Wow, that's an amazing write-up and RCA, @amotl ! Thank you for sharing your findings with the community.

amotl commented 4 years ago

Dear Magnus,

thank you for your comment, appreciate it. Thanks also for just releasing librdkafka 1.5.2 GA incorporating all the recent improvements in this regard and beyond.

Keep up the spirit and with kind regards, Andreas.

amotl commented 4 years ago

Hi again,

as I realized this might not have been entirely obvious from my comments above, I wanted to add some more details regarding the recommended configuration settings when running on Azure, as also outlined at https://github.com/Azure/azure-functions-kafka-extension/issues/187.

Microsoft published them at [1], but I specifically wanted to point out that the configuration properties socket.keepalive.enable and metadata.max.age.ms are apparently crucial to get right in Azure environments, in order to save users from running into the networking issues reported in this discussion.

Property	Recommended Values	Permitted Range	Notes
socket.keepalive.enable	true		Necessary if connection is expected to idle. Azure will close inbound TCP idle > 240,000 ms.
metadata.max.age.ms	~ 180000	< 240000	Can be lowered to pick up metadata changes sooner.

While these settings [1] are primarily dedicated to Azure Event Hubs, I believe they will also apply to all communications with vanilla Kafka server components, as the underlying networking infrastructure problem will be the same.

We are now running with these settings and are happy so far:

socket.keepalive.enable = true
metadata.max.age.ms = 180000
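
As a small sanity check, the two settings can be validated against the 240,000 ms idle limit before they are handed to a client. The sketch below uses a plain dictionary keyed by the librdkafka property names; no Kafka client is imported, and the function name and messages are made up for illustration.

```python
AZURE_LB_IDLE_TIMEOUT_MS = 240_000  # 4 minutes, as stated in the table above

config = {
    "socket.keepalive.enable": True,
    "metadata.max.age.ms": 180_000,
}


def check_azure_idle_settings(conf, lb_timeout_ms=AZURE_LB_IDLE_TIMEOUT_MS):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    if not conf.get("socket.keepalive.enable", False):
        problems.append("socket.keepalive.enable should be true")
    max_age = conf.get("metadata.max.age.ms")
    if max_age is None or max_age >= lb_timeout_ms:
        problems.append(f"metadata.max.age.ms should be set below {lb_timeout_ms}")
    return problems


if __name__ == "__main__":
    print(check_azure_idle_settings(config) or "settings look fine")
```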

Thanks for listening and with kind regards, Andreas.

[1] https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-configurations#librdkafka-configuration-properties

amotl commented 4 years ago

Hi again,

I quickly wanted to share some more observations on this topic. On our Azure environment, we still saw some partitions occasionally stalling on the consumer side, even after applying all of the mitigations outlined above. However, after just moving on to v1.6.0-PRE3, things appear to be really smooth now.

Kudos to @edenhill, @mhowlett and all people involved who added some recent improvements to librdkafka!

With kind regards, Andreas.

edenhill commented 3 years ago

Added connections.max.idle.ms to allow the client to close idle connections before the LB does.
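
The idea behind such a client-side idle limit can be sketched as follows: the client tracks when each connection was last used and closes those that have been idle longer than a limit chosen below the load balancer's timeout. This is an illustration only, not how librdkafka implements it; the function names and the 30-second safety margin are assumptions.

```python
def pick_max_idle_ms(lb_timeout_ms=240_000, margin_ms=30_000):
    """Choose a client-side idle limit safely below the LB timeout (assumed margin)."""
    if margin_ms >= lb_timeout_ms:
        raise ValueError("margin must be smaller than the LB timeout")
    return lb_timeout_ms - margin_ms


def connections_to_close(last_activity_ms, now_ms, max_idle_ms):
    """Return the names of connections idle for at least max_idle_ms, sorted."""
    return sorted(
        name
        for name, last in last_activity_ms.items()
        if now_ms - last >= max_idle_ms
    )


if __name__ == "__main__":
    limit = pick_max_idle_ms()
    # Timestamps in milliseconds; hypothetical connection names.
    last_seen = {"bootstrap": 0, "broker-1": 200_000, "broker-2": 100_000}
    print(limit, connections_to_close(last_seen, now_ms=250_000, max_idle_ms=limit))
```

Closing the connection on the client's own terms means the next request opens a fresh connection instead of timing out on one that the LB has already dropped.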

koushikchitta commented 3 years ago

@edenhill When is the 1.8 release likely to happen? We are waiting for the "connections.max.idle.ms" change.

edenhill commented 3 years ago

It is soak testing now, if all goes well we'll release within a week.

koushikchitta commented 3 years ago

@edenhill Is v1.8.0-RC2 the release candidate that is in soak testing?

edenhill commented 3 years ago

@koushikchitta Yep, we'll be releasing it this week.

robsonpeixoto commented 2 years ago

@edenhill, in some systems it is very important to keep latency low. Is there an option to renew the connection after connections.max.idle.ms, instead of only closing it?

@amotl, your investigation was amazing, thanks!

justinmchase commented 10 months ago

Just to be clear, I am interpreting these notes as the following client configuration. I would love to know if there are any other recommended settings to help connect to Azure Event Hubs.

using Confluent.Kafka;

var config = new ClientConfig
{
    SaslMechanism = SaslMechanism.Plain,
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslUsername = credentials.Username,
    SaslPassword = credentials.Password,
    BootstrapServers = $"{broker}:9093",

    // According to this documentation, these two settings are also critical for
    // azure environments
    // https://github.com/confluentinc/librdkafka/issues/3109#issuecomment-714471123
    SocketKeepaliveEnable = true,
    MetadataMaxAgeMs = 30000,

    // note: This needs to timeout sooner than the azure eventhub load balancer timeout
    // to avoid unexpected disconnects from the server after long periods of time
    // https://stackoverflow.com/questions/58010247/azure-eventhub-kafka-org-apache-kafka-common-errors-timeoutexception-for-some-of/58385324#58385324
    ConnectionsMaxIdleMs = ((60 * 4) - 30) * 1000 // 3m30s
};
