
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/product
Apache License 2.0

Networking robustness and resiliency on Azure and beyond (AWS, GCP, AliCloud) #10779

Closed amotl closed 2 years ago

amotl commented 3 years ago

Hi there,

first things first: Thanks for the tremendous amount of work you are putting into CrateDB. You know who you are.

Introduction

This is not meant to be a specific bug report, but a general heads-up and RCA, as we believe the issues we have been experiencing when connecting to CrateDB within Azure environments are worth sharing with the community.

We wanted to share our findings as a wrap-up and future reference for others running into similar issues. In that spirit, apologies for not completing the checklist; the issue can well be closed right away.

The topics span cloud/managed networking problems in general, as well as CrateDB specifics and the corresponding behavior of its client drivers. In particular, these networking problems can happen in any environment where connections go through a NAT gateway.

So, here we go.

General research

Azure LB closing idle network connections

The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.

The LB does not even bother to send RST packets to either of the communication partners, so client and server sockets will most probably try to reuse these dead connections.

In turn, services will hit individual socket timeouts, or the kernel will keep retransmitting with backoff for another 15+ minutes until it considers the connection to be dead.
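To put the "15+ minutes" into perspective, here is a small back-of-the-envelope sketch of the kernel's retransmission schedule. It assumes the Linux defaults of net.ipv4.tcp_retries2 = 15, an initial retransmission timeout (RTO) of 200 ms, and an RTO cap of 120 s; the actual duration depends on the measured round-trip time.

# Rough model of how long Linux keeps retransmitting on a silently dropped
# connection: exponential backoff, capped at TCP_RTO_MAX (120 s), for
# tcp_retries2 = 15 retransmissions.
RETRIES = 15          # net.ipv4.tcp_retries2 default
RTO = 0.2             # assumed initial RTO in seconds (RTT dependent)
RTO_MAX = 120.0       # TCP_RTO_MAX

total, rto = 0.0, RTO
for _ in range(RETRIES):
    total += rto                  # wait before the next retransmission
    rto = min(rto * 2, RTO_MAX)   # exponential backoff with cap
total += rto                      # final wait before giving up

print(f"connection declared dead after ~{total / 60:.1f} minutes")  # ~15.4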

Quotes

TL;DR: Azure has a nasty artificial limitation that results in being unable to use long-lived TCP connections that have >= 4 minutes of radio silence at any given point.

They screwed it up so hard that when a connection does time out, they acknowledge the following TCP packets with an ok flag that makes the sender think “everything is okay - the data I sent was received successfully”, which is a 100 % unacceptable way to handle error conditions.

This caused me so much pain and loss of productive work time.

-- https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/


With kind regards, Andreas.

amotl commented 3 years ago

Elsewhere on Kafka

On https://github.com/edenhill/librdkafka/issues/3109, we already outlined similar observations with Apache/Confluent Kafka. There are also some discussions there about how these problems were solved on the client side within librdkafka, which are worth a read.

Quotes

"The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in."

-- https://stackoverflow.com/a/58385324

Magnus Edenhill (@edenhill) says at https://github.com/edenhill/librdkafka/issues/2845#issuecomment-621778813:

Joakim, we've seen a couple of similar reports for users on Azure and we can't really provide an explanation, something stalls the request/response until it times out, and we believe this to be outside the client, so most likely something in Azure. I recommend opening an issue with the Azure folks.

Joakim Blach Andersen (@JoakimBlach) says at https://github.com/edenhill/librdkafka/issues/2845#issuecomment-624492858:

I got an answer from Azure:

The service closes idle connections (idle here means no request received for 10 minutes). It happens even when tcp keep-alive is enabled because that config only keeps the connection open but does not generate protocol requests. In your case, you have only observed the error on the idle event hub, it may be related to the idle connection being closed and the client SDK does not handle that correctly (Kafka).

Magnus Edenhill (@edenhill) says at https://github.com/confluentinc/confluent-kafka-dotnet/issues/1305#issuecomment-641092856:

We're trying to work around these Azure weak idle disconnects in the upcoming v1.5.0 release by reusing the least idle connection for things like metadata requests, which should keep that connection alive and not cause these idle disconnect request timeouts.

amotl commented 3 years ago

Mitigations

Introduction

Where applicable, Microsoft recommends using TCP keepalives to reset the outbound idle timeout; see also https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection#idletimeout.

Examples

Within https://github.com/edenhill/librdkafka/issues/3109#issuecomment-709508504, we outlined how these issues might be mitigated by applying appropriate TCP keepalive settings at the host OS level.

If that is not possible, the improvement https://github.com/crate/crate-python/pull/374 by @chaudum unlocks the option to achieve the same at the application level for the Python client (available in crate-python 0.26.0 and later).

Linux Kernel TCP settings

In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" issue at the host level.

These settings can probably be applied to the nodes (VMs) as well as to the Kubernetes pods/containers, e.g. through /etc/sysctl.conf or sysctl -w.

# Start sending keepalive probes after 2 minutes of idle time ...
net.ipv4.tcp_keepalive_time = 120
# ... then probe every 30 seconds ...
net.ipv4.tcp_keepalive_intvl = 30
# ... and consider the connection dead after 8 unanswered probes.
net.ipv4.tcp_keepalive_probes = 8
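
To verify that these values are actually in effect, for example from inside a Kubernetes pod, they can be read back from /proc. A minimal sketch, Linux only:

# Print the effective TCP keepalive settings of the running kernel.
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        print(name, "=", f.read().strip())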

Application settings for Python

from crate import client

# Enable TCP keepalive on the driver's sockets (crate-python 0.26.0 and later).
connection = client.connect(
    "https://localhost:4200/",
    socket_keepalive=True,
    socket_tcp_keepidle=120,
    socket_tcp_keepintvl=30,
    socket_tcp_keepcnt=8,
)

-- https://crate.io/docs/python/en/latest/connect.html#socket-options
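
For applications that do not use crate-python but need to survive the same NAT idle timeouts, the equivalent settings can usually be applied directly on the socket. A minimal sketch with Python's standard socket module, using the same values as above (the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options are Linux-specific, and host/port are placeholders):

import socket

# Start probing after 120 s of idle time, probe every 30 s,
# and declare the connection dead after 8 unanswered probes.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 8)
sock.connect(("localhost", 4200))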


amotl commented 3 years ago

Elsewhere on the Cloud

Following up on https://github.com/cloudfoundry/guardian/issues/70 and https://github.com/cloudfoundry/guardian/issues/165, I would also like to quote some valuable comments from others.

@arjenw says at https://github.com/cloudfoundry/guardian/issues/165#issuecomment-581978432:

Java only added the option to configure socket keepalive options in Java 11 (which is still relatively recent). That means that before Java 11 there was no way to configure keepalive on a Socket. As a result, most libraries doing some socket stuff have not exposed this option. This also affects, for example, JDBC drivers. They are usually compiled on an earlier Java version to make them compatible with most relevant JREs.

The point is that the assumption that you can change it in your application itself basically prevents all Java applications from dropping connections quickly when the connection is broken. This assumption therefore also prevents proper failover in most Java applications.

Ideally indeed this should be handled by the app, but in reality in a lot of cases you still can't, as the socket is hidden by a library. The only way to configure it then is by changing the OS keepalive settings.

@krumts says at https://github.com/cloudfoundry/guardian/issues/165#issuecomment-583469575:

On each of the various cloud providers (AWS, Azure, GCP, ...), the outbound connectivity has some constraints, which are imposed by various infrastructure components we use.

To be very specific, let's stick to a NAT Gateway:

  • Idle timeout: On Azure (at present) there is an idle timeout of 240s (not configurable), on AWS it is 350s (not configurable), on GCP it is something different but can be configured, on AliCloud it is 900s (not configurable).
  • NAT behavior after a timeout: On Azure - depending on the LB/NAT type used - packets may be silently dropped, or answered with a RST. On AWS, the NAT will send a RST.

Let's consider an application which uses a connection pool. It creates a connection, uses it, puts it into the pool and takes it again from there after 400s.

If we run the same app on a CF installation in each of these infrastructures, we would end up [with these behaviors]:

  • On AliCloud, the connection would still be established and working.
  • On GCP, it depends on what we configured.
  • On AWS, the application will get a connection reset immediately when it tries to use the connection and may need to retry.
  • On Azure, if the packets are silently dropped, it would take up to 15+ minutes until the connection is considered dead. In the meanwhile, the kernel would be doing retransmissions with backoff.

[...]

Modifying the TCP parameters is hard:

  • It can't be done by kernel settings in the container, because it needs privileged access.
  • Setting such parameters on the socket level is often not an option. Applications use a lib, which uses a lib, which uses a lib, ..., which uses a lib which has access to the socket, e.g. socket <- jdbc <- OR Mapper <- API XYZ.
  • But first of all, people have to fail, then go through the painful analysis, and only afterwards start looking for options.
mfussenegger commented 2 years ago

Thanks for gathering the information

I'm closing this as there is no further actionable item here, and the information can still be found via search. A follow-up could be to integrate some of this into the documentation?

amotl commented 1 year ago

Also applied on CrateDB Cloud.