StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Connection does not re-establish for 15 minutes when running on Linux #1848

Closed: bcage29 closed this issue 2 years ago

bcage29 commented 3 years ago

To simulate a network failure we reboot both the primary and replica nodes in an Azure Cache for Redis instance and have found that the library reacts differently based on the host it is deployed to.

Application

Expected Result

  1. Both nodes go down at the same time (or within a small time window).
  2. The application will report StackExchange.Redis.RedisConnectionException exceptions.
  3. The nodes will restart and be available approximately 1 minute after they go down.
  4. The library will reconnect approximately 1 minute after the nodes went down.

Windows & Docker on Windows Result

The application reconnects approximately 1 minute after the nodes went down as expected.
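
For context, one way to observe this reconnect timing from the client side is to log the multiplexer's connection events. This is a minimal sketch rather than the actual test harness used above; the endpoint and password are placeholders:

```csharp
using System;
using System.Diagnostics;
using StackExchange.Redis;

// abortConnect=false keeps the multiplexer retrying in the background after a failure.
var options = ConfigurationOptions.Parse(
    "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True,abortConnect=False");
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
var downSince = new Stopwatch();

muxer.ConnectionFailed += (_, e) =>
{
    if (!downSince.IsRunning) downSince.Restart();
    Console.WriteLine($"Connection failed: {e.EndPoint} ({e.FailureType})");
};
muxer.ConnectionRestored += (_, e) =>
{
    Console.WriteLine($"Connection restored: {e.EndPoint} after {downSince.Elapsed}");
    downSince.Reset();
};
```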

Error:

StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: SET N4BDN; It was not possible to connect to the redis server(s). There was an authentication failure; check that passwords (or client certificates) are configured correctly. ConnectTimeout, mc: 1/1/0, mgr: 10 of 10 available, clientName: 02cbef6fa5b6, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=1,Free=32766,Min=200,Max=32767), v: 2.2.62.27853
       ---> StackExchange.Redis.RedisConnectionException: It was not possible to connect to the redis server(s). There was an authentication failure; check that passwords (or client certificates) are configured correctly. ConnectTimeout
         --- End of inner exception stack trace ---
         at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2802
      --- End of stack trace from previous location ---

Load Test Result

[screenshot: load test results (dockerWin10)]

Linux Result

The application throws TimeoutExceptions and does not reconnect for 15 minutes.

Error:

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5570ms elapsed, timeout is 5000ms), command=SET, next: SET FAO1X, inst: 0, qu: 0, qs: 12, aw: False, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: <instancename>.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: SandboxHost-637654330433879470, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=2,Free=32765,Min=200,Max=32767), v: 2.2.62.27853 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Load Test Result

[screenshot: load test results (linuxContainer)]

Observations

Questions

Referenced Issues

#1782

#1822

botinko commented 3 years ago

Looks like we faced the same issue. After migrating our infrastructure to Linux and .NET 5, we started getting the following problem: some instances of the application hang on Redis operations for about 15 minutes, then recover without any external intervention. We are using client version 1.2.6.

We have two application dumps from two different instances in this state. They contain sensitive data, so I cannot share them, but I can share some of the data needed for an investigation.

[chart: Started Requests Count over time]

The drop in Started Requests Count in the middle is just a deployment.

botinko commented 3 years ago

We were able to reproduce the problem under controlled conditions. We will try to check how net.ipv4.tcp_retries2 affects client behavior.

philon-msft commented 3 years ago

Connection stalls lasting for 15 minutes like this are often caused by very optimistic default TCP settings in some Linux distros (confirmed on CentOS so far). When a server stops responding without gracefully closing the connection, the client TCP stack will continue retransmitting packets for 15 minutes before declaring the connection dead and allowing the StackExchange.Redis reconnect logic to kick in.

With Azure Cache for Redis, it's fairly easy to reproduce this by rebooting nodes as mentioned above. In this case, the machine goes down abruptly and the Redis server isn't able to transmit a FIN packet to the client. The client TCP stack continues retransmitting on the same socket hoping the server will come back up. Even when the node has rebooted and come back, it has no record of that connection so it continues ignoring the client. If the client gave up and created a NEW connection, it would be able to resume communication with the server much sooner than 15 minutes.

As you found, there are TCP settings you can change on the client machine to force it to timeout the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed here: https://github.com/lettuce-io/lettuce-core/issues/1428#issuecomment-699992158. It should be safe to reduce these timeouts to more realistic durations machine-wide unless you have systems that actually depend on the unusually long retransmits.
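
As a client-side complement to those OS-level settings (not a replacement, since the kernel is still the one retransmitting on the stalled socket), StackExchange.Redis exposes its own keepalive and connection settings. A minimal sketch with illustrative values; the endpoint and password are placeholders:

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse(
    "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True");
options.AbortOnConnectFail = false; // keep retrying in the background instead of failing fast
options.KeepAlive = 10;             // ping the server every 10 seconds to keep sockets active
options.ConnectTimeout = 5000;      // ms to wait when establishing a connection
options.ConnectRetry = 3;           // connection attempts during the initial Connect

var muxer = ConnectionMultiplexer.Connect(options);
```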

An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it because an overloaded server can also result in persistent RedisTimeoutExceptions. Recreating connections in that situation can cause additional server load and a cascade failure.
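
For reference, a simplified sketch of that ForceReconnect pattern (the full version lives in the Azure best-practices samples; the thresholds, connection string, and class name here are illustrative placeholders):

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisConnection
{
    private const string ConnectionString =
        "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True,abortConnect=False";

    private static Lazy<ConnectionMultiplexer> _lazyConnection = CreateMultiplexer();
    private static long _lastReconnectTicks = DateTimeOffset.MinValue.UtcTicks;
    private static DateTimeOffset _firstErrorTime = DateTimeOffset.MinValue;
    private static readonly object _reconnectLock = new object();

    // Don't reconnect more often than this, and only after errors have persisted for a while.
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);
    private static readonly TimeSpan ReconnectErrorThreshold = TimeSpan.FromSeconds(30);

    public static ConnectionMultiplexer Connection => _lazyConnection.Value;

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer() =>
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(ConnectionString));

    // Call from catch blocks for RedisConnectionException (and, cautiously, RedisTimeoutException).
    public static void ForceReconnect()
    {
        var utcNow = DateTimeOffset.UtcNow;
        var lastReconnect = new DateTimeOffset(Interlocked.Read(ref _lastReconnectTicks), TimeSpan.Zero);
        if (utcNow - lastReconnect < ReconnectMinInterval)
            return; // reconnected recently; don't thrash a possibly overloaded server

        lock (_reconnectLock)
        {
            utcNow = DateTimeOffset.UtcNow;
            if (_firstErrorTime == DateTimeOffset.MinValue)
            {
                _firstErrorTime = utcNow; // first error since the last reconnect: start the clock
                return;
            }

            if (utcNow - _firstErrorTime < ReconnectErrorThreshold)
                return; // errors haven't persisted long enough to justify a reconnect

            _firstErrorTime = DateTimeOffset.MinValue;

            var old = _lazyConnection;
            _lazyConnection = CreateMultiplexer();
            Interlocked.Exchange(ref _lastReconnectTicks, utcNow.UtcTicks);

            try { if (old.IsValueCreated) old.Value.Close(); }
            catch { /* the stale connection may already be broken; ignore */ }
        }
    }
}
```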

Unfortunately there's not much the StackExchange.Redis library can do about this situation, because the Linux TCP stack is hiding the lost connection. Detecting the stall at the library level would require making assumptions that would almost certainly lead to false positives in some scenarios. Instead, it's better for the client application to implement some detection/reconnection logic based on what it knows about its load and latency patterns.

debu99 commented 2 years ago

It is because a connection in the connection pool is invalid; check the keep-alive time setting.

bcage29 commented 2 years ago

Closing this issue since I think this thread documents the problem and how to remedy it when encountered.

ShaneCourtrille commented 2 years ago

Currently, those of us using Azure App Service Linux containers cannot adjust these values, as we can neither pass parameters to the docker run command nor run in privileged mode. One of these is required to modify underlying settings such as tcp_retries2.

philon-msft commented 2 years ago

@ShaneCourtrille fair point that the TCP configuration is not accessible in many client app environments. In those cases, it's best to implement a ForceReconnect pattern to detect and replace connections that have stalled. You can find examples of the pattern in the quickstart samples here: https://github.com/Azure-Samples/azure-cache-redis-samples/tree/main/quickstart
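
If you're wiring that in by hand rather than copying the sample, the call sites end up looking something like this sketch (it assumes the hypothetical RedisConnection helper sketched earlier in this thread):

```csharp
using System;
using StackExchange.Redis;

public static class CacheOps
{
    // Trigger ForceReconnect on connection failures and, per the guidance above,
    // optionally on timeouts as well, then let the caller retry the operation.
    public static string? GetString(string key)
    {
        try
        {
            return RedisConnection.Connection.GetDatabase().StringGet(key);
        }
        catch (Exception ex) when (ex is RedisConnectionException || ex is RedisTimeoutException)
        {
            RedisConnection.ForceReconnect();
            throw;
        }
    }
}
```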

ShaneCourtrille commented 2 years ago

@philon-msft Are you aware of any implementation of the ForceReconnect pattern when DistributedCache is in use? I opened this issue with them, but it's not going to be looked at for a while.

philon-msft commented 2 years ago

@ShaneCourtrille For DistributedCache it looks like any ForceReconnect pattern will need to be implemented in aspnetcore code, so the issue you opened is the right long-term approach.
In the short term, Azure Cache for Redis has released a fix to ensure that connections are closed gracefully by the server rather than dropped. That should help reduce the need for ForceReconnect if you're using Azure Redis.

mgravell commented 1 year ago

cross ref: https://github.com/dotnet/aspnetcore/pull/45261

adamyager commented 1 year ago

> @ShaneCourtrille For DistributedCache it looks like any ForceReconnect pattern will need to be implemented in aspnetcore code, so the issue you opened is the right long-term approach. In the short term, Azure Cache for Redis has released a fix to ensure that connections are closed gracefully by the server rather than dropped. That should help reduce the need for ForceReconnect if you're using Azure Redis.

@philon-msft are you of the opinion that the net.ipv4.tcp_retries2 setting is still needed in addition to https://github.com/dotnet/aspnetcore/pull/45261, or would the ForceReconnect pattern be an acceptable alternative?

In AKS, we would need to allow privileged containers to make this change, which goes against both what we really want to do and the Azure policy that is in place. We may also have workloads in those pods that would not benefit from the lower value, or for which it could cause issues.

I see both in the best practices guide, so I was not sure whether that was a consideration here.

Thanks for your insights!

philon-msft commented 1 year ago

@adamyager configuring net.ipv4.tcp_retries2 would be an additional layer of defense-in-depth, but I agree it's not a good fit in AKS. ForceReconnect alone would suffice, or you could simply have pods fail their health checks if they're experiencing persistent Redis failures. Then Kubernetes would replace them with fresh pods that should come up with healthy Redis connections. That's more heavyweight than ForceReconnect, but simpler to implement and better aligned with k8s.
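
For anyone taking that health-check route, here is a minimal sketch of what it could look like with ASP.NET Core health checks (the class name and probe path are placeholders; the probe's failureThreshold is what turns transient blips into a pod restart):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using StackExchange.Redis;

// Reports Unhealthy while Redis is unreachable; when wired to the liveness probe,
// repeated failures cause Kubernetes to replace the pod (and its connections).
public class RedisHealthCheck : IHealthCheck
{
    private readonly IConnectionMultiplexer _muxer;

    public RedisHealthCheck(IConnectionMultiplexer muxer) => _muxer = muxer;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await _muxer.GetDatabase().PingAsync();
            return HealthCheckResult.Healthy("Redis reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Redis unreachable", ex);
        }
    }
}

// Registration (Program.cs):
//   builder.Services.AddHealthChecks().AddCheck<RedisHealthCheck>("redis");
//   app.MapHealthChecks("/healthz");  // point the liveness probe here
```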

adamyager commented 1 year ago

> @adamyager configuring net.ipv4.tcp_retries2 would be an additional layer of defense-in-depth, but I agree it's not a good fit in AKS. ForceReconnect alone would suffice, or you could simply have pods fail their health checks if they're experiencing persistent Redis failures. Then Kubernetes would replace them with fresh pods that should come up with healthy Redis connections. That's more heavyweight than ForceReconnect, but simpler to implement and better aligned with k8s.

This is great context. I do wonder if this could be called out in the best practices guide; Azure Support is pointing us to it, and I think the answer is more nuanced. I would speculate that most Linux workloads in Azure run on AKS rather than Linux IaaS, and as was called out, Linux on App Service cannot have its TCP settings modified. So a large majority need a better solution. Super helpful insights!

philon-msft commented 1 year ago

@adamyager great point - I've created a work item to get the Azure Redis docs updated to include this suggestion for clients running in Kubernetes.

adamyager commented 1 year ago

@philon-msft This is super helpful, thank you. One other item: a peer of yours at Microsoft who is a heavy contributor to this client suggests that none of this is needed if we use a newer client SDK. At least that's how I read his post below. It would be great if all clients using these patterns could understand whether this can all be fixed with a newer client version: no TCP settings, no ForceReconnect in .NET, etc. Would you and he be able to connect and help me understand if that's true?

Here is what he says for reference.

https://github.com/dotnet/aspnetcore/pull/45261#issuecomment-1377591422

philon-msft commented 1 year ago

@adamyager It's true that with recent versions of StackExchange.Redis it's very rare for client apps to experience the types of hangs or stalls that require ForceReconnect to recover. And improvements in the Azure Redis service have made it unlikely that clients will experience stalled sockets where default net.ipv4.tcp_retries2 settings would delay recovery. Years ago, we hadn't flushed out all those issues, so we recommended ForceReconnect more strongly. However, "rare" doesn't mean "never": it's still possible for something like a dropped packet or an app bug to cause a Redis connection to get stuck in a bad state on a client instance. Depending on app architecture and appetite for risk, app owners can accept that possibility, or build in a "last resort" backstop to detect and recover automatically (like ForceReconnect). In cloud-native apps, it's more common to accept the risk and rely on higher level systems like pod cycling to handle any rare situations that make instances unhealthy for any reason.

adamyager commented 1 year ago

@philon-msft great context again, super helpful. We have a LOB app that has had a few outages that we now think are mostly on the Redis client side (StackExchange.Redis); they seem to come down to client-side configs, versions, etc. So I guess you could say they are on "our side." I have been asked to help after several failures.

We are a big Azure customer and speak to Product all the time on various topics, and I would really like a session with you on our experience. Azure Cache for Redis is one topic I think we can have a great discussion on that will benefit both parties. We did speak with Redis (the company) via our Azure account team, but that was a bit less helpful as they are more focused on Enterprise Edition. Thanks again and have a great weekend.

NickCraver commented 1 year ago

@adamyager can you shoot me a mail please to get in touch? nick craver @ microsoft without spaces yada yada :)

adamyager commented 1 year ago

@NickCraver thanks so much. I have sent an email and look forward to our chat. Have a great weekend

philon-msft commented 9 months ago

Update: a new version 2.7.10 of the library has been released, including a fix #2610 to detect and recover stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.