StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Connection does not re-establish for 15 minutes when running on Linux #1848

Closed: bcage29 closed this issue 2 years ago

bcage29 commented 3 years ago

To simulate a network failure we reboot both the primary and replica nodes in an Azure Cache for Redis instance and have found that the library reacts differently based on the host it is deployed to.

Application

Expected Result

  1. Both nodes go down at the same time (or within a small time window).
  2. The application will report StackExchange.Redis.RedisConnectionException exceptions.
  3. The nodes will restart and be available approximately 1 minute after they go down.
  4. The library will reconnect approximately 1 minute after the nodes went down.

Windows & Docker on Windows Result

The application reconnects approximately 1 minute after the nodes went down as expected.
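
For context, one way to observe this reconnect timing from the client side is to log the multiplexer's connection events. This is a minimal sketch rather than the actual test harness used above; the endpoint and password are placeholders:

```csharp
using System;
using System.Diagnostics;
using StackExchange.Redis;

// abortConnect=false keeps the multiplexer retrying in the background after a failure.
var options = ConfigurationOptions.Parse(
    "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True,abortConnect=False");
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
var downSince = new Stopwatch();

muxer.ConnectionFailed += (_, e) =>
{
    if (!downSince.IsRunning) downSince.Restart();
    Console.WriteLine($"Connection failed: {e.EndPoint} ({e.FailureType})");
};
muxer.ConnectionRestored += (_, e) =>
{
    Console.WriteLine($"Connection restored: {e.EndPoint} after {downSince.Elapsed}");
    downSince.Reset();
};
```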

Error:

StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: SET N4BDN; It was not possible to connect to the redis server(s). There was an authentication failure; check that passwords (or client certificates) are configured correctly. ConnectTimeout, mc: 1/1/0, mgr: 10 of 10 available, clientName: 02cbef6fa5b6, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=1,Free=32766,Min=200,Max=32767), v: 2.2.62.27853
       ---> StackExchange.Redis.RedisConnectionException: It was not possible to connect to the redis server(s). There was an authentication failure; check that passwords (or client certificates) are configured correctly. ConnectTimeout
         --- End of inner exception stack trace ---
         at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2802
      --- End of stack trace from previous location ---

Load Test Result

[screenshot: load test results (dockerWin10)]

Linux Result

The application throws TimeoutExceptions and does not reconnect for 15 minutes.

Error:

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5570ms elapsed, timeout is 5000ms), command=SET, next: SET FAO1X, inst: 0, qu: 0, qs: 12, aw: False, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: <instancename>.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: SandboxHost-637654330433879470, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=2,Free=32765,Min=200,Max=32767), v: 2.2.62.27853 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Load Test Result

[screenshot: load test results (linuxContainer)]

Observations

Questions

Referenced Issues

#1782

#1822

botinko commented 3 years ago

Looks like we faced the same issue. After migrating our infrastructure to Linux and .NET 5, we started getting the following problem: some instances of the application hang on Redis operations for about 15 minutes, then recover without any external intervention. We are using client version 1.2.6.

We have two application dumps from two different instances in this state. They contain sensitive data, so I cannot share them, but I can share some of the data needed for an investigation.

[chart: Started Requests Count over time]

The drop in Started Requests Count in the middle is just a deployment.

botinko commented 3 years ago

We were able to reproduce the problem under controlled conditions. We will try to check how net.ipv4.tcp_retries2 affects client behavior.

philon-msft commented 3 years ago

Connection stalls lasting for 15 minutes like this are often caused by very optimistic default TCP settings in some Linux distros (confirmed on CentOS so far). When a server stops responding without gracefully closing the connection, the client TCP stack will continue retransmitting packets for 15 minutes before declaring the connection dead and allowing the StackExchange.Redis reconnect logic to kick in.

With Azure Cache for Redis, it's fairly easy to reproduce this by rebooting nodes as mentioned above. In this case, the machine goes down abruptly and the Redis server isn't able to transmit a FIN packet to the client. The client TCP stack continues retransmitting on the same socket hoping the server will come back up. Even when the node has rebooted and come back, it has no record of that connection so it continues ignoring the client. If the client gave up and created a NEW connection, it would be able to resume communication with the server much sooner than 15 minutes.

As you found, there are TCP settings you can change on the client machine to force it to timeout the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed here: https://github.com/lettuce-io/lettuce-core/issues/1428#issuecomment-699992158. It should be safe to reduce these timeouts to more realistic durations machine-wide unless you have systems that actually depend on the unusually long retransmits.
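
As a client-side complement to those OS-level settings (not a replacement, since the kernel is still the one retransmitting on the stalled socket), StackExchange.Redis exposes its own keepalive and connection settings. A minimal sketch with illustrative values; the endpoint and password are placeholders:

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse(
    "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True");
options.AbortOnConnectFail = false; // keep retrying in the background instead of failing fast
options.KeepAlive = 10;             // ping the server every 10 seconds to keep sockets active
options.ConnectTimeout = 5000;      // ms to wait when establishing a connection
options.ConnectRetry = 3;           // connection attempts during the initial Connect

var muxer = ConnectionMultiplexer.Connect(options);
```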

An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it because an overloaded server can also result in persistent RedisTimeoutExceptions. Recreating connections in that situation can cause additional server load and a cascade failure.
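
For reference, a simplified sketch of that ForceReconnect pattern (the full version lives in the Azure best-practices samples; the thresholds, connection string, and class name here are illustrative placeholders):

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisConnection
{
    private const string ConnectionString =
        "<instancename>.redis.cache.windows.net:6380,password=<password>,ssl=True,abortConnect=False";

    private static Lazy<ConnectionMultiplexer> _lazyConnection = CreateMultiplexer();
    private static long _lastReconnectTicks = DateTimeOffset.MinValue.UtcTicks;
    private static DateTimeOffset _firstErrorTime = DateTimeOffset.MinValue;
    private static readonly object _reconnectLock = new object();

    // Don't reconnect more often than this, and only after errors have persisted for a while.
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);
    private static readonly TimeSpan ReconnectErrorThreshold = TimeSpan.FromSeconds(30);

    public static ConnectionMultiplexer Connection => _lazyConnection.Value;

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer() =>
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(ConnectionString));

    // Call from catch blocks for RedisConnectionException (and, cautiously, RedisTimeoutException).
    public static void ForceReconnect()
    {
        var utcNow = DateTimeOffset.UtcNow;
        var lastReconnect = new DateTimeOffset(Interlocked.Read(ref _lastReconnectTicks), TimeSpan.Zero);
        if (utcNow - lastReconnect < ReconnectMinInterval)
            return; // reconnected recently; don't thrash a possibly overloaded server

        lock (_reconnectLock)
        {
            utcNow = DateTimeOffset.UtcNow;
            if (_firstErrorTime == DateTimeOffset.MinValue)
            {
                _firstErrorTime = utcNow; // first error since the last reconnect: start the clock
                return;
            }

            if (utcNow - _firstErrorTime < ReconnectErrorThreshold)
                return; // errors haven't persisted long enough to justify a reconnect

            _firstErrorTime = DateTimeOffset.MinValue;

            var old = _lazyConnection;
            _lazyConnection = CreateMultiplexer();
            Interlocked.Exchange(ref _lastReconnectTicks, utcNow.UtcTicks);

            try { if (old.IsValueCreated) old.Value.Close(); }
            catch { /* the stale connection may already be broken; ignore */ }
        }
    }
}
```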

Unfortunately there's not much the StackExchange.Redis library can do about this situation, because the Linux TCP stack is hiding the lost connection. Detecting the stall at the library level would require making assumptions that would almost certainly lead to false positives in some scenarios. Instead, it's better for the client application to implement some detection/reconnection logic based on what it knows about its load and latency patterns.

debu99 commented 2 years ago

It is because a connection in the connection pool is invalid; check the keep-alive time setting.

bcage29 commented 2 years ago

Closing this issue since I think this thread documents the problem and how to remedy it when encountered.

ShaneCourtrille commented 2 years ago

Currently, those of us using Azure App Service Linux containers cannot adjust these values, as we can neither pass parameters to the docker run command nor run in privileged mode. One of these is required to modify underlying settings such as tcp_retries2.

philon-msft commented 2 years ago

@ShaneCourtrille fair point that the TCP configuration is not accessible in many client app environments. In those cases, it's best to implement a ForceReconnect pattern to detect and replace connections that have stalled. You can find examples of the pattern in the quickstart samples here: https://github.com/Azure-Samples/azure-cache-redis-samples/tree/main/quickstart
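
If you're wiring that in by hand rather than copying the sample, the call sites end up looking something like this sketch (it assumes the hypothetical RedisConnection helper sketched earlier in this thread):

```csharp
using System;
using StackExchange.Redis;

public static class CacheOps
{
    // Trigger ForceReconnect on connection failures and, per the guidance above,
    // optionally on timeouts as well, then let the caller retry the operation.
    public static string? GetString(string key)
    {
        try
        {
            return RedisConnection.Connection.GetDatabase().StringGet(key);
        }
        catch (Exception ex) when (ex is RedisConnectionException || ex is RedisTimeoutException)
        {
            RedisConnection.ForceReconnect();
            throw;
        }
    }
}
```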

ShaneCourtrille commented 2 years ago

@philon-msft Are you aware of any implementation of the ForceReconnect pattern when DistributedCache is in use? I opened this issue with them, but it's not going to be looked at for a while.

philon-msft commented 2 years ago

@ShaneCourtrille For DistributedCache it looks like any ForceReconnect pattern will need to be implemented in aspnetcore code, so the issue you opened is the right long-term approach.
In the short term, Azure Cache for Redis has released a fix to ensure that connections are closed gracefully by the server rather than dropped. That should help reduce the need for ForceReconnect if you're using Azure Redis.

mgravell commented 1 year ago

cross ref: https://github.com/dotnet/aspnetcore/pull/45261

adamyager commented 1 year ago

> @ShaneCourtrille For DistributedCache it looks like any ForceReconnect pattern will need to be implemented in aspnetcore code, so the issue you opened is the right long-term approach. In the short term, Azure Cache for Redis has released a fix to ensure that connections are closed gracefully by the server rather than dropped. That should help reduce the need for ForceReconnect if you're using Azure Redis.

@philon-msft are you of the opinion that the net.ipv4.tcp_retries2 setting is still needed in addition to https://github.com/dotnet/aspnetcore/pull/45261, or would the ForceReconnect pattern be an acceptable alternative?

In AKS, we would need to allow privileged containers to make this change, which goes against both what we really want to do and the Azure policy that is in place. We may also have workloads in those pods that would not benefit from the lower value, or for which it could cause issues.

I see both in the best practices guide, so I was not sure whether that was a consideration here.

Thanks for your insights!

philon-msft commented 1 year ago

@adamyager configuring net.ipv4.tcp_retries2 would be an additional layer of defense-in-depth, but I agree it's not a good fit in AKS. ForceReconnect alone would suffice, or you could simply have pods fail their health checks if they're experiencing persistent Redis failures. Then Kubernetes would replace them with fresh pods that should come up with healthy Redis connections. That's more heavyweight than ForceReconnect, but simpler to implement and better aligned with k8s.
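
For anyone taking that health-check route, here is a minimal sketch of what it could look like with ASP.NET Core health checks (the class name and probe path are placeholders; the probe's failureThreshold is what turns transient blips into a pod restart):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using StackExchange.Redis;

// Reports Unhealthy while Redis is unreachable; when wired to the liveness probe,
// repeated failures cause Kubernetes to replace the pod (and its connections).
public class RedisHealthCheck : IHealthCheck
{
    private readonly IConnectionMultiplexer _muxer;

    public RedisHealthCheck(IConnectionMultiplexer muxer) => _muxer = muxer;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await _muxer.GetDatabase().PingAsync();
            return HealthCheckResult.Healthy("Redis reachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Redis unreachable", ex);
        }
    }
}

// Registration (Program.cs):
//   builder.Services.AddHealthChecks().AddCheck<RedisHealthCheck>("redis");
//   app.MapHealthChecks("/healthz");  // point the liveness probe here
```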

adamyager commented 1 year ago

> @adamyager configuring net.ipv4.tcp_retries2 would be an additional layer of defense-in-depth, but I agree it's not a good fit in AKS. ForceReconnect alone would suffice, or you could simply have pods fail their health checks if they're experiencing persistent Redis failures. Then Kubernetes would replace them with fresh pods that should come up with healthy Redis connections. That's more heavyweight than ForceReconnect, but simpler to implement and better aligned with k8s.

This is great context. I do wonder if this could be called out in the best practices guide; Azure Support is pointing us to it, and I think the answer is more nuanced. I would speculate that most Linux workloads in Azure run on AKS rather than Linux IaaS, and as was called out, Linux on App Service cannot have its TCP settings modified. So a large majority need a better solution. Super helpful insights!

philon-msft commented 1 year ago

@adamyager great point - I've created a work item to get the Azure Redis docs updated to include this suggestion for clients running in Kubernetes.

adamyager commented 1 year ago

@philon-msft This is super helpful, thank you. One other item: a peer of yours at Microsoft who is a heavy contributor to this client suggests that none of this is needed if we use a newer client SDK. At least that's how I read his post below. It would be great if all clients using these patterns could understand whether this can all be fixed with a newer client version: no TCP settings, no ForceReconnect in .NET, etc. Would you and he be able to connect and help me understand if that's true?

Here is what he says for reference.

https://github.com/dotnet/aspnetcore/pull/45261#issuecomment-1377591422

philon-msft commented 1 year ago

@adamyager It's true that with recent versions of StackExchange.Redis it's very rare for client apps to experience the types of hangs or stalls that require ForceReconnect to recover. And improvements in the Azure Redis service have made it unlikely that clients will experience stalled sockets where default net.ipv4.tcp_retries2 settings would delay recovery. Years ago, we hadn't flushed out all those issues, so we recommended ForceReconnect more strongly. However, "rare" doesn't mean "never": it's still possible for something like a dropped packet or an app bug to cause a Redis connection to get stuck in a bad state on a client instance. Depending on app architecture and appetite for risk, app owners can accept that possibility, or build in a "last resort" backstop to detect and recover automatically (like ForceReconnect). In cloud-native apps, it's more common to accept the risk and rely on higher level systems like pod cycling to handle any rare situations that make instances unhealthy for any reason.

adamyager commented 1 year ago

@philon-msft great context again, super helpful. We have a LOB app that has had a few outages that we now think are mostly on the Redis client side (StackExchange.Redis); they seem to come down to client-side configs, versions, etc. So I guess you could say they are on "our side." I have been asked to help after several failures.

We are a big Azure customer and speak to Product all the time on various topics, and I would really like a session with you on our experience. Azure Cache for Redis is one topic I think we can have a great discussion on that will benefit both parties. We did speak with Redis (the company) via our Azure account team, but that was a bit less helpful as they are more focused on Enterprise Edition. Thanks again and have a great weekend.

NickCraver commented 1 year ago

@adamyager can you shoot me a mail please to get in touch? nick craver @ microsoft without spaces yada yada :)

adamyager commented 1 year ago

@NickCraver thanks so much. I have sent an email and look forward to our chat. Have a great weekend

philon-msft commented 9 months ago

Update: a new version 2.7.10 of the library has been released, including a fix #2610 to detect and recover stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.