StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Managing Redis Connection Loss on Azure Linux App Service #2595

Closed FPierrel closed 6 months ago

FPierrel commented 8 months ago

Hello,

We are experiencing undetected connection losses with Azure Cache for Redis once or twice a month in our .NET 6 application, hosted with Docker on an Azure Linux App Service. This leads to RedisTimeoutExceptions, and the number of clients connected to the Redis server drops sharply. The problem occurs at the same time on about half of our instances, and all applications hosted on the same instance experience it at the same time.

The problem resolves itself after 15 minutes. We believe this is due to the TCP setting net.ipv4.tcp_retries2 (see #1848).

We understand that the origin of these errors is external to our application, but the frequency with which they occur surprises us. Is it normal for this to happen so regularly?

How can this problem be corrected or mitigated? We've tried updating the library to the latest version and enabling the ForceReconnect pattern via the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch, with no effect.
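
For reference, this is roughly how we enable the switch at startup (a simplified sketch; the connection string name and Program.cs layout are illustrative):

```csharp
// Program.cs (illustrative layout). The switch must be set before the Redis
// cache services are used so that the ForceReconnect behavior is picked up.
AppContext.SetSwitch(
    "Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect",
    isEnabled: true);

var builder = WebApplication.CreateBuilder(args);

// "Redis" is an illustrative connection string name.
builder.Services.AddStackExchangeRedisCache(options =>
    options.Configuration = builder.Configuration.GetConnectionString("Redis"));

var app = builder.Build();
app.Run();
```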


Here is an example of a stack trace when this happens:

Timeout awaiting response (outbound=0KiB, inbound=15KiB, 5984ms elapsed, timeout is 5000ms), command=HMGET, next: EXPIRE ecentred119300e-be34-276a-2255-d347328600d2, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 223371, async-ops: 753868, serverEndpoint: xxxx:6380, conn-sec: 8596.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 3a734420707e(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=250,Max=1000), WORKER: (Busy=3,Free=32764,Min=250,Max=32767), POOL: (Threads=19,QueuedItems=0,CompletedItems=9254999,Timers=256), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) 

StackExchange.Redis.RedisTimeoutException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAndRefreshAsync>d__39.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache+<GetAsync>d__27.MoveNext (Microsoft.Extensions.Caching.StackExchangeRedis, Version=8.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.AspNetCore.Session.DistributedSession+<CommitAsync>d__31.MoveNext (Microsoft.AspNetCore.Session, Version=6.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60)

Here's the evolution of the number of clients connected to the Redis server: [chart: connected clients over time]

NickCraver commented 8 months ago

cc @philon-msft for similar reports of the 15 minute issue here

philon-msft commented 8 months ago

UseForceReconnect is designed to help in this situation, but it may not be getting triggered by the RedisTimeoutExceptions thrown in this scenario. RedisConnectionExceptions or SocketExceptions would trigger it, but when commands are simply timing out while the socket is stalled, only RedisTimeoutExceptions are thrown.
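
For anyone hand-rolling the pattern, it looks roughly like this (a simplified sketch, not the built-in UseForceReconnect code; the endpoint is a placeholder). In this situation you could choose to call OnError from catch blocks for RedisTimeoutException as well:

```csharp
using System;
using StackExchange.Redis;

// Rough sketch of a hand-rolled force-reconnect helper: recreate the
// multiplexer when errors have persisted for a while, throttled so we
// don't reconnect in a tight loop.
public static class ForceReconnect
{
    private static readonly object _lock = new();
    private static Lazy<ConnectionMultiplexer> _muxer = CreateMultiplexer();
    private static DateTimeOffset _firstError = DateTimeOffset.MinValue;
    private static DateTimeOffset _lastReconnect = DateTimeOffset.MinValue;

    private static readonly TimeSpan ReconnectErrorThreshold = TimeSpan.FromSeconds(30);
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);

    public static IDatabase GetDatabase() => _muxer.Value.GetDatabase();

    // Call from catch blocks for RedisConnectionException / SocketException
    // (and, in this scenario, RedisTimeoutException).
    public static void OnError()
    {
        var now = DateTimeOffset.UtcNow;
        lock (_lock)
        {
            if (_firstError == DateTimeOffset.MinValue)
            {
                _firstError = now; // start the error window
                return;
            }

            if (now - _firstError < ReconnectErrorThreshold) return; // errors not persistent yet
            if (now - _lastReconnect < ReconnectMinInterval) return; // reconnected recently

            var old = _muxer;
            _muxer = CreateMultiplexer();
            _firstError = DateTimeOffset.MinValue;
            _lastReconnect = now;

            if (old.IsValueCreated)
            {
                try { old.Value.Dispose(); } catch { /* best effort */ }
            }
        }
    }

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer() =>
        new(() => ConnectionMultiplexer.Connect(
            "contoso.redis.cache.windows.net:6380,ssl=true,abortConnect=false"));
}
```

The thresholds are deliberately conservative: reconnecting tears down a multiplexed connection shared by the whole process, so you only want to do it when errors are clearly persistent.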

@mgravell am I understanding the UseForceReconnect code correctly? Any ideas for detecting/replacing stalled connections faster in this situation?

FPierrel commented 8 months ago

That's how it seemed to us. Out of desperation, we modified the IDistributedCache implementation to take RedisTimeoutExceptions into account (roughly as sketched below), but this is obviously not a long-term solution. Luckily, we never encounter any timeouts during normal operation.
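
Concretely, the workaround is roughly the following (simplified; only GetAsync is shown, and ForceReconnect.OnError stands for a force-reconnect helper like the sketch above):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using StackExchange.Redis;

// Simplified decorator over the registered RedisCache. On a
// RedisTimeoutException it asks the force-reconnect helper to recycle the
// connection, then rethrows so callers still observe the failure.
public sealed class ReconnectingDistributedCache : IDistributedCache
{
    private readonly IDistributedCache _inner;

    public ReconnectingDistributedCache(IDistributedCache inner) => _inner = inner;

    public async Task<byte[]?> GetAsync(string key, CancellationToken token = default)
    {
        try
        {
            return await _inner.GetAsync(key, token);
        }
        catch (RedisTimeoutException)
        {
            ForceReconnect.OnError(); // helper from the sketch above
            throw;
        }
    }

    // The remaining IDistributedCache members delegate to _inner the same way.
    public byte[]? Get(string key) => _inner.Get(key);
    public void Set(string key, byte[] value, DistributedCacheEntryOptions options) => _inner.Set(key, value, options);
    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default) => _inner.SetAsync(key, value, options, token);
    public void Refresh(string key) => _inner.Refresh(key);
    public Task RefreshAsync(string key, CancellationToken token = default) => _inner.RefreshAsync(key, token);
    public void Remove(string key) => _inner.Remove(key);
    public Task RemoveAsync(string key, CancellationToken token = default) => _inner.RemoveAsync(key, token);
}
```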

Couldn't there be a problem with the library itself? We also have Java applications using Jedis, not on the same App Service Plan but connected to the same Redis server, and the problem never occurs there.

NickCraver commented 7 months ago

The OS is telling us the socket is fine, and that's the main problem here. But we also know that commands are timing out (due to no responses) and that we haven't read anything at all off the socket. I'm taking a stab at handling this better so that we recover much faster (within 4 timeouts) over in #2610.
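
The rough idea, paraphrased here rather than the actual code in #2610:

```csharp
using System;

// Conceptual paraphrase of the heuristic, not the actual code in #2610:
// if several consecutive commands time out while zero bytes have been read
// from the socket, treat the connection as stalled and tear it down
// proactively instead of waiting for the OS (net.ipv4.tcp_retries2) to give up.
const int StallThreshold = 4;
var consecutiveTimeoutsWithNoReads = 0;

// Simulate four command timeouts during which nothing was read off the socket:
for (var i = 0; i < StallThreshold; i++)
    OnCommandTimeout(bytesReadSinceLastTimeout: 0);

void OnCommandTimeout(long bytesReadSinceLastTimeout)
{
    if (bytesReadSinceLastTimeout > 0)
    {
        // Data is flowing; the socket is slow, not stalled.
        consecutiveTimeoutsWithNoReads = 0;
        return;
    }

    if (++consecutiveTimeoutsWithNoReads >= StallThreshold)
    {
        Console.WriteLine("Stalled connection detected; reconnecting proactively.");
        consecutiveTimeoutsWithNoReads = 0;
        // A real implementation would drop and re-establish the physical connection here.
    }
}
```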

Other libraries open a new connection per command or take other approaches (slower, with different trade-offs), or maybe they have handling for this - not sure, but I'm seeing if we can make the multiplexed case better there.

philon-msft commented 7 months ago

Update: a new version, 2.7.10, has been released, including #2610 to detect and recover from stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.

FPierrel commented 7 months ago

Good news; we will install this version in our test environment with tracing enabled.

@philon-msft I guess it's better to disable the Microsoft.AspNetCore.Caching.StackExchangeRedis.UseForceReconnect switch?

philon-msft commented 7 months ago

I'd recommend leaving ForceReconnect enabled for an additional layer of protection.

kalaumapathy commented 5 months ago

Is anyone facing any issues with https://github.com/StackExchange/StackExchange.Redis/releases/tag/2.7.10? We need the fix for the 15-minute saga but are concerned about load and other issues. Please advise @philon-msft @NickCraver

NickCraver commented 5 months ago

@kalaumapathy There's a null ref we fixed in 2.7.17, but it was not new - just a fix for a rare use-after-dispose case a few users were hitting. I'd have no reservations about 2.7.x, and advise using the latest.

kalaumapathy commented 5 months ago

> @kalaumapathy There's a null ref we fixed in 2.7.17, but it was not new - just a fix for a rare use-after-dispose case a few users were hitting. I'd have no reservations about 2.7.x, and advise using the latest.

Thank you @NickCraver. We will do a load test before sending it to prod and will report here if we see scale issues. Also, how could we simulate this in a lower environment to test these events?

1. We would like to log all the maintenance events from Microsoft and validate that this fix helps us recover from them rather than waiting for 15 minutes.
2. Try removing a shard from an existing setup, or removing/swapping a replica node, and see how that behaves under load.
3. What other scenarios would you recommend validating to confirm the fix is working, and what other areas should we keep an eye on?
4. We would also like to use the newly introduced logging feature (see the sketch after this list for what we have in mind for points 1 and 4).
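
Roughly what we are considering for points 1 and 4 (a sketch, assuming ConfigurationOptions.LoggerFactory and the ServerMaintenanceEvent event are the right hooks; the endpoint and console logger are illustrative):

```csharp
using System;
using Microsoft.Extensions.Logging;
using StackExchange.Redis;

// Wire up internal library logging (assuming the ConfigurationOptions.LoggerFactory
// hook in recent 2.x releases) and log server maintenance notifications.
using var loggerFactory = LoggerFactory.Create(b => b.AddConsole());

var options = ConfigurationOptions.Parse(
    "contoso.redis.cache.windows.net:6380,ssl=true,abortConnect=false");
options.LoggerFactory = loggerFactory;

var muxer = await ConnectionMultiplexer.ConnectAsync(options);

// Log maintenance events as they arrive so we can correlate them with timeouts.
muxer.ServerMaintenanceEvent += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} maintenance event: {e}");
```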