StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Timeout unreasonably longer than AsyncTimeout #2477

Open rayao opened 1 year ago

rayao commented 1 year ago

Recently our Azure service went through an incident where a Redis instance was down. During the downtime, our Redis client threw RedisTimeoutException with unreasonably long elapsed times, like this:

Failed to make redis cache call due to exception: StackExchange.Redis.RedisTimeoutException, Timeout awaiting response (outbound=2KiB, inbound=14KiB, 20969ms elapsed, timeout is 2000ms), command=SETEX, next: UNWATCH, inst: 0, qu: 152, qs: 0, aw: True, bw: WritingMessage, rs: ReadAsync, ws: Flushed, in: 0, last-in: 1, cur-in: 0, sync-ops: 16661, async-ops: 2226416, serverEndpoint: 104.42.147.56:15000, conn-sec: 45102.45, mc: 1/1/0, mgr: 10 of 10 available, clientName: RDA085FC4D4170(SE.Redis-v2.6.96.30123), PerfCounterHelperkeyHashSlot: 7889, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=6,Free=32761,Min=320,Max=32767), POOL: (Threads=12,QueuedItems=0,CompletedItems=47866509), v: 2.6.96.30123

On one of our VMs we saw 102550 timeout exceptions in total: 92% of them took longer than 2 seconds, 80% > 10 seconds, 72% > 20 seconds, 33% > 80 seconds, and I can even see 977 exceptions with an elapsed time > 120 seconds. This left thousands of requests stuck in the Processing stage and led to some other very bad situations. I want to know whether this is avoidable, and if so, how to configure the connection or the client library.
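For context, the 2000ms in that message comes from the timeouts in our client configuration, which looks roughly like this (a simplified sketch; the endpoint and exact values are placeholders, not our production config):

```csharp
// Simplified sketch of our client setup; endpoint and values are placeholders.
using System;
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:15000,ssl=true,abortConnect=false");
options.AsyncTimeout = 2000;   // ms - matches the "timeout is 2000ms" in the error above
options.SyncTimeout = 2000;    // ms
options.ConnectTimeout = 5000; // ms

using var muxer = await ConnectionMultiplexer.ConnectAsync(options);
IDatabase db = muxer.GetDatabase();

// The failing command above was SETEX; this issues the equivalent write with an expiry.
await db.StringSetAsync("some-key", "some-value", expiry: TimeSpan.FromMinutes(5));
```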

NickCraver commented 1 year ago

Do any of your other exceptions show higher thread counts? My best guess here is that under load you may have exhausted the thread pool causing timers not to fire for an extended duration (or you had a lot of timers ahead of us, preventing the next from firing?) - the trends for the POOL section above across your error history would be enlightening there.

rayao commented 1 year ago

@NickCraver During the entire incident the thread count stayed below 100, so it doesn't look like thread pool exhaustion to me. In our own code the timer creation rate is pretty low, roughly 1 or 2 per 5 seconds. I collected some data here, hope that helps. We did close & reconnect the Redis connection during the timeout burst; every VM here reconnected once, which is when the avg/max elapsed time dropped.

[image]
The VM which has max queued: [image]
Another VM: [image]
Another VM which has relatively less queued: [image]
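The reconnect is roughly this pattern (an illustrative sketch, not our exact code; the endpoint and class name are placeholders):

```csharp
// Illustrative sketch of the close-and-reconnect we performed; endpoint/names are placeholders.
using System;
using System.Threading;
using StackExchange.Redis;

static class RedisConnection
{
    private static Lazy<ConnectionMultiplexer> _lazy = CreateLazy();

    private static Lazy<ConnectionMultiplexer> CreateLazy() =>
        new(() => ConnectionMultiplexer.Connect("contoso.redis.cache.windows.net:15000,ssl=true,abortConnect=false"));

    public static IDatabase Db => _lazy.Value.GetDatabase();

    // Called once per VM during the timeout burst: swap in a fresh connection, then drop the stuck one.
    public static void ForceReconnect()
    {
        Lazy<ConnectionMultiplexer> old = Interlocked.Exchange(ref _lazy, CreateLazy());
        if (old.IsValueCreated)
        {
            old.Value.Dispose(); // closes the old multiplexer and its sockets
        }
    }
}
```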

sabaton1111 commented 1 year ago

@NickCraver We are experiencing something similar every day:

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=3KiB, 5125ms elapsed, timeout is 5000ms), command=HMGET, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 7693, last-in: 0, cur-in: 0, sync-ops: 2, async-ops: 297, serverEndpoint: servername.redis.cache.windows.net:6380, conn-sec: 67.51, mc: 1/1/0, mgr: 10 of 10 available, clientName: clname(SE.Redis-v2.6.96.30123), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=11,QueuedItems=22,CompletedItems=2557), v: 2.6.96.30123 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Redis metrics from Azure App Insights: [image]

Maybe we should reconfigure something in our Redis?

rayao commented 1 year ago

@sabaton1111 It seems to me that your minimum worker thread count (WORKER Min=1 in that message) is too low?
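Something like this early in startup might help (a rough sketch; the right numbers depend on your workload):

```csharp
// Rough sketch: raise the thread pool minimums as early as possible in startup (e.g. Program.cs).
using System;
using System.Threading;

ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
Console.WriteLine($"current minimums: worker={minWorker}, iocp={minIocp}");

// Exact numbers depend on your workload; the point is to avoid the slow thread-injection
// delay when a burst arrives and the minimum is 1.
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);
```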

sabaton1111 commented 1 year ago

@rayao Thank you, I will give it a try. :)

rayao commented 1 year ago

@NickCraver do you think those statistics explain the delay? What could be done on the consuming side to make the situation better?

NickCraver commented 1 year ago

@rayao I'm trying to reason about this - how many are in the bw: WritingMessage state when you see timeouts? I wonder if you're in a case where we can send but not receive for some reason and the backlog is happily busy purging the queue downstream, not assessing timeouts because it's continually active...but this is purely a theory.

I'm looking at adding timer counts for .NET Core to timeout messages in the next release, so we can rule that out more easily.

NickCraver commented 1 year ago

To eliminate the timer overload cause, I'm adding this to logging in #2500.

rayao commented 1 year ago

@NickCraver I can't query the logs anymore; the 5/30 logs have already been purged for being too old.

vladislav-karamfilov commented 1 year ago

To eliminate the timer overload cause, I'm adding this to logging in #2500.

@NickCraver, what is the expected count of timers? We observe some timeout exceptions from the last 3 days and all of them have >= 23 timers (1 of them even has 409). Is that too many, or should we look for the issue elsewhere?

Here's an example of failed HGETALL from a collection that had 2 (short string) elements max at the time of the exception:

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5297ms elapsed, timeout is 5000ms), command=HGETALL, next: HGETALL asterisk-call-timers, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 121753, serverEndpoint: <server>, conn-sec: 43573.57, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: Public000000(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=2142304,Timers=27), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

stambunan commented 1 year ago

@NickCraver We've also recently updated and are seeing large numbers of timers on timeout exceptions. We use Polly to retry when a request fails, and I'm wondering if this could be the reason our Timers count is so high?

Sample exception:

Redis intermittent exception: StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 16250ms elapsed, timeout is 10000ms), command=LRANGE, next: LRANGE value, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 33, last-in: 2, cur-in: 0, sync-ops: 637400, async-ops: 126159, serverEndpoint: redis-endpoint, conn-sec: 220544.69, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: workername(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=12,Free=32755,Min=10,Max=32767), POOL: (Threads=15,QueuedItems=2203,CompletedItems=199785096,Timers=29004), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) 

NickCraver commented 1 year ago

@vladislav-karamfilov Not worried about that timer count, on yours I'd check SLOWLOG on server-side as I don't see anything standing out there.
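If it's easier than redis-cli, something along these lines should pull the same data from the client side (a rough sketch, assuming I'm remembering the IServer slowlog API correctly; endpoint/credentials are placeholders):

```csharp
// Rough sketch: read the server's SLOWLOG from the client (same data as "SLOWLOG GET 10" in redis-cli).
using System;
using StackExchange.Redis;

using var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true,password=<secret>");
foreach (var endpoint in muxer.GetEndPoints())
{
    IServer server = muxer.GetServer(endpoint);
    CommandTrace[] slow = await server.SlowlogGetAsync(10); // 10 most recent slowlog entries
    foreach (var entry in slow)
    {
        Console.WriteLine($"{entry.Duration.TotalMilliseconds:F1}ms  {string.Join(" ", entry.Arguments)}");
    }
}
```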

NickCraver commented 1 year ago

@stambunan whoa that's a lot of timers, holy crap. I can't say why they're so high (given the volume, I'd try a memory dump at any busy period and you'll probably get lucky with some volume in play). Normally we see this with things like spinning up a timer per command for timeouts or some such wrapping somewhere, so it could be your retry or could be something else.

Please note that we see some thread elevation, queued items, and sync operations against Redis, all pointing to some amount of thread pool starvation. Combined with really high timers, I'm not surprised you'd see some contention/timeouts under load. Recommendations would be: find out what those timers are, trim down, and eliminate sync ops to Redis.
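To illustrate what I mean, a purely hypothetical sketch (not from your codebase; WaitAsync assumes .NET 6+): the kind of per-command wrapping that inflates the Timers count, next to the plain async alternative.

```csharp
// Purely illustrative; IDatabase here is StackExchange.Redis.
using System;
using System.Threading;
using System.Threading.Tasks;
using StackExchange.Redis;

static async Task<RedisValue> GetWithManualTimeoutAsync(IDatabase db, RedisKey key)
{
    // Each CancellationTokenSource created with a delay (and each Task.Delay) registers a timer,
    // so doing this per command at high volume keeps thousands of timers alive at once.
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
    return await db.StringGetAsync(key).WaitAsync(cts.Token); // WaitAsync requires .NET 6+
}

static RedisValue GetBlocking(IDatabase db, RedisKey key)
{
    // Sync calls like this block pool threads and show up as sync-ops in the timeout message.
    return db.StringGet(key);
}

static async Task<RedisValue> GetAsync(IDatabase db, RedisKey key)
{
    // Prefer staying async end-to-end and letting the library's own AsyncTimeout handle timing out.
    return await db.StringGetAsync(key);
}
```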

vladislav-karamfilov commented 1 year ago

@vladislav-karamfilov Not worried about that timer count, on yours I'd check SLOWLOG on server-side as I don't see anything standing out there.

Almost all of the commands in the SLOWLOG output are not from our app. They are: setclientaddr, INFO replication, INFOINTERNAL timings, ManagementCommand ACK, ManagementCommand NO ONE. I suspect the issue is internal to the Azure resource, because a couple of days ago I noticed spikes of cache reads at roughly 1-hour intervals; I did a reboot and the issue was gone. But even without the spikes now, we still observe timeouts. Any suggestions on how to proceed from here?

philon-msft commented 1 year ago

@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts

stambunan commented 1 year ago

Thanks @NickCraver. I think we found a different issue that was causing the active timer count to be so high. Now when we see timeouts the count is in the 200s, but we do use a couple of timers in our code, which might be the reason for that.

vladislav-karamfilov commented 1 year ago

@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts

This is our server load graph: [image]

We made a lot of improvements to decrease the server load (smaller values, fewer connections, an increased worker pool, optimized requests) at the end of June and the beginning of July, and it seemed that we had fixed the issue around July 5th (https://github.com/StackExchange/StackExchange.Redis/issues/2486#issuecomment-1626900693). There were no timeout exceptions until July 25th, and then all of a sudden timeout exceptions started to appear in our Application Insights logs again. On July 23rd we had scheduled server maintenance (confirmed by following the steps from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts#server-maintenance). I think the new wave of timeout exceptions is related to this maintenance event. As I wrote in my last comment, last week I also tried a reboot that seemed to fix the issue for a couple of days, but after that we still observe timeouts.

dstoynov-suzu commented 1 year ago

I can confirm that about two months ago we started getting these kinds of errors in bursts:

Timeout awaiting response (outbound=2KiB, inbound=513KiB, 6062ms elapsed, timeout is 5000ms), command=HMGET, next: HMGET XXXXXXXXXXXXX, inst: 0, qu: 34, qs: 53, aw: True, bw: WritingMessage, rs: DequeueResult, ws: Writing, in: 0, in-pipe: 526205, out-pipe: 0, last-in: 0, cur-in: 526205, sync-ops: 0, async-ops: 3676273, serverEndpoint: rd.redis.cache.windows.net:6379, conn-sec: 93130.31, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: RD0003FF552664(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=280,Max=1000), WORKER: (Busy=40,Free=32727,Min=280,Max=32767), POOL: (Threads=40,QueuedItems=32,CompletedItems=28313302,Timers=46), v: 2.6.122.38350

I've tried all the suggestions I've managed to find: upgrading the Redis instance from C2 to C3 (for more bandwidth), increasing the minimum worker count (280, as you can see), and increasing the app instance count so there are more units to handle requests. So far nothing has changed; I still see the errors in the logs (no more, no less).

deivydas321 commented 1 year ago

We are experiencing these timeouts while the server load is < 10%.

Timeout awaiting response (outbound=0KiB, inbound=4KiB, 844ms elapsed, timeout is 500ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 1, cur-in: 0, sync-ops: 0, async-ops: 5423477, serverEndpoint: xxx.redis.cache.windows.net:6380, conn-sec: 81518.1, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 514f69175ded(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=22,QueuedItems=10,CompletedItems=116322835,Timers=137), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Is there any documentation of what Timers means? In my case it is Timers=141; can it be related to the timeouts I'm experiencing?
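For reference, here's how I'm sampling timer and thread pool counts on my side while investigating (a rough sketch; I'm assuming the reported Timers value corresponds to Timer.ActiveCount, which requires .NET Core 3.0+):

```csharp
// Rough sketch: periodically log the runtime's own counters for comparison.
using System;
using System.Threading;

// Note: this probe timer itself counts as one active timer.
var probe = new Timer(_ =>
{
    Console.WriteLine(
        $"timers={Timer.ActiveCount}, poolThreads={ThreadPool.ThreadCount}, queued={ThreadPool.PendingWorkItemCount}");
}, null, dueTime: TimeSpan.Zero, period: TimeSpan.FromSeconds(5));

Console.ReadLine(); // keep the probe alive while observing
```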

Magnum1945 commented 9 months ago

@NickCraver We are experiencing timeouts sporadically. We get them in RedisHealthChecks for GetDatabase().PingAsync().ConfigureAwait(false) and server.PingAsync().ConfigureAwait(false). The only strange thing I can see is the large number of async-ops.

"Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5228ms elapsed, timeout is 5000ms), command=PING, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, last-in: 4, cur-in: 0, sync-ops: 4, async-ops: 670885, serverEndpoint: ***.redis.cache.windows.net:6380, conn-sec: 273596.92, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: ***, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=32332523,Timers=7), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)"

Can you explain what can be the issue here?

NickCraver commented 9 months ago

@Magnum1945 How often do you see this? e.g. could it be when the servers are patched every ~6 weeks, or what kind of sporadic are we talking? Nothing in the message seems out of the ordinary with load or volume; it all looks pretty reasonable overall, so we're likely looking at something external.

Magnum1945 commented 9 months ago

@NickCraver Here are the error dates for our US region: Dec 06, Dec 15, Jan 09, Jan 11, Jan 14. Each time the errors last for around 15 minutes. The hours when the issue occurs differ every time, so there is no pattern. According to our DevOps team, there was no patching at those times.

philon-msft commented 9 months ago

The 15-minute duration sounds like you may be experiencing stalled sockets. I'd recommend upgrading StackExchange.Redis to v2.7.17 to get "Fix #2595: Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments" (#2610 by @NickCraver). That will help recover connections faster (in less than a minute rather than 15), but it won't eliminate the connection drops themselves. They seem to be occurring more frequently than can be explained by Azure Redis server maintenance; most likely there's another problem somewhere in your app environment causing Redis connections to be lost.

Magnum1945 commented 9 months ago

Thanks @philon-msft. I will upgrade StackExchange.Redis to the latest stable version as well as increase the minimum worker thread value. I still don't know why the problem is happening, though - not all the errors last for 15 minutes. Currently I'm thinking about bandwidth issues on the client side or between the client and Redis, but I'm not sure that's the case in my situation.

Shumaister commented 9 months ago

Hello there,

Here is the log we're getting. We also increased the minimum thread count, but see no difference:

StackExchange.Redis.RedisTimeoutException: Timeout performing HMGET (3ms), next: HMGET F2-Session 2646bd0e-6673-791f-5be1-49f54fdcefde, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: redis.fargate-services.local:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: ip-10-0-130-208, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=4,Free=32763,Min=200,Max=32767), v: 2.2.4.27433 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

NickCraver commented 8 months ago

@Shumaister What are your settings? That message indicates it timed out doing a multi-get in 3ms, on a version with only ~1,000ms fidelity in timeout checks. That looks to me like the timeout is set to 3ms rather than the 3 seconds you perhaps intended.
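To illustrate (a sketch only; the endpoint is taken from your message, and the values are what I'd guess you intended): the timeout settings are in milliseconds.

```csharp
// Sketch: timeouts are milliseconds, so "3" means 3ms, not 3 seconds.
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("redis.fargate-services.local:6379,abortConnect=false");
options.SyncTimeout = 3000;   // 3 seconds
options.AsyncTimeout = 3000;  // 3 seconds

// Equivalent connection-string form:
// "redis.fargate-services.local:6379,abortConnect=false,syncTimeout=3000,asyncTimeout=3000"
using var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```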

hacst commented 8 months ago

@philon-msft

They seem to be occurring more frequently than can be explained by Azure Redis server maintenance. Most likely, there's another problem somewhere in your app environment causing Redis connections to be lost.

Could you clarify whether a stalled socket is expected during regular Azure Redis server maintenance? It would seem very strange to me if regular maintenance didn't take care to close the sockets instead of just dropping them. However, we also see these stalls with Azure Cache for Redis from an AKS service at an unexpected frequency, so I am wondering.

philon-msft commented 8 months ago

Routine Azure Redis server maintenance closes connections gracefully to avoid stalled sockets, but in some cases maintenance on underlying Azure infrastructure (networking or compute) can cause connections to be interrupted without graceful closure, leading to stalled sockets on clients. Client apps running in Kubernetes often experience more connection issues like this because there are more networking pieces involved, like the Kubernetes nodes which undergo maintenance themselves. Sidecars like Istio or Envoy can also disrupt Redis connections if they haven't been configured properly to ignore/allow all Redis traffic. We have some best practices documented here: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-kubernetes