rayao opened this issue 1 year ago
Do any of your other exceptions show higher thread counts? My best guess here is that under load you may have exhausted the thread pool, causing timers not to fire for an extended duration (or you had a lot of timers ahead of us, preventing the next from firing?) - the trends for the POOL section above across your error history would be enlightening there.
@NickCraver During the entire incident the thread count stayed below 100, which suggests to me the thread pool was not exhausted. In our own code the timer creation rate is pretty low, likely 1 or 2 per 5 seconds. I collected some data here, hope that helps. We did close & reconnect the Redis connection during the timeout burst; every VM here reconnected once, which is when avg/max elapsed dropped.
[Charts: the VM with the most queued items; another VM; another VM with relatively fewer queued items]
@NickCraver We are experiencing something similar every day.
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=3KiB, 5125ms elapsed, timeout is 5000ms), command=HMGET, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 7693, last-in: 0, cur-in: 0, sync-ops: 2, async-ops: 297, serverEndpoint: servername.redis.cache.windows.net:6380, conn-sec: 67.51, mc: 1/1/0, mgr: 10 of 10 available, clientName: clname(SE.Redis-v2.6.96.30123), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=11,QueuedItems=22,CompletedItems=2557), v: 2.6.96.30123 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Redis metrics from Azure App Insights:
Maybe we should reconfigure something in our Redis?
@sabaton1111 It seems to me your minimum worker thread count is too low?
@rayao Thank you, I will give it a try. :)
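For anyone following along, a minimal sketch of raising the thread-pool minimums at app startup; the values below are illustrative assumptions, not a recommendation, and `SetMinThreads` should run before load ramps up:

```csharp
using System;
using System.Threading;

// Raise the minimum number of threads the pool will create on demand before
// applying its slower ramp-up heuristics. Size these to your real concurrency.
const int minWorkerThreads = 50; // illustrative value
const int minIoThreads = 50;     // illustrative value

if (!ThreadPool.SetMinThreads(minWorkerThreads, minIoThreads))
{
    // Returns false if a value is out of range (e.g. above the configured maximum).
    Console.WriteLine("Failed to update thread pool minimums");
}
```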
@NickCraver do you think those statistics explain the delay? What could be done on the consuming side to make the situation better?
@rayao I'm trying to reason about this - how many are in the `bw: WritingMessage` state when you see timeouts? I wonder if you're in a case where we can send but not receive for some reason, and the backlog is happily busy purging the queue downstream, not assessing timeouts because it's continually active...but this is purely a theory.
I'm looking at adding timer counts for .NET Core in the next release to timeout messages now, so we can rule that out easier.
To eliminate the timer overload cause, I'm adding this to logging in #2500.
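For anyone who wants to check this before that release: on .NET 6+ the process-wide active timer count is exposed directly, so it can be logged from a diagnostics endpoint or periodic task. A minimal sketch; the threshold is an arbitrary illustration:

```csharp
using System;
using System.Threading;

// .NET 6+ reports the number of currently active System.Threading.Timer
// instances in the process; a steadily climbing value suggests something is
// creating a timer per operation and not disposing it.
long activeTimers = Timer.ActiveCount;
Console.WriteLine($"Active timers: {activeTimers}");

if (activeTimers > 1000) // arbitrary illustrative threshold
{
    Console.WriteLine("Unusually high timer count - a memory dump can show who owns them.");
}
```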
@NickCraver I can't query the logs anymore; the logs from 5/30 have been purged for being too old.
To eliminate the timer overload cause, I'm adding this to logging in #2500.
@NickCraver, what is the expected count of timers? We have observed some timeout exceptions over the last 3 days and all of them show >= 23 timers (one of them even has 409). Is that too many, or should we seek the issue elsewhere?
Here's an example of a failed HGETALL from a collection that had at most 2 (short string) elements at the time of the exception:
Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5297ms elapsed, timeout is 5000ms), command=HGETALL, next: HGETALL asterisk-call-timers, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 121753, serverEndpoint: <server>, conn-sec: 43573.57, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: Public000000(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=2142304,Timers=27), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
@NickCraver, we've also recently updated and are seeing large numbers of timers on timeout exceptions. We use Polly to do retries when a request fails, and I'm wondering if this could be the reason our `Timers` count is so high?
Sample exception:
Redis intermittent exception: StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 16250ms elapsed, timeout is 10000ms), command=LRANGE, next: LRANGE value, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 33, last-in: 2, cur-in: 0, sync-ops: 637400, async-ops: 126159, serverEndpoint: redis-endpoint, conn-sec: 220544.69, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: workername(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=12,Free=32755,Min=10,Max=32767), POOL: (Threads=15,QueuedItems=2203,CompletedItems=199785096,Timers=29004), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
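For context on how a retry/timeout wrapper can inflate the timer count: a common pattern is to wrap every Redis call in a Polly per-call timeout plus retry. An optimistic timeout typically relies on CancellationTokenSource.CancelAfter, and each retry wait uses Task.Delay, and both allocate a System.Threading.Timer per invocation, so at high call volume the active-timer count climbs quickly. A hedged sketch of that pattern follows; the policy shape, endpoint, and key are assumptions, not the poster's actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;
using StackExchange.Redis;

// Hypothetical wrapper, placeholder endpoint.
var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
IDatabase db = muxer.GetDatabase();

// Each retry wait uses Task.Delay and each optimistic timeout uses
// CancellationTokenSource.CancelAfter - both create a timer per invocation.
var retry = Policy
    .Handle<RedisTimeoutException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * attempt));

var perCallTimeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(2), TimeoutStrategy.Optimistic);

var policy = Policy.WrapAsync(retry, perCallTimeout);

// One or more short-lived timers per call adds up quickly under load,
// which would show up as a large Timers value in the POOL section.
RedisValue value = await policy.ExecuteAsync(
    ct => db.StringGetAsync("some-key"), // StackExchange.Redis ops don't take a token, so ct is unused here
    CancellationToken.None);

Console.WriteLine(value);
```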
@vladislav-karamfilov Not worried about that timer count, on yours I'd check SLOWLOG on server-side as I don't see anything standing out there.
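If it helps anyone following along, the SLOWLOG can also be pulled through the client itself rather than redis-cli. A minimal sketch using IServer, assuming my reading of the CommandTrace members is right; the connection string and count are placeholders:

```csharp
using System;
using StackExchange.Redis;

// Placeholder connection string - substitute your own cache and credentials.
var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true,password=<secret>");
IServer server = muxer.GetServer(muxer.GetEndPoints()[0]);

// Fetch the most recent slow log entries (SLOWLOG GET 10 under the hood).
CommandTrace[] entries = await server.SlowlogGetAsync(10);
foreach (CommandTrace entry in entries)
{
    Console.WriteLine($"{entry.Time:u} {entry.Duration.TotalMilliseconds}ms {string.Join(" ", entry.Arguments)}");
}
```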
@stambunan whoa that's a lot of timers, holy crap. I can't say why they're so high (given the volume, I'd try a memory dump at any busy period and you'll probably get lucky with some volume in play). Normally we see this with things like spinning up a timer per command for timeouts or some such wrapping somewhere, so it could be your retry or could be something else.
Please note that we see some thread elevation, queued items, and sync operations against Redis, all pointing to some amount of thread pool starvation. Combined with really high timers, I'm not surprised you'd see some contention/timeouts under load. Recommendations would be: find out what those timers are, trim down, and eliminate sync ops to Redis.
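On that last point, a minimal sketch of what "eliminate sync ops" means in practice: replace blocking calls with their Async counterparts and await them, so thread-pool threads aren't parked for the whole round trip. The endpoint and key are illustrative:

```csharp
using System;
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379"); // placeholder endpoint
IDatabase db = muxer.GetDatabase();

// Before: a synchronous call blocks a thread-pool thread for the whole round trip,
// which under load compounds thread-pool starvation.
// RedisValue value = db.StringGet("user:42:profile");

// After: the async call frees the thread while the response is in flight.
RedisValue value = await db.StringGetAsync("user:42:profile");
Console.WriteLine(value);

// The same applies to sync waits on tasks: avoid .Result / .Wait() on Redis calls.
```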
@vladislav-karamfilov Not worried about that timer count, on yours I'd check SLOWLOG on server-side as I don't see anything standing out there.
Almost all of the commands in the output of SLOWLOG are not from our app. They are: `setclientaddr`, `INFO replication`, `INFOINTERNAL timings`, `ManagementCommand ACK`, `ManagementCommand NO ONE`. I suspect the issue is internal to the Azure resource, because a couple of days ago I noticed spikes of cache reads at approximately 1-hour intervals; I did a reboot and the issue was gone. But even without the spikes now, we still observe timeouts. Any suggestions on how to proceed from here?
@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts
Thanks @NickCraver. I think we found a different issue that caused the active timers to be so high. Now when we see timeouts the count is in the 200s, but we do use a couple of timers in our code, which might be the reason for this.
@vladislav-karamfilov it sounds like high Redis server load is causing commands to run slowly, causing even low-cost commands to appear in the SLOWLOG. I'd recommend investigating the high load - docs like this may be helpful: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts
This is our server load graph:
We made a lot of improvements to decrease the server load (smaller values, fewer connections, an increased worker pool, optimized requests) at the end of June and the beginning of July, and it seemed we had fixed the issue around July 5th (https://github.com/StackExchange/StackExchange.Redis/issues/2486#issuecomment-1626900693). There were no timeout exceptions until July 25th, and then all of a sudden timeout exceptions started to appear in our Application Insights logs again. On July 23rd we had scheduled server maintenance (confirmed by following the steps from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-troubleshoot-timeouts#server-maintenance). I think the new wave of timeout exceptions is related to this maintenance event. As I wrote in my last comment, last week I also tried a reboot that seemed to fix the issue for a couple of days, but after that we still observe timeout exceptions.
I can confirm that for roughly the last 2 months we have been getting these kinds of errors in bursts:
Timeout awaiting response (outbound=2KiB, inbound=513KiB, 6062ms elapsed, timeout is 5000ms), command=HMGET, next: HMGET XXXXXXXXXXXXX, inst: 0, qu: 34, qs: 53, aw: True, bw: WritingMessage, rs: DequeueResult, ws: Writing, in: 0, in-pipe: 526205, out-pipe: 0, last-in: 0, cur-in: 526205, sync-ops: 0, async-ops: 3676273, serverEndpoint: rd.redis.cache.windows.net:6379, conn-sec: 93130.31, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: RD0003FF552664(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=280,Max=1000), WORKER: (Busy=40,Free=32727,Min=280,Max=32767), POOL: (Threads=40,QueuedItems=32,CompletedItems=28313302,Timers=46), v: 2.6.122.38350
I've tried every suggestion I've managed to find - upgrading the Redis tier from C2 to C3 (increasing the bandwidth), increasing the minimum worker count (280, as you can see), and increasing the app instance count so there are more units to handle the requests. So far nothing has changed; I still see the errors (no more, no less).
We are experiencing these timeouts while the server load is < 10%.
Timeout awaiting response (outbound=0KiB, inbound=4KiB, 844ms elapsed, timeout is 500ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 1, cur-in: 0, sync-ops: 0, async-ops: 5423477, serverEndpoint: xxx.redis.cache.windows.net:6380, conn-sec: 81518.1, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 514f69175ded(SE.Redis-v2.6.122.38350), IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=22,QueuedItems=10,CompletedItems=116322835,Timers=137), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Is there any documentation of what `Timers` means? In my case it is Timers=141; can it be related to the timeouts I experience?
@NickCraver We are experiencing timeouts sporadically. We get them in RedisHealthChecks for `GetDatabase().PingAsync().ConfigureAwait(false)` and `server.PingAsync().ConfigureAwait(false)`. The only strange thing I can see is the large number of async-ops.
"Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5228ms elapsed, timeout is 5000ms), command=PING, next: INFO, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, last-in: 4, cur-in: 0, sync-ops: 4, async-ops: 670885, serverEndpoint: ***.redis.cache.windows.net:6380, conn-sec: 273596.92, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: ***, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=1,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=32332523,Timers=7), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)"
Can you explain what can be the issue here?
@Magnum1945 How often do you see this? e.g. could it be when the servers are patched every ~6 weeks, or what kind of sporadic are we talking? Nothing in the message seems out of the ordinary with load or volume; that all looks pretty reasonable, so we're likely looking at something external.
@NickCraver Here is the statistic of errors for our US region: Dec 06, Dec 15, Jan 09, Jan 11, Jan 14. Every time, the errors last for around 15 minutes. The hours when the issue occurs are different every time, so there is no pattern. According to our DevOps, there was no patching at that time.
The 15-minute duration sounds like you may be experiencing stalled sockets. I'd recommend upgrading StackExchange.Redis to v2.7.17 to get the fix for #2595: "Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments" (#2610 by @NickCraver). That will help recover connections faster, in less than a minute rather than 15, but won't eliminate the connection drops themselves. They seem to be occurring more frequently than can be explained by Azure Redis server maintenance. Most likely, there's another problem somewhere in your app environment causing Redis connections to be lost.
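In the meantime, one way to see how long those drops actually last (and correlate them with maintenance or network events) is to log the multiplexer's connection events. A minimal sketch; the connection string is a placeholder:

```csharp
using System;
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true,password=<secret>");

// Fired when a physical connection is lost; FailureType and the inner exception
// help distinguish socket failures from auth/protocol issues.
muxer.ConnectionFailed += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:u} FAILED {e.EndPoint} {e.ConnectionType} {e.FailureType}: {e.Exception?.Message}");

// Fired when the connection is re-established; the gap between the two events
// is how long the outage (or stalled socket) actually lasted.
muxer.ConnectionRestored += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:u} RESTORED {e.EndPoint} {e.ConnectionType}");
```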
Thanks @philon-msft. I will upgrade StackExchange.Redis to the latest stable version as well as increase the minimum worker thread value. Although I still don't know why the problem is happening - not all the errors last for 15 minutes. Currently I'm thinking about some bandwidth issue on the client side or between the client and Redis, but I'm not sure that's the case in my situation.
Hello there,
Here is our log. We increased the minimum thread count too, but see essentially no difference.
StackExchange.Redis.RedisTimeoutException: Timeout performing HMGET (3ms), next: HMGET F2-Session 2646bd0e-6673-791f-5be1-49f54fdcefde, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: redis.fargate-services.local:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: ip-10-0-130-208, IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=4,Free=32763,Min=200,Max=32767), v: 2.2.4.27433 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
@Shumaister What are your settings? That message indicates it timed out doing a multi-get in 3ms on a version with only ~1,000ms fidelity in timeout checks. That seems to me like it's set for 3ms, rather than 3 seconds as intended perhaps.
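For reference, the timeouts are specified in milliseconds, both in the connection string and on ConfigurationOptions, which is where a "3" instead of "3000" would come from. A minimal sketch, reusing the endpoint from the message above as a placeholder:

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "redis.fargate-services.local:6379" }, // placeholder endpoint
    SyncTimeout = 3000,   // milliseconds - "3" here would mean 3ms, not 3 seconds
    AsyncTimeout = 3000,  // milliseconds
    ConnectTimeout = 5000 // milliseconds
};

// Equivalent connection-string form:
// "redis.fargate-services.local:6379,syncTimeout=3000,asyncTimeout=3000,connectTimeout=5000"
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```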
@philon-msft
They seem to be occurring more frequently than can be explained by Azure Redis server maintenance. Most likely, there's another problem somewhere in your app environment causing Redis connections to be lost.
Could you clarify whether a stalled socket is expected during regular Azure Redis server maintenance? It would seem very strange to me if regular maintenance didn't take care to close the sockets rather than just dropping them. However, we also see these stalls with Azure Cache for Redis from an AKS service at an unexpected frequency, so I am wondering.
Routine Azure Redis server maintenance closes connections gracefully to avoid stalled sockets, but in some cases maintenance on underlying Azure infrastructure (networking or compute) can cause connections to be interrupted without graceful closure, leading to stalled sockets on clients. Client apps running in Kubernetes often experience more connection issues like this because there are more networking pieces involved, like the Kubernetes nodes which undergo maintenance themselves. Sidecars like Istio or Envoy can also disrupt Redis connections if they haven't been configured properly to ignore/allow all Redis traffic. We have some best practices documented here: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-kubernetes
Recently our Azure service encountered a Redis-instance-down incident. During the downtime, our Redis client threw RedisTimeoutException with unreasonably long elapsed times, like this:
On one of our VMs we saw 102,550 timeout exceptions in total: 92% of them took longer than 2 seconds, 80% > 10 seconds, 72% > 20 seconds, 33% > 80 seconds, and I can even see 977 exceptions with elapsed time > 120 seconds. This left thousands of requests stuck in the Processing stage, which further resulted in some other very bad situations. I want to know whether that's avoidable, and if so, how to configure the connection or the client library.
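One knob that may be relevant here, as I understand it: while the connection is down, commands sit in the client's backlog and are only failed when they are finally abandoned, which is how the observed elapsed time can far exceed the configured timeout. Recent versions expose a BacklogPolicy that fails commands immediately instead of queueing them while disconnected. A hedged sketch follows; the endpoint and timeout values are placeholders, and whether FailFast is appropriate depends on whether you'd rather fail promptly than wait out reconnects:

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "contoso.redis.cache.windows.net:6380" }, // placeholder
    Ssl = true,
    SyncTimeout = 5000,
    AsyncTimeout = 5000,

    // Default behavior queues commands while disconnected and retries them on
    // reconnect; FailFast throws immediately when there is no healthy connection,
    // so callers see prompt failures instead of multi-minute "timeouts".
    BacklogPolicy = BacklogPolicy.FailFast
};

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```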