StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.85k stars 1.5k forks source link

timeouts #2624

Open pianoman4873 opened 6 months ago

pianoman4873 commented 6 months ago

Hello, I'm getting hundreds of this timeout messages a day such as these : StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 101984ms elapsed, timeout is 30000ms), command=GET, next: INFO, inst: 0, qu: 0, qs: 46, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 2148, last-in: 1177, cur-in: 0, sync-ops: 258073, async-ops: 581672, serverEndpoint: 10.71.87.120:6379, conn-sec: 10759.49, aoc: 1, mc: 1/1/0, mgr: 9 of 10 available, clientName: IP-0A475528(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 1284, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=810,Free=31957,Min=16,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)\r\n at

StackExchange.Redis.RedisTimeoutException: The message timed out in the backlog attempting to send because no connection became available, command=GET, timeout: 30000, outbound: 0KiB, inbound: 1477KiB, inst: 12, qu: 85, qs: 0, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 45, cur-in: 0, sync-ops: 457947, async-ops: 956159, serverEndpoint: 10.71.86.100:6379, conn-sec: 17190.52, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A4755D1(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 435, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=456,Free=32311,Min=16,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Look at the machine - it's CPU & memory usage are well below 30%. No issues at the Redis servers ( hosted in AWS cluster ).

Unfortunately, can't get rid of the sync operations due to legacy code. Any other suggestion of how to tackle this or additional insights from the error message ?

Specifically for "The message timed out in the backlog attempting to send because no connection became available" - which connection are we talking about here ? doesn't StackExchange.Redis keep a single connection , and from mc:1/1/0 seems it's functionable and active, isn't it ?

Any help would be highly appreciated Thanks !

NickCraver commented 6 months ago

Overall, you're seeing thread starvation here: WORKER: (Busy=456,Free=32311,Min=16,Max=32767) - I know you said "Unfortunately, can't get rid of the sync operations due to legacy code." but...that's what need to happen if you want to solve this problem of thread exhaustion. The more you make async, the better it gets. I recommend taking memory dumps to see what your worst offenders are, maybe you can drastically improve the situation concentrating on specific hot paths causing the most blocking by volume.

pianoman4873 commented 6 months ago

OK, will try that. Thanks Is the message ""The message timed out in the backlog attempting to send because no connection became available" - also related and if so can you explain how please ? Also, how can you explain the low CPU usage ( below 30 % ) along side with the thread starvation ? Wouldn't you expect the CPU consumption to be higher ?

NickCraver commented 6 months ago

It's very likely the same issue, yes. CPU usage == time spent doing something. Thread starvation doesn't equate to that, it's time waiting (often idle) to get a thread. You can have thread starvation and 0% CPU for example.

.NET only allocates a new thread every 500ms, that's a relative eternity, so it takes a very long time to unblock if you're continuing to pile up while exhausted.

pianoman4873 commented 6 months ago

OK, Thanks a lot of your input. Maybe changing the thread allocation delay you mentioned could also be helpful? Also, and how should we regard the fact that "mgr: 10 of 10 available" - doesn't it mean the threads from this dedicated thread pool could be used? Thanks !

NickCraver commented 6 months ago

The dedicated thread pool exists so that you're not waiting on data from the socket - the client produces it as fast as possible for the main thread pool to deal with it. It does not handle your application code, only the socket and very specifically only that. Imagine for a moment if it was "just another 10 threads" to use for anything - that 456 used would be 466 used instead and you'd just fall over even quicker because lack of reads off the socket means earlier threads aren't completing and the spiral of death becomes much steeper.

On the other question: no you can't change the thread allocation delay. You can only set the number of threads created before that delay starts happening (ThreadPool.SetMinThreads), but please note that's a bandaid and not free: threads cost to startup and eat memory for their stack space. But sometimes, it's an appropriate bandaid to buy time to fix the real "needs async" issues.

pianoman4873 commented 6 months ago

Thanks for the thorough explanation, I understand your point but still need clarifying if you could about which part exactly fails when getting the " Timeout awaiting response (outbound=0KiB, inbound=0KiB, 101984ms elapsed, timeout is 30000ms), command=GET". Why do we see this timeout, assuming the Redis servers are responding timely ? I thought until now that this indicates the data wasn't read from the sockets. How exactly are the worker threads related to this ?

Thanks for helping me figuring this out :)