StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Timeout Exception diagnosis #2486

Open pianoman4873 opened 1 year ago

pianoman4873 commented 1 year ago

Hello, could you please help diagnose the following timeout exception:

StackExchange.Redis.RedisTimeoutException: Timeout performing EXISTS (30000ms), next: INFO, inst: 61, qu: 0, qs: 294, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 95891, in-pipe: 0, out-pipe: 0, last-in: 58, cur-in: 0, sync-ops: 25616118, async-ops: 400996552, serverEndpoint: 10.71.88.247:6379, conn-sec: 128327.66, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475317(SE.Redis-v2.6.90.64945), PerfCounterHelperkeyHashSlot: 16275, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=173,Free=32594,Min=16,Max=32767), POOL: (Threads=191,QueuedItems=5686,CompletedItems=2991371194), v: 2.6.90.64945

What caught my eye is in: 95891 - I assume a lot of data is pending in the queue, so maybe that has to do with limited resources on the machine running the client? BTW - could this be somehow related to upgrading the OS on the client machines to Windows Server 2019 (it was 2012 before)? If you could add any more insights, that would be great.

We didn't see any increased load on the Redis servers that coincided with the exception time.

Thanks a lot !

NickCraver commented 1 year ago

The main problem you have here is thread pool exhaustion (POOL: (Threads=191,QueuedItems=5686,CompletedItems=2991371194)) - there are a lot of synchronous calls, as we see in sync-ops: 25616118, async-ops: 400996552, but there could be sync calls outside SE.Redis as well. In any case, when those happen you occupy a thread from the pool until some relatively very long operations complete... eventually exhausting the pool.
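
A minimal sketch of the pattern being described here - the class and method names are illustrative, not from the report above; only the StackExchange.Redis calls are real:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class CacheReader
{
    private readonly IDatabase _db;

    public CacheReader(IConnectionMultiplexer muxer) => _db = muxer.GetDatabase();

    // Anti-pattern: blocking on the async call (or calling the sync API from
    // pool threads) parks a thread-pool thread for the whole round trip.
    // Under load these blocked threads pile up, the pool starts queueing work
    // (the QueuedItems number in the error), and everything gets slower.
    public bool ExistsBlocking(string key) =>
        _db.KeyExistsAsync(key).GetAwaiter().GetResult();

    // Preferred: stay async end to end so the thread returns to the pool
    // while the EXISTS round trip is in flight.
    public Task<bool> ExistsAsync(string key) =>
        _db.KeyExistsAsync(key);
}
```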

If you take a memory dump under load, you can see where all those threads are stuck; it may be 1 or 2 hot spots worth making async ASAP.
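
If a dump isn't convenient, the same counters that surface in the exception as POOL: (Threads=..., QueuedItems=..., CompletedItems=...) can also be watched from inside the process; a rough sketch, assuming .NET Core 3.0+ where these ThreadPool properties exist:

```csharp
using System;
using System.Threading;

// Log the numbers that show up in the exception's POOL section.
// A steadily growing PendingWorkItemCount while ThreadCount climbs is the
// usual signature of pool threads blocked by sync-over-async code.
var timer = new Timer(_ => Console.WriteLine(
        $"Threads={ThreadPool.ThreadCount}, " +
        $"QueuedItems={ThreadPool.PendingWorkItemCount}, " +
        $"CompletedItems={ThreadPool.CompletedWorkItemCount}"),
    state: null, dueTime: TimeSpan.Zero, period: TimeSpan.FromSeconds(5));

// Keep the timer referenced for the lifetime of the app so it isn't collected.
```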

pianoman4873 commented 1 year ago

Thanks a lot Nick.

vladislav-karamfilov commented 1 year ago

To prevent a potential duplicate issue, I'll reuse this one to ask for help figuring out a (potentially) similar problem.

During the last 14 days we have observed an increased number of StackExchange.Redis.RedisTimeoutException errors. Here is an example:

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5453ms elapsed, timeout is 5000ms), command=HGET, next: HGET realtime-user-connections, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, last-in: 28, cur-in: 0, sync-ops: 0, async-ops: 65, serverEndpoint: <NAME>.redis.cache.windows.net:6380, conn-sec: 17390.36, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: <SERVER>(SE.Redis-v2.6.116.40240), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=2,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=270344), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

We are using Azure Cache for Redis. Some context:

- We haven't made any changes to this resource, and we haven't made any changes to how we access the data in Redis.
- The only related change we made during this period was updating the StackExchange.Redis package from 2.6.111 to 2.6.116, but there were timeouts both before and after the update, and reverting didn't fix the issue.
- We see a slight increase in the Redis server max load, but it's still well below 40-50% (with just a few spikes to 70-80%).
- The average Redis server load is < 2.5% at all times, but it also shows a slight increase over the last 14 days.

I realize that the issue is most probably the result of some internal change in the Azure Cache for Redis resource. @NickCraver, can you please point us to where we should look for the issue?

NickCraver commented 1 year ago

@vladislav-karamfilov I'd check SLOWLOG on the server in your case; is it possible that hash is quite a large payload?
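
For reference, SLOWLOG can be read either via redis-cli (SLOWLOG GET) or through the client's IServer API; a sketch, where the connection string and endpoint selection are placeholders:

```csharp
using System;
using System.Linq;
using StackExchange.Redis;

// Connection details are placeholders - substitute your cache name and password.
var muxer = await ConnectionMultiplexer.ConnectAsync(
    "<NAME>.redis.cache.windows.net:6380,ssl=true,password=<PASSWORD>");
var server = muxer.GetServer(muxer.GetEndPoints().First());

// SLOWLOG GET: the most recent commands the server considered slow,
// with their execution time and arguments.
foreach (var entry in server.SlowlogGet(20))
{
    Console.WriteLine($"{entry.Time:u} {entry.Duration.TotalMilliseconds:F1}ms " +
                      string.Join(" ", entry.Arguments));
}
```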

vladislav-karamfilov commented 1 year ago

@NickCraver, we have investigated with SLOWLOG and found that there were some commands with too many parameters, which we have split into multiple commands. We have also optimized the stored value sizes for all Redis collections (although the hashes where we store them are very small - values in them are constantly being added and then removed, so at any moment in time there are very few records). Finally, we have increased the number of worker threads that are created on demand without waiting, because we have seen some timeout logs where busy worker threads > min worker threads. We hope that this will fix the issue, because every other suggestion for improvement we found is already in place.
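
For anyone else landing here, that worker-thread change is the usual ThreadPool minimum bump at startup; a sketch, where the value 200 is illustrative rather than our actual setting:

```csharp
using System.Threading;

// Below the minimum, the pool creates worker threads on demand without waiting;
// above it, new threads are injected slowly (roughly one every 500 ms), which
// hurts when Busy suddenly exceeds Min. The 200 here is only an example value.
ThreadPool.GetMinThreads(out _, out var minIocp);
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: minIocp);
```

The same minimum can also be set without code via the System.Threading.ThreadPool.MinThreads setting in runtimeconfig.json, if I'm not mistaken.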

PS The exception message contains "mgr: 10 of 10 available", which at first I thought meant the threads in the dedicated thread pool that are free to do some work, but having looked at the source code I found that these are actually the threads that are actively doing some work. I know it's a nit, but the message is somewhat misleading.

NickCraver commented 1 year ago

Good to hear you found goodies!

> PS The exception message contains "mgr: 10 of 10 available", which at first I thought meant the threads in the dedicated thread pool that are free to do some work, but having looked at the source code I found that these are actually the threads that are actively doing some work. I know it's a nit, but the message is somewhat misleading.

This is incorrect; the message is correct. It does, however, look like the comment in Pipelines.Sockets.Unofficial is backwards... (cc @mgravell): https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/0438f06057b0fc2e4edb9d7b4d2d6019e2933261/src/Pipelines.Sockets.Unofficial/DedicatedThreadPoolPipeScheduler.cs#L147

vladislav-karamfilov commented 1 year ago

Yes, that's exactly where I found the information that the threads in the message are the ones that are busy with work, not the ones that are free to do some work. But if that's incorrect, why does the library use .NET's thread pool (WORKER: (Busy=4,Free=32766,Min=2,Max=32767)) instead of its own thread pool (mgr: 10 of 10 available) when the latter is completely free to do work? Shouldn't the dedicated thread pool be fully busy before switching to the .NET one?

NickCraver commented 1 year ago

These are for different things. The built-in thread pool is specifically for handling socket IO and making sure that stays flowing, which often unblocks some bad sync-over-async code upstream in applications calling us (preventing extended deadlocks that would otherwise occur with that pattern). It takes the data off the socket and hands the task to the .NET thread pool for completing whatever you want to do with it. It doesn't do all the work end to end; that would defeat the entire purpose of a dedicated thread pool for socket IO, because it'd be exhausted pretty much constantly.
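
To make the split concrete, the "mgr" figures come from the client's SocketManager; a rough sketch of how it's wired up, assuming a recent 2.x client where the SocketManager.ThreadPool option exists (the endpoint is a placeholder):

```csharp
using StackExchange.Redis;

// "mgr: N of 10 available" refers to this dedicated pool: its threads pull
// bytes off the sockets and hand the completions to the regular .NET thread
// pool, where the continuations of your awaited calls actually run.
var options = ConfigurationOptions.Parse("10.0.0.1:6379"); // endpoint placeholder
options.SocketManager = SocketManager.Shared;        // default: dedicated IO threads
// options.SocketManager = SocketManager.ThreadPool; // or use the .NET pool for IO as well
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```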

You're assuming that because something is in an error message to help us diagnose, we're using it directly. That isn't a valid assumption. What's in the error message is information to help us best advise those hitting it on how they can improve the situation, and that includes some factors completely outside our control that do manifest in stalls/timeouts/etc. - thread pool exhaustion being the absolute #1 thing people hit in their applications... we're just the ones alerting them that it's happening, and that's why they get latency spikes.

vladislav-karamfilov commented 1 year ago

@NickCraver , thanks for clarifying this! 🙇‍♂️

We are monitoring our app logs and the Redis resource in Azure and it seems to be working fine after the improvements we've made. Thank you for the suggestion!

pianoman4873 commented 10 months ago

Adding more to this thread to avoid duplication -

I'm getting timeouts of this type -

(outbound=1KiB, inbound=0KiB, 30468ms elapsed, timeout is 30000ms), command=GET, next: INFO, inst: 0, qu: 0, qs: 109, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 1847, cur-in: 0, sync-ops: 223438, async-ops: 723026, serverEndpoint: 10.71.87.234:6379, conn-sec: 10458.31, aoc: 1, mc: 1/1/0, mgr: 9 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 8459, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 222, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 8191, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause

next: GET U_IPES.1:S_3017822, inst: 0, qu: 4, qs: 617, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 0, out-pipe: 32, last-in: 256, cur-in: 1211, sync-ops: 223449, async-ops: 723073, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.57, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=4,Free=996,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts:

next: INFO, inst: 45, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 404, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

xt: INFO, inst: 0, qu: 1, qs: 624, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 661, out-pipe: 59, last-in: 8191, cur-in: 0, sync-ops: 223445, async-ops: 723058, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.34, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=2,Free=998,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts:

They are very similar. The main issue I see is the high number of busy WORKER threads (Busy=1090) - could this again be due to the high number of sync-ops (223445)? So the remedy would be to try to move to async as much as possible?

Any additional insights ?

Thanks !

PS BTW, the sync operations don't necessarily have to be Redis-related, correct?
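
To illustrate what I mean in the PS - anything that blocks a pool thread shows up in that WORKER Busy count, Redis call or not; the URLs below are just placeholders for the pattern:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

var http = new HttpClient();

// When executed on thread-pool threads (request handlers, Task.Run work, etc.),
// each of these holds the thread for the full duration of the wait and
// contributes to WORKER Busy, even though no Redis call is involved:
string body1 = http.GetStringAsync("https://example.org/").Result;
Task.Delay(1000).Wait();

// The async equivalents release the thread while the work is pending:
string body2 = await http.GetStringAsync("https://example.org/");
await Task.Delay(1000);
```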