StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.88k stars 1.51k forks source link

Frequent RedisTimeoutException #1979

Closed Dao007forever closed 2 years ago

Dao007forever commented 2 years ago

Hi,

We have trying to debug this problem of very frequent RedisTimeoutException.

Environment: .Net Framework 4.7.2 SE version: 2.2.73.50744 Main reader state: ReadAsync.

Sample error message: System.IO.IOException: Error executing RedisClient operation. ---> StackExchange.Redis.RedisTimeoutException: Timeout performing MGET (300ms), next: INFO, inst: 2, qu: 0, qs: 3, aw: True, rs: ReadAsync, ws: Writing, in: 211, in-pipe: 307, out-pipe: 0, serverEndpoint: XXX, mc: 1/1/0, mgr: 2 of 5 available, clientName: XXX, IOCP: (Busy=2,Free=998,Min=64,Max=1000), WORKER: (Busy=99,Free=32668,Min=1024,Max=32767), v: 2.2.73.50744 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

StackExchange.Redis.RedisTimeoutException: Timeout performing MGET (300ms), inst: 23, qu: 0, qs: 0, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: XXX, mc: 1/1/0, mgr: 5 of 5 available, clientName: XXX, IOCP: (Busy=5,Free=995,Min=64,Max=1000), WORKER: (Busy=118,Free=32649,Min=1024,Max=32767), v: 2.2.73.50744 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

(all stats are 0, but still timeout?)

and less frequently

System.IO.IOException: Error executing RedisClient operation. ---> StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=GET, timeout: 300, outbound: 0KiB, inbound: 0KiB, next: GET XYZ, inst: 1, qu: 0, qs: 1, aw: True, rs: ReadAsync, ws: Writing, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: XXX, mc: 1/1/0, mgr: 4 of 5 available, clientName: XXX, IOCP: (Busy=0,Free=1000,Min=64,Max=1000), WORKER: (Busy=108,Free=32659,Min=1024,Max=32767), v: 2.2.73.50744 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Previously, we noticed this could be a thread-starvation issue in https://github.com/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalConnection.cs#L1587 where the await input.ReadAsync().ForAwait(); is not scheduled (when Busy worker = min worker). We fixed that by increasing the MinThreads and limit the concurrency of handling requests.

Now in all of the Timeout Exception, Busy << Min.

We also check slowlog on servers and the slowest ones are ~15ms.

At this point, I'm not sure what else we should investigate.

NickCraver commented 2 years ago

Overall: The busy thread count is still very high - you've got a thread pool that's quite busy and likely unable to handle the load. The way I'd debug this is to get a dump when it happens and see what those 90+ threads are doing. It may be a lock or sync-over-async, or some other hold-up, but in any case it's very, very likely your exhaustion issue here.

If you're using Visual Studio, just dropping the dump file in there will give you some good debugging tools to look at threads. If not, there's other tooling but unsure what your options are to advise there. For example, here's a LINQPad sample I wrote to analyze a dump, so there are several free paths there.

Dao007forever commented 2 years ago

I did a dottrace but nothing unusual in our code. We do have sync-over-async but the Threadpool is set high enough with low concurrency (256 concurrent requests only) so I'm not sure what could be the problem.

Also this problem is worse when we migrate to Redis Cluster. Before that it was occasional, but much worse now.

Screen Shot 2022-02-08 at 8 23 22 PM

Lock contention:

Screen Shot 2022-02-08 at 8 25 48 PM
NickCraver commented 2 years ago

@Dao007forever If using dotTrace - look at the waiting threads in the timeline view you've got (or run a timeline view) - the threadpool workers are sleeping, not eating CPU and they're on their own thread pool, so won't be blocking the primary.

Dao007forever commented 2 years ago

Filtered to "Waiting" threads:

Screen Shot 2022-02-08 at 8 37 27 PM

Also, we are still on Synchronous controller and are using Sync calls.

To guard against thread-starvation, we need to have a high Min threads. Is there a way to prevent the starvation at https://github.com/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalConnection.cs#L1587 even in case of sync-over-async, like create a SynchronizationContext, to make sure the callback isn't enqueue to ThreadPool. Having reader loop execute by ThreadPool seems like a timed-bomb to me?

Update: Drop the minThreads to 64 and # mgr from 5 to 2 to reduce contention. We also try to remove as much sync-over-async to reduce chance of thread starvation. Much less frequent timeout, but still

System.IO.IOException: Error executing RedisClient operation. ---> StackExchange.Redis.RedisTimeoutException: Timeout performing GET (300ms), active: GET, next: GET ** possible thread-theft indicated; see https://stackexchange.github.io/StackExchange.Redis/ThreadTheft ** , inst: 82, qu: 0, qs: 4, aw: False, rs: CompletePendingMessageSync, ws: Idle, in: 0, in-pipe: 1664, out-pipe: 0, serverEndpoint: XXX, mc: 1/1/0, mgr: 1 of 2 available, clientName: XXX, PerfCounterHelperkeyHashSlot: 7674, IOCP: (Busy=0,Free=1000,Min=64,Max=1000), WORKER: (Busy=8,Free=32759,Min=64,Max=32767), v: 2.2.73.50744 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

or

System.IO.IOException: Error executing RedisClient operation. ---> StackExchange.Redis.RedisTimeoutException: Timeout performing GET (300ms), next: GET, inst: 3, qu: 0, qs: 5, aw: False, rs: ReadAsync, ws: Idle, in: 778, in-pipe: 0, out-pipe: 0, serverEndpoint: XXX, mc: 1/1/0, mgr: 0 of 2 available, clientName: XXX, PerfCounterHelperkeyHashSlot: 16335, IOCP: (Busy=0,Free=1000,Min=64,Max=1000), WORKER: (Busy=41,Free=32726,Min=64,Max=32767), v: 2.2.73.50744 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor1 processor, ServerEndPoint server)

NickCraver commented 2 years ago

Overall your peak thread count is going to be at your peak sync load. The way to eliminate this it to go full async, instead of tying up threads synchronously. Right now it looks like you have a very low timeout (max time anything ever can take) at 300ms and thread starvation - that combination is going to be very error-throwing-prone.

My suggestions are: fully embrace async, and also consider if 300ms as an absolute upper bound ever is the correct value.

Dao007forever commented 2 years ago

Agree. It's just tough to go full async currently but it's on our goals. I understand not many people are still using sync, but is having a dedicate thread for the reader loop going to benefit both sync and async?

NickCraver commented 2 years ago

@Dao007forever Which .NET version are you on? You can move to the shared threadpool already just to eliminate noise from the profile. In your ConfigurationOptions set the SocketManager, here's an example:

var conn = await ConnectionMultiplexer.ConnectAsync("localhost:6379", optins => options.SocketManager = SocketManager.ThreadPool);

(you can set it however you're constructing ConfigurationOptions)

Dao007forever commented 2 years ago

We are still on 4.7.2. We tried ThreadPool for SocketManager previously, but its tail latency is higher, i.e. perf is less predictable than the dedicated ones. We identified the threadpool pressure and resolved it, which improved the frequency of those timeout.

I'll close this now and re-open if anything happens. Thanks for your help, Nick!

NickCraver commented 2 years ago

Ahh gotcha, yes on .NET Full Framework that's not going to be awesome, but on .NET Core+ (especially 6.0+) it's much improved. If you can at all make the jump, cases like this are a huge reason to do so. I hope you're able to upgrade eventually to get all the wins on the table there - good luck!

chengang-hub commented 2 years ago

@NickCraver i need your help!!! thanks!

the COMPlus_ThreadPool_ForceMinWorkerThreads : 3027

I am also getting TimeoutException:

StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: GET Identity:Session:1j8fP7Ai2ESGgEWcKOMN+pkk, inst: 3, qu: 0, qs: 10, aw: False, rs: ReadAsync, ws: Idle, in: 8384, in-pipe: 0, out-pipe: 0, serverEndpoint: , mgr: 9 of 10 available, clientName: yingkehuiyuan-5db544c847-t9vll, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2963,Free=29804,Min=12402,Max=32767), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: GET Identity:Session:1j8fP7Ai2ESGgEWcKOMN+pkk, inst: 3, qu: 0, qs: 10, aw: False, rs: ReadAsync, ws: Idle, in: 8384, in-pipe: 0, out-pipe: 0, serverEndpoint: , mgr: 9 of 10 available, clientName: yingkehuiyuan-5db544c847-t9vll, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2980,Free=29787,Min=12402,Max=32767), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: GET Identity:Session:1j8fP7Ai2ESGgEWcKOMN+pkk, inst: 3, qu: 0, qs: 10, aw: False, rs: ReadAsync, ws: Idle, in: 8384, in-pipe: 0, out-pipe: 0, serverEndpoint: mgr: 9 of 10 available, clientName: yingkehuiyuan-5db544c847-t9vll, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=3017,Free=29750,Min=12402,Max=32767), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: GET Identity:Session:1j8fP7Ai2ESGgEWcKOMN+pkk, inst: 3, qu: 0, qs: 10, aw: False, rs: ReadAsync, ws: Idle, in: 8384, in-pipe: 0, out-pipe: 0, serverEndpoint: , mgr: 9 of 10 available, clientName: yingkehuiyuan-5db544c847-t9vll, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2963,Free=29804,Min=12402,Max=32767), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), active: GET Identity:Session:5/+Moz+Oz0uXO3T19EYe6cxq, next: GET Identity:Session:9PnUYOgWoUCasxxR5e8Jtkt3, inst: 4, qu: 0, qs: 2, aw: False, rs: CompletePendingMessage, ws: Idle, in: 0, in-pipe: 4086, out-pipe: 0, serverEndpoint: , mgr: 7 of 10 available, clientName: yingkehuiyuan-5db544c847-t9vll, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=1482,Free=31285,Min=12402,Max=32767), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

NickCraver commented 2 years ago

@chengang-hub please don't spam every issue with your exception - open a single issue and ask for help. Spamming everyone's email across 5 closed issues is not an appropriate use of GitHub.