StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Timeout awaiting response #2496

Open kaluznyt opened 1 year ago

kaluznyt commented 1 year ago

Hi,

I would like to ask about some advice regarding Redis timeouts. I'll try to describe the issue here:

We've been encountering timeouts on Redis, but not constantly: they happen every 4 hours. We think it's because we have batch jobs adding items to the cache at those times, although that's still a bit suspicious, since those jobs run all the time; most likely there is simply more load at those moments.

We have another app that uses the cache quite heavily and records these Redis timeouts; there aren't many, and they occur only around those particular times (every 4 hours, as mentioned).

We tried various things, mostly those suggested in the article linked from the error message (https://stackexchange.github.io/StackExchange.Redis/Timeouts).

We introduced connection multiplexer pooling and separated the connection multiplexer pool used by the endpoint that times out, so it doesn't interfere with the other endpoint used by the batch job. Finally, we raised the MIN worker threads on the ThreadPool to 10; the pods have just 1 CPU, so it was MIN=1 by default.
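Roughly, the two changes look like this (a simplified sketch rather than our exact code; the pool type, names, and sizes are illustrative):

```csharp
// Simplified sketch of the two changes described above (illustrative only).
// Assumes StackExchange.Redis and .NET 6+.
using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisSetup
{
    // Raise the minimum worker threads so a burst of async completions doesn't
    // have to wait for the thread pool to grow (pods have 1 CPU, so the default
    // minimum was 1).
    public static void RaiseMinThreads()
    {
        ThreadPool.GetMinThreads(out _, out var minIocp);
        ThreadPool.SetMinThreads(workerThreads: 10, completionPortThreads: minIocp);
    }
}

// A very small round-robin multiplexer pool; we create one pool per logical
// endpoint so the batch-write traffic and the read-heavy traffic don't share
// a socket.
public sealed class MultiplexerPool : IDisposable
{
    private readonly ConnectionMultiplexer[] _connections;
    private int _next = -1;

    public MultiplexerPool(string configuration, int size)
    {
        _connections = new ConnectionMultiplexer[size];
        for (var i = 0; i < size; i++)
            _connections[i] = ConnectionMultiplexer.Connect(configuration);
    }

    public IDatabase GetDatabase()
    {
        // Round-robin across the connections in the pool.
        var index = (uint)Interlocked.Increment(ref _next) % (uint)_connections.Length;
        return _connections[index].GetDatabase();
    }

    public void Dispose()
    {
        foreach (var connection in _connections) connection.Dispose();
    }
}
```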

None of those got rid of the timeouts, although there do seem to be some improvements in Redis latency and in some of the parameters reported on the timeout exception (qs, for example, is now quite low).

Do you see anything that could be improved based on those exceptions?

This is one of the sample exceptions from before we made any 'improvements': Timeout awaiting response (outbound=188KiB, inbound=0KiB, 5040ms elapsed, timeout is 5000ms), command=MGET, next: MSET, inst: 0, qu: 0, qs: 705, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 40737620, serverEndpoint: , conn-sec: 213442.35, mc: 1/1/0, mgr: 10 of 10 available, clientName: (SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=2,Free=32765,Min=1,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=487697475), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

The most recent error messages: Timeout awaiting response (outbound=45KiB, inbound=0KiB, 5220ms elapsed, timeout is 5000ms), command=MGET, next: MGET , inst: 0, qu: 0, qs: 123, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 178, cur-in: 0, sync-ops: 0, async-ops: 285, serverEndpoint: , conn-sec: 634.02, mc: 1/1/0, mgr: 9 of 10 available, clientName: (SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=4,Free=32763,Min=10,Max=32767), POOL: (Threads=15,QueuedItems=1,CompletedItems=1339961), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Timeout awaiting response (outbound=59KiB, inbound=0KiB, 5450ms elapsed, timeout is 5000ms), command=MGET, next: MGET, inst: 0, qu: 0, qs: 111, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 22695, serverEndpoint: , conn-sec: 746.1, mc: 1/1/0, mgr: 10 of 10 available, clientName: (SE.Redis-v2.6.90.64945), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=10,Max=32767), POOL: (Threads=17,QueuedItems=0,CompletedItems=1600796), v: 2.6.90.64945 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Thank you for any hints/suggestions that might help to resolve this.

NickCraver commented 1 year ago

I agree with your assessment that it's the batch jobs adding a lot, because your output consistently shows a significant buffer against the pipe that can't get out in time. The 2 most likely contention points here are:

  1. The direct connection to Redis on that multiplexer, being tied up by the batch sends to Redis and delaying other commands behind them.
  2. Overall bandwidth exhaustion, either in general or to the Redis endpoint.

I'm fuzzy on your attempted solution of a multiplexer per endpoint (this shouldn't change much of anything) - is there a separate multiplexer for the batch job as well? If not, that'd be my recommendation: separate its queue of commands from the others, giving it its own socket to pipeline on without delaying anything else trying to get a quick response from cache.
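Concretely, something along these lines - a rough sketch only, with placeholder connection strings:

```csharp
// Rough sketch: one multiplexer dedicated to the batch writer, a separate one
// (or pool) for latency-sensitive reads. Connection strings are placeholders.
using StackExchange.Redis;

public static class RedisConnections
{
    // The batch job gets its own socket/pipeline; a big burst of writes queued
    // here can no longer sit in front of the MGETs issued by the read path.
    public static readonly ConnectionMultiplexer BatchWriter =
        ConnectionMultiplexer.Connect("redis-host:6379,name=batch-writer");

    public static readonly ConnectionMultiplexer ReadPath =
        ConnectionMultiplexer.Connect("redis-host:6379,name=read-path");
}
```

Each multiplexer is a single socket and pipeline, so isolating the batch traffic keeps its queued writes from delaying the latency-sensitive reads.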

If it's actual bandwidth exhaustion, separating won't help much, since everything is being blocked by backpressure further down. In that case, you'd want to smooth out the spike from the batch job so that it doesn't exhaust bandwidth resources, e.g. by delaying after every so many operations or similar.
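A simple way to do that smoothing is to chunk the batch and pause between chunks - sketch below, with an arbitrary chunk size and delay that you'd tune against your actual bandwidth:

```csharp
// Sketch of spreading a large batch write out over time instead of sending it
// in one burst. Chunk size and delay are arbitrary and need tuning.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class BatchWriter
{
    public static async Task WriteSmoothedAsync(
        IDatabase db,
        IReadOnlyList<KeyValuePair<RedisKey, RedisValue>> items,
        int chunkSize = 500,
        TimeSpan? pause = null)
    {
        pause ??= TimeSpan.FromMilliseconds(50);

        for (var offset = 0; offset < items.Count; offset += chunkSize)
        {
            var chunk = items.Skip(offset).Take(chunkSize).ToArray();
            await db.StringSetAsync(chunk);  // one MSET per chunk
            await Task.Delay(pause.Value);   // let other traffic through
        }
    }
}
```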

kaluznyt commented 1 year ago

I'm fuzzy on your attempted solution of a multiplexer per endpoint (this shouldn't change much of anything) - is there a separate multiplexer for the batch job as well? If not, that'd be my recommendation: separate its queue of commands from the others, giving it its own socket to pipeline on without delaying anything else trying to get a quick response from cache.

Yeah, the batch job is actually calling one of the dedicated endpoints in the app, so that's why we thought about using a dedicated multiplexer pool for each endpoint (both the one used by the internal batch job, which mostly adds items to the cache, and the other one, used by third parties that read from the cache).

What we see consistently is that adding more multiplexers/connections lowered CPU on Redis a little and slightly improved average latency. However, it didn't fix the spikes/timeouts.

In RedisLabs we also see latency spikes around those times, with the slowlog showing commands taking around 20ms, which I guess is long for Redis.

BTW, does it make sense to also change the Min on the IOCP pool? We changed the WORKER pool MIN to 10, but that setting hasn't affected the IOCP pool (I guess there is no csproj setting for that one).
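For context, the only lever we found for that one is the second parameter of ThreadPool.SetMinThreads, set at startup (a sketch; we're not sure how much the IOCP minimum even matters on Linux pods):

```csharp
// Sketch: raising both minimums programmatically at startup. The second
// argument is the completion-port ("IOCP") minimum; on Linux that pool is
// largely reported for diagnostics, so its effect here is uncertain.
using System.Threading;

public static class ThreadPoolConfig
{
    public static void Apply()
    {
        ThreadPool.SetMinThreads(workerThreads: 10, completionPortThreads: 10);
    }
}
```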

kaluznyt commented 1 year ago
[attached image: graphs]
NickCraver commented 1 year ago

I can see from POOL in your exception that you're on .NET Core, so what I'd be looking at here is "what else is going on?" There are metrics for thread pool thread count and queue length, which are low at the time of the messages above, but I wonder if we're seeing a delayed reaction and an inability to keep up after an earlier spike, given how brief your graphs here are. Any chance you're already emitting those core metrics and can compare graphs?
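(If you aren't already exporting them, the raw numbers are available directly on ThreadPool in .NET Core 3.0+ - a minimal sketch:)

```csharp
// Minimal sketch: the thread pool exposes these counters directly on
// .NET Core 3.0+, so they can be logged or pushed to any metrics pipeline.
using System;
using System.Threading;

public static class ThreadPoolMetrics
{
    public static void Sample()
    {
        Console.WriteLine($"threads={ThreadPool.ThreadCount}, " +
                          $"queued={ThreadPool.PendingWorkItemCount}, " +
                          $"completed={ThreadPool.CompletedWorkItemCount}");
    }
}
```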

kaluznyt commented 1 year ago

Yeah, I took a look at the metrics, and there doesn't seem to be too much queueing going on in the ThreadPool. However, I noticed there might be some correlation between time spent in GC and those timeouts; I need to look further into this.

NickCraver commented 1 year ago

Ah yeah, that would pause things, and depending on your GC mode, in a blocking way - you should have allocation counters to look at there too.
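(For a quick check without extra tooling, the runtime exposes pause and allocation numbers in-process - a sketch, assuming .NET 5 or later:)

```csharp
// Sketch of sampling GC pause and allocation numbers in-process (.NET 5+).
using System;
using System.Runtime;

public static class GcMetrics
{
    public static void Sample()
    {
        // PauseTimePercentage: share of elapsed time the runtime reports as GC pause.
        var info = GC.GetGCMemoryInfo();
        Console.WriteLine(
            $"serverGC={GCSettings.IsServerGC}, " +
            $"pause%={info.PauseTimePercentage}, " +
            $"gen2Collections={GC.CollectionCount(2)}, " +
            $"allocatedBytes={GC.GetTotalAllocatedBytes()}");
    }
}
```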