StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Timeout with nothing queued? #2627

Open slorello89 opened 6 months ago

slorello89 commented 6 months ago

Hey @mgravell & @NickCraver

We have a customer who's encountering an odd timeout (or at least one I'd consider odd). Take a look at a couple of their timeouts:

Exception : Timeout performing EVAL (5000ms), next: EVAL, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 1, cur-in: 892928, sync-ops: 195781, async-ops: 0, serverEndpoint: *****, conn-sec: 30658.25, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: ****, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=1,Free=32766,Min=4,Max=32767), v: 2.6.122.38350 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Exception : Timeout performing EVAL (5000ms), next: EVAL, inst: 11, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, last-in: 1, cur-in: 94212, sync-ops: 58206, async-ops: 0, serverEndpoint: ****, conn-sec: 4432.43, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: ****, IOCP: (Busy=0,Free=1000,Min=20,Max=1000), WORKER: (Busy=45,Free=32722,Min=200,Max=32767), v: 2.6.122.38350

A few background bits:

  1. The EVALs are all coming out of Microsoft.Web.RedisSessionStateProvider. That's their usage of Redis; they are moving from in-memory session state to distributed session state.
  2. The first thing I noticed was that cur-in was quite hefty, at ~1MB and ~100KB respectively, which is a lot for session state. They're pretty confident this is about as big as their sessions get, and I've asked them to confirm these sizes. Their bandwidth to Redis appears sufficient to handle payloads of this size at relatively low volumes.
  3. They do seem to be having large CPU spikes during peak hours, which we think are occurring in tandem with the timeouts.

I mean, so far this is pretty standard stuff. But wait, notice how there's nothing queued waiting to be sent (qu: 0) or queued awaiting a response (qs: 0)? I think something's missing from this picture. It seems like it's not accurately counting the number of messages in the message queue. Have you encountered anything like this?
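
For anyone comparing several of these timeouts side by side, here is a minimal Python sketch that pulls the `key: value` pairs out of the exception text and checks for the "nothing queued" pattern described above. It is not part of StackExchange.Redis. The helper names (`parse_timeout`, `nothing_queued`) are hypothetical, the sample is a trimmed copy of the first exception with the masked fields left out, and the parsing assumes the comma-separated layout seen in these two messages.

```python
import re

# Matches "key: value" pairs separated by ", ". A value is either a
# parenthesised pool block such as "(Busy=1,Free=32766,...)" or plain
# text running up to the next comma.
FIELD = re.compile(r"(?:^|, )([A-Za-z-]+): (\([^)]*\)|[^,]*)")


def parse_timeout(message):
    """Return the diagnostic fields of a timeout message as a dict."""
    fields = {}
    for key, value in FIELD.findall(message):
        value = value.strip()
        if value.startswith("("):
            # Thread-pool stats: turn "(Busy=1,Free=2)" into {"Busy": 1, "Free": 2}.
            pairs = (part.split("=") for part in value[1:-1].split(","))
            fields[key] = {name: int(count) for name, count in pairs}
        elif value.isdigit():
            fields[key] = int(value)
        else:
            fields[key] = value
    return fields


def nothing_queued(fields):
    """True when both queue counters (qu and qs) are reported as zero."""
    return fields.get("qu") == 0 and fields.get("qs") == 0


# Trimmed copy of the first exception in this issue (masked fields omitted).
sample = (
    "Timeout performing EVAL (5000ms), next: EVAL, inst: 0, qu: 0, qs: 0, "
    "aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 1, "
    "cur-in: 892928, sync-ops: 195781, async-ops: 0, conn-sec: 30658.25, "
    "mc: 1/1/0, mgr: 10 of 10 available, "
    "IOCP: (Busy=0,Free=1000,Min=4,Max=1000), "
    "WORKER: (Busy=1,Free=32766,Min=4,Max=32767), v: 2.6.122.38350"
)

fields = parse_timeout(sample)
print(nothing_queued(fields), fields["cur-in"], fields["WORKER"]["Busy"])
```

Running the same helper over every logged timeout would show whether qu and qs are always zero when cur-in is large, or only in some of them.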

NickCraver commented 6 months ago

Interesting! Currently wondering:

  1. How big are those payloads in full? (It seems like we're awaiting even more of a response here on the first one, and I agree those seem quite large...and I've seen session state get pretty darn large.)
  2. Anything in SLOWLOG server-side?
  3. How far away is the server, latency-wise?
  4. When you say CPU spikes: client or server? The second definitely looks like threads are spiking up, granted there's a min 200 but looks like decent server load compared to standard idle in the first.
  5. What kind of hardware/specs are we talking about here? Mostly curious about what kind of CPU we're talking about...is this a 2/4/8/etc. core machine?

slorello89 commented 6 months ago

Hi @NickCraver

  1. How big are those payloads in full? (It seems like we're awaiting even more of a response here on the first one, and I agree those seem quite large...and I've seen session state get pretty darn large.)
  2. Anything in SLOWLOG server-side?
  3. How far away is the server, latency-wise?
  4. When you say CPU spikes: client or server? The second definitely looks like threads are spiking up, granted there's a min 200 but looks like decent server load compared to standard idle in the first.
  5. What kind of hardware/specs are we talking about here? Mostly curious about what kind of CPU we're talking about...is this a 2/4/8/etc. core machine?

We've asked them to try using beefier machines to see if that resolves things for them. But regardless of whether it's a bandwidth-bound or CPU-bound thing, it's bizarre that we're seeing these errors when nothing is queued on either end of the message queue.

schoon commented 1 month ago

Any updates on this? Do you need any more information? The customer is still having the same problem.