StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.84k stars 1.5k forks source link

Timeout exceptions #2681

Closed mattchidley closed 1 month ago

mattchidley commented 3 months ago

Can someone help me understand these exceptions and maybe point at what could be going wrong for us?

Here is our exception:

Timeout awaiting response (outbound=234457KiB, inbound=35KiB, 5020ms elapsed, timeout is 5000ms), command=XADD, next: XREAD X:datachanged, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: TryParseResult, ws: Idle, in: 0, last-in: 15, cur-in: 240084771, sync-ops: 0, async-ops: 898910, serverEndpoint: X, conn-sec: 3254.58, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: X(SE.Redis-v2.7.27.49176), IOCP: (Busy=0,Free=2000,Min=2000,Max=2000), WORKER: (Busy=114,Free=32653,Min=2000,Max=32767), POOL: (Threads=114,QueuedItems=3,CompletedItems=68656823,Timers=162), v: 2.7.27.49176

What is unclear is: is outbound the total size in queue to be sent to Redis, or is it just for this command?

We are using the ConnectionMultiplexer, I have read https://stackexchange.github.io/StackExchange.Redis/Timeouts, but some of these abbreviations are not noted in this guide.

If you could help shed some light on what could be going wrong for us, that'd be great. Essentially what we're experiencing is threads piling up with these timeouts, and then we appear to lose connection/stop communicating with Redis entirely (because we're constantly timing out?)

mgravell commented 3 months ago

unless my math is wrong... there's 234MiB traffic down the pipe between when the message was enqueued and now; that's quite a lot of data (although data volume is subjective and contextual) - so I'm guessing you're simply saturating bandwidth and/or server load - plus of course that will put a non-trivial strain on the client. Do those numbers sound right for your usage?

mattchidley commented 3 months ago

We're a pretty large application with many threads writing to Redis at any given time, and we are using async. Our Redis server is not under load in that CPU and Memory look fine. I'll have to look into network throughput.

Can you confirm. This command for example was to add a new message to the stream (of unknown size in these logs), and that timed out because during this action another 234MiB was added to the stream?

Are there Redis server logs which might help us? What could we look for?

Just to add, on our Redis server we are seeing these logs:

34490:C 19 Mar 2024 12:15:48.806 Fork CoW for RDB: current 2 MB, peak 2 MB, average 2 MB 8:M 19 Mar 2024 12:15:49.015 Background saving terminated with success 8:M 19 Mar 2024 12:16:50.065 10000 changes in 60 seconds. Saving... 8:M 19 Mar 2024 12:16:50.243 Background saving started by pid 34761 34761:C 19 Mar 2024 12:17:44.438 DB saved on disk 34761:C 19 Mar 2024 12:17:44.640 Fork CoW for RDB: current 2 MB, peak 2 MB, average 2 MB 8:M 19 Mar 2024 12:17:44.892 Background saving terminated with success 8:M 19 Mar 2024 12:18:45.033 10000 changes in 60 seconds. Saving... 8:M 19 Mar 2024 12:18:45.213 Background saving started by pid 35056 35056:C 19 Mar 2024 12:19:38.109 DB saved on disk 35056:C 19 Mar 2024 12:19:38.307 Fork CoW for RDB: current 660 MB, peak 660 MB, average 660 MB 8:M 19 Mar 2024 12:19:38.605 Background saving terminated with success 8:M 19 Mar 2024 12:20:39.014 10000 changes in 60 seconds. Saving... 8:M 19 Mar 2024 12:20:39.198 Background saving started by pid 35314 35314:C 19 Mar 2024 12:21:28.978 DB saved on disk 35314:C 19 Mar 2024 12:21:29.177 Fork CoW for RDB: current 14 MB, peak 14 MB, average 14 MB

mgravell commented 3 months ago

The most useful server side thing is always: slowlog; does that report any concerning operations? My comment on number is based on the numbers in the exception; if I had to guess: some fairly chunky BLOBs moving around rather than pure volume of commands. I have some work actively in progress to remove a bottleneck in the processing loop, which could also be a factor with these numbers. Unfortunately, timeouts are inherently a little tricky to opine on - from the clients perspective, all we know is "we didn't process the result of X in time" - we can't say exactly where it got stuck, or how long each step from server to client to your application took.

mattchidley commented 3 months ago

Thanks @mgravell can you confirm that the outbound=234457KiB is data to be sent to redis, or data to be sent from redis?

One thing we just learned is that we are using XREAD commands and we have not been setting the count.

mgravell commented 3 months ago

I need to check (not at a PC), but I think that's the delta in total bytes sent between enqueue and timeout; however, the number would make more sense as the total bytes received, so please do let me double check that. Due to how the buffers currently work, it is possible that we've inverted them. I'll check.

mattchidley commented 3 months ago

@mgravell is there a doc somewhere that describes these debug logs?

mgravell commented 3 months ago

There is a link in the exception. It is hard to be exhaustive on all interpretations of the data, though.

NickCraver commented 1 month ago

Closing out to tidy up, but happy to re-open if there's a follow-up question.