StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.9k stars 1.51k forks source link

Getting timeouts from client despite server processing all requests quickly #2257

Open Yaytay opened 2 years ago

Yaytay commented 2 years ago

We have a dotnet core process running on linux (usually in docker, but the behaviour is the same without it). Using StackExchange.Redis 2.6.66 and dotnet SDK 6.0.8.

We are having a problem where all processing hangs for seconds - new requests are received but nothing returns until a few seconds later when it all comes back. I have a rather messy Java command that tests this and outputs an SVG that shows the individual requests and when they are blocked: picture-ats-staging-30-30-1663920632056 Each vertical bar there is second, each row is a thread and each box is a request - the colours are different Host headers sent on the requests. There is no load on the dotnet VM at the time of the testing other than the test itself, but the redis VM is shared whilst we are running on a bare VM (in docker it's dedicated).

Each request makes ~14 redis GETs (some of which are duplicates) and then returns a 302 redirect.

Some of the requests are throwing an exception like this: Timeout performing GET (5000ms), next: GET :ModuleConfiguration, inst: 1, qu: 0, qs: 10, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 12106, in-pipe: 0, out-pipe: 0, serverEndpoint: :6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: (SE.Redis-v2.6.66.47313), IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=10,Free=32757,Min=4,Max=32767), POOL: (Threads=10,QueuedItems=16,CompletedItems=258256), v: 2.6.66.47313 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in //src/StackExchange.Redis/ConnectionMultiplexer.cs:line 1886 at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in //src/StackExchange.Redis/RedisDatabase.cs:line 2959 at <...>.Core.Services.Caching.KeyPrefixedByHostNameRedisCache.TryRetrieveItem[T](String key) in

In order to make the error occur we have to hit it with 10+ threads, at low loads it's all happy. The VM it's running on has four cores.

I have packet captures of the entire test sequence and cannot see any requests to redis taking more than a few milliseconds. It's hard to be 100% sure of this because it has ~28k packets in it and the multiplexing means I'm not sure that I am allocating responses to requests correctly. We also have the redis server set with latency-monitor-threshold 100 and it is not reporting anything.

So this appears to be an issue with (our use of) the redis client. Are you able to provide recommendations for where we should dig further?

Thanks.

NickCraver commented 2 years ago

It looks like either a client system stall completely (check CPU at the time?) or something stalled the Redis side - recommend looking at processor logs for the client and SLOWLOG on the Redis server to see if something stalled it for a while there - none of your numbers look bad there it just seems that we're at the tail end beginning to process what's on the pipe at the end of 5 seconds, which is like almost exactly a 5 second stall which raises my eyebrows a bit.

One other thought is a huge payload blocking all others if that's in play here.

Yaytay commented 2 years ago

The worst I see in the slowlog is 200154, which I understand is 200ms - everything else is 10-15000. The CPU usage isn't more than 10-20% (the redis VM is single core, the dotnet VM is four core).

Am I right in thinking that multiplexed responses are strictly in the same order as the requests? Knowing that might help me identify something more in the packet captures - at the moment I do not think any requests are taking a more than a few ms.

How big is huge? A lot of responses are 2-3KB and there are some SETEX commands of ~15KB.

Thanks.

NickCraver commented 2 years ago

Am I right in thinking that multiplexed responses are strictly in the same order as the requests?

Yep, correct - we absolutely depend on that fact :)

How big is huge?

We're talking many megabytes usually here, on most systems like 50MB+ unless we're on tiny hardware...enough to saturate the pipe for that long basically.

Do you have more exceptions by chance? It can be you see one at the end or the start that isn't the best indicator, more samples may help advise here.

Yaytay commented 2 years ago

Running a stack trace whilst the test is blocked gives 14 threads like this:

Thread (0x48C95): [Native Frames] StackExchange.Redis!StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl(class StackExchange.Redis.Message,class StackExchange.Redis.ResultProcessor`1<!!0>,class StackExchange.Redis.ServerEndPoint,!!0) StackExchange.Redis!StackExchange.Redis.RedisDatabase.StringGet(value class StackExchange.Redis.RedisKey,value class StackExchange.Redis.CommandFlags) CA.GtiAts.Infrastructure!CA.GtiAts.Core.Services.Caching.KeyPrefixedByHostNameRedisCache.TryRetrieveItem(class System.String)

(some say [Native Frames] and others say CPU_TIME)

And one thread like this: Thread (0x48C6A): [Native Frames] System.Private.CoreLib!System.Threading.WaitHandle.WaitOneNoCheck(int32) StackExchange.Redis!StackExchange.Redis.PhysicalBridge+d__97.MoveNext() System.Private.CoreLib!System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start(!!0&)

But this does not necessarily mean much, because I expect it to spend most of its time waiting for redis on these trivial code paths - that's the only RPC happening.

Yaytay commented 2 years ago

This is another error from one of the first to fail: Timeout performing GET (5000ms), next: GET [HOST]:ModuleConfiguration, inst: 1, qu: 0, qs: 22, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 18135, in-pipe: 0, out-pipe: 0, serverEndpoint: [REDIS]:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: VM, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=22,Free=32745,Min=4,Max=32767), POOL: (Threads=22,QueuedItems=11,CompletedItems=2521), v: 2.6.66.47313 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

and this is from later in the same lockup: Timeout performing GET (5000ms), next: GET [HOST]:ModuleConfiguration, inst: 1, qu: 0, qs: 22, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 17839, in-pipe: 0, out-pipe: 0, serverEndpoint: [REDIS]:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: VM, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=22,Free=32745,Min=4,Max=32767), POOL: (Threads=22,QueuedItems=11,CompletedItems=2521), v: 2.6.66.47313 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Looking at the packet capture between these times there are a few sequences of multiplexed requests, but, but even they are all replied in ms.

Yaytay commented 2 years ago

One thing is that we do have multiple concurrent requests for exactly the same key. This means that I can't categorically identify the failing requests from packet captures. I was also wondering whether there was anything in the library that might be upset by that.

Yaytay commented 2 years ago

The problem only seems to occur on Linux. Running the same dotnet app on a Windows VM with the same spec, same config, same dotnet version does not cause the same lockup to occur. Up to 100 client threads and no lockups.

NickCraver commented 1 year ago

I'm not sure what to suggest here because maybe a dotnet-dump trigger when it happens to catch it in the act? I can't think of anything Linux-specific we've seen that matches your symptoms (even close) but that's for sure a total stall. Whether it's network or CPU I'm not sure. In the next release unrelated to this I'm adding a few more numbers to exception messages (on MyGet now if curious) https://github.com/StackExchange/StackExchange.Redis/pull/2276 but I don't think it'll tell us much for your case.

But, you do have ~17-18K in the read buffer on the last 2 exceptions which means we're either stalling processing something off (possible!) or it's waiting for the socket to finish getting data (OS issue).