StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

RedisTimeoutException performing GET / SETEX #2552

Closed: phil000 closed this issue 11 months ago

phil000 commented 12 months ago

We've recently started receiving periodic timeout exceptions, up to about 20-30 a day.

There is no reason for it as far as we can see. Our app server load is low, and cache usage is also low.

We are using Azure Redis Standard C1. Server load is 0-40% and is mainly caused by replication. Memory usage is 100MB.

The reader state ("rs") is always ReadAsync.

Any idea what could be going wrong?

e.g.

RedisCacheProvider: Failed to get item. Timeout performing GET (5000ms), 
next: GET 2:ProductService::AllTagsMetadata+PROD, 
inst: 0,
qu: 0, 
qs: 1, 
aw: False, 
bw: Inactive, 
rs: ReadAsync, 
ws: Idle, 
in: 0, in-pipe: 0, out-pipe: 0, last-in: 866, cur-in: 0, 
sync-ops: 17, async-ops: 1, 
serverEndpoint: x.redis.cache.windows.net, 
conn-sec: 519.67, 
aoc: 0, 
mc: 1/1/0, 
mgr: 10 of 10 available, 
IOCP: (Busy=0,Free=1000,Min=100,Max=1000), 
WORKER: (Busy=1,Free=1022,Min=100,Max=1023), 
POOL: (Threads=4,QueuedItems=0,CompletedItems=1902,Timers=15), 
v: 2.6.122.38350

Call stack:

StackExchange.Redis.RedisTimeoutException:
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:2097)
   at StackExchange.Redis.RedisBase.ExecuteSync (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/RedisBase.cs:62)
   at StackExchange.Redis.RedisDatabase.StringGet (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/RedisDatabase.cs:2995)
   at Lib.Core.Caching.Redis.RedisCacheProvider.GetItem (Lib, Version=2.81.7.0, Culture=neutral, PublicKeyToken=null: /_/Lib/Core/Caching/Redis/RedisCacheProvider.cs:132)
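
For context, the 5000ms in these messages is the multiplexer's configured timeout (5 seconds is the library default). A minimal sketch of raising it through ConfigurationOptions, assuming a single shared multiplexer (the endpoint string here is illustrative):

    using StackExchange.Redis;

    var options = ConfigurationOptions.Parse("x.redis.cache.windows.net:6380,ssl=true,abortConnect=false");
    options.SyncTimeout = 10000;    // sync calls such as the StringGet in the stack above
    options.AsyncTimeout = 10000;   // async counterparts
    options.ConnectTimeout = 15000;

    ConnectionMultiplexer muxer = await ConnectionMultiplexer.ConnectAsync(options);
    IDatabase db = muxer.GetDatabase();

A longer timeout only buys headroom while the underlying cause is found; it doesn't remove the latency.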

General

Common client-side causes to rule out:

Traffic burst and thread pool configuration (see the sketch after this list)

Large key value

High CPU on client hosts

Network bandwidth limitation on client hosts

Long running commands
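
For the thread pool item above, a common mitigation is raising the pool minimums so a burst of work doesn't stall behind thread injection. A sketch with illustrative numbers (the Min=100 values in the first report suggest something similar is already configured here):

    using System;
    using System.Threading;

    ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
    Console.WriteLine($"current minimums: worker={minWorker}, iocp={minIocp}");

    // Raise the floor so bursts are served without ramp-up delay;
    // 100/100 is an assumption, not a universal recommendation.
    ThreadPool.SetMinThreads(100, 100);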

CRidge commented 11 months ago

We are seeing a very similar situation: similar value sizes, at times a somewhat higher frequency but never more than 5 requests per second, commands like setclientaddr as the only SLOWLOG entries, and nothing out of the usual in the metrics. Our problems started about a week ago, and a week of work trying to figure it out hasn't yielded any results.

We're using Standard C2, and I can't see anything in our usage that C2 shouldn't handle comfortably. We're not doing anything different in how we use the cache compared to the last year or so, when everything worked as expected.

We are not seeing timeouts very often, but GETs/SETs frequently take seconds, while the exact same calls take a few milliseconds at other times. We see no pattern.

phil000 commented 11 months ago

I've raised a support issue with Microsoft, but as per usual they can't see anything wrong and 'the issue must be on the client'.

We measure the performance of all GETs and SETs and log aggregates every hour. Performance has become worse recently; the average GET is now 7-9ms.

For comparison, we have now (temporarily or permanently) removed Redis from our architecture and migrated to Cosmos DB configured more as a cache, and average GET performance is 2-3ms using point lookups for the same data.
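
As a point of reference, a minimal sketch of the kind of per-call timing described above (the key is taken from the earlier error, the 100ms threshold is arbitrary, and db is assumed to be the shared IDatabase):

    using System;
    using System.Diagnostics;
    using StackExchange.Redis;

    var sw = Stopwatch.StartNew();
    RedisValue value = db.StringGet("2:ProductService::AllTagsMetadata+PROD");
    sw.Stop();

    // Feed an hourly aggregate in real code; here we just flag slow calls.
    if (sw.ElapsedMilliseconds > 100)
        Console.WriteLine($"slow GET: {sw.ElapsedMilliseconds}ms");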

NickCraver commented 11 months ago

@phil000 Do you have some more sample exceptions? And what is the lifetime of the application? What stands out a bit is the conn-sec: 519.67, but this could be random with a sample size of one. Do your connection durations look sane for your expectations, or does it look like we're reconnecting more often than the application is restarting?
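
One way to answer the reconnect question is to log the multiplexer's connection events; a sketch, assuming access to the shared ConnectionMultiplexer (muxer is a placeholder name):

    using System;
    using StackExchange.Redis;

    muxer.ConnectionFailed += (_, e) =>
        Console.WriteLine($"{DateTime.UtcNow:o} connection FAILED: {e.EndPoint} {e.FailureType} {e.Exception?.Message}");
    muxer.ConnectionRestored += (_, e) =>
        Console.WriteLine($"{DateTime.UtcNow:o} connection RESTORED: {e.EndPoint} {e.FailureType}");

Comparing these timestamps against the process start time shows whether a short conn-sec means the process restarted or the connection dropped and was rebuilt.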

phil000 commented 11 months ago

The latest one.

RedisCacheProvider: Failed to set item. Timeout performing SETEX (5000ms), 
next: GET 2:RoutingService::GeocodingResult::943 alexandra street, te awamutu, te awamutu 3800, new zealand+PROD, 
inst: 0, 
qu: 0, 
qs: 3, 
aw: False, 
bw: SpinningDown, 
rs: ReadAsync, 
ws: Idle, 
in: 0, in-pipe: 0, out-pipe: 0, 
last-in: 2, cur-in: 0, 
sync-ops: 1904, async-ops: 1, 
serverEndpoint: , 
conn-sec: 11856.85, 
aoc: 0, mc: 1/1/0, 
mgr: 10 of 10 available, 
clientName: , 
IOCP: (Busy=0,Free=1000,Min=1,Max=1000), 
WORKER: (Busy=1,Free=32766,Min=2,Max=32767), 
POOL: (Threads=4,QueuedItems=0,CompletedItems=830046), 
v: 2.6.111.64013

The most recent ones (across different services): conn-sec: 11856.85, 11851.53, 1553.37, 11826, 15446, 15441, 4776.17, 4770.91, 8122.49, 8117.49, 8116.62, 4494.85, 861.21, 856.21, 856.03, 850.88

Our application is typically up for multiple days without restarting, so all of these look too short: even the longest, 15446 seconds, is only about 4.3 hours. At least some reconnections must be happening.

phil000 commented 11 months ago

We can no longer investigate this issue as we have removed Redis from our architecture.