StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

RedisTimeoutException performing GET / SETEX #2552

Closed: phil000 closed this issue 11 months ago

phil000 commented 12 months ago

We've recently started receiving periodic timeout exceptions, up to about 20-30 a day.

There is no reason for it as far as we can see. Our app server load is low, and cache usage is also low.

We are using Azure Redis Standard C1. Server load is 0-40% and is mainly caused by replication. Memory usage is 100MB.

The reader state ("rs") is always ReadAsync.

Any idea what could be going wrong?

e.g.

RedisCacheProvider: Failed to get item. Timeout performing GET (5000ms), 
next: GET 2:ProductService::AllTagsMetadata+PROD, 
inst: 0,
qu: 0, 
qs: 1, 
aw: False, 
bw: Inactive, 
rs: ReadAsync, 
ws: Idle, 
in: 0, in-pipe: 0, out-pipe: 0, last-in: 866, cur-in: 0, 
sync-ops: 17, async-ops: 1, 
serverEndpoint: x.redis.cache.windows.net, 
conn-sec: 519.67, 
aoc: 0, 
mc: 1/1/0, 
mgr: 10 of 10 available, 
IOCP: (Busy=0,Free=1000,Min=100,Max=1000), 
WORKER: (Busy=1,Free=1022,Min=100,Max=1023), 
POOL: (Threads=4,QueuedItems=0,CompletedItems=1902,Timers=15), 
v: 2.6.122.38350

Call stack:

StackExchange.Redis.RedisTimeoutException:
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:2097)
   at StackExchange.Redis.RedisBase.ExecuteSync (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/RedisBase.cs:62)
   at StackExchange.Redis.RedisDatabase.StringGet (StackExchange.Redis, Version=2.0.0.0, Culture=neutral, PublicKeyToken=c219ff1ca8c2ce46: /_/src/StackExchange.Redis/RedisDatabase.cs:2995)
   at Lib.Core.Caching.Redis.RedisCacheProvider.GetItem (Lib, Version=2.81.7.0, Culture=neutral, PublicKeyToken=null: /_/Lib/Core/Caching/Redis/RedisCacheProvider.cs:132)
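
For context, the 5000ms in these messages is the multiplexer's configured timeout (5 seconds is the library default). A minimal sketch of raising it through ConfigurationOptions, assuming a single shared multiplexer (the endpoint string here is illustrative):

    using StackExchange.Redis;

    var options = ConfigurationOptions.Parse("x.redis.cache.windows.net:6380,ssl=true,abortConnect=false");
    options.SyncTimeout = 10000;    // sync calls such as the StringGet in the stack above
    options.AsyncTimeout = 10000;   // async counterparts
    options.ConnectTimeout = 15000;

    ConnectionMultiplexer muxer = await ConnectionMultiplexer.ConnectAsync(options);
    IDatabase db = muxer.GetDatabase();

A longer timeout only buys headroom while the underlying cause is found; it doesn't remove the latency.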

General

Common client-side causes to rule out:

Traffic burst and thread pool configuration (see the sketch after this list)

Large key value

High CPU on client hosts

Network bandwidth limitation on client hosts

Long running commands
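
For the thread pool item above, a common mitigation is raising the pool minimums so a burst of work doesn't stall behind thread injection. A sketch with illustrative numbers (the Min=100 values in the first report suggest something similar is already configured here):

    using System;
    using System.Threading;

    ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
    Console.WriteLine($"current minimums: worker={minWorker}, iocp={minIocp}");

    // Raise the floor so bursts are served without ramp-up delay;
    // 100/100 is an assumption, not a universal recommendation.
    ThreadPool.SetMinThreads(100, 100);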

CRidge commented 11 months ago

We are seeing a very similar situation: similar value sizes, at times a somewhat higher frequency but never more than 5 requests per second, commands like setclientaddr as the only SLOWLOG entries, and nothing out of the usual in the metrics. Our problems started about a week ago, and a week of work trying to figure it out hasn't yielded any results.

We're using Standard C2, and I can't see anything in our usage that C2 shouldn't handle comfortably. We're not doing anything different in how we use the cache compared to the last year or so, when everything worked as expected.

We are not seeing timeouts very often, but GETs/SETs frequently take seconds, while the exact same calls take a few milliseconds at other times. We see no pattern.

phil000 commented 11 months ago

I've raised a support issue with Microsoft, but as per usual they can't see anything wrong and 'the issue must be on the client'.

We measure the performance of all GETs and SETs and log aggregates every hour. Performance has become worse recently; the average GET is now 7-9ms.

For comparison, we have now (temporarily or permanently) removed Redis from our architecture and migrated to Cosmos DB configured more as a cache, and average GET performance is 2-3ms using point lookups for the same data.
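
As a point of reference, a minimal sketch of the kind of per-call timing described above (the key is taken from the earlier error, the 100ms threshold is arbitrary, and db is assumed to be the shared IDatabase):

    using System;
    using System.Diagnostics;
    using StackExchange.Redis;

    var sw = Stopwatch.StartNew();
    RedisValue value = db.StringGet("2:ProductService::AllTagsMetadata+PROD");
    sw.Stop();

    // Feed an hourly aggregate in real code; here we just flag slow calls.
    if (sw.ElapsedMilliseconds > 100)
        Console.WriteLine($"slow GET: {sw.ElapsedMilliseconds}ms");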

NickCraver commented 11 months ago

@phil000 Do you have some more sample exceptions? And what is the lifetime of the application? What stands out a bit is the conn-sec: 519.67, but this could be random with a sample size of one. Do your connection durations look sane for your expectations, or does it look like we're reconnecting more often than the application is restarting?
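
One way to answer the reconnect question is to log the multiplexer's connection events; a sketch, assuming access to the shared ConnectionMultiplexer (muxer is a placeholder name):

    using System;
    using StackExchange.Redis;

    muxer.ConnectionFailed += (_, e) =>
        Console.WriteLine($"{DateTime.UtcNow:o} connection FAILED: {e.EndPoint} {e.FailureType} {e.Exception?.Message}");
    muxer.ConnectionRestored += (_, e) =>
        Console.WriteLine($"{DateTime.UtcNow:o} connection RESTORED: {e.EndPoint} {e.FailureType}");

Comparing these timestamps against the process start time shows whether a short conn-sec means the process restarted or the connection dropped and was rebuilt.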

phil000 commented 11 months ago

The latest one.

RedisCacheProvider: Failed to set item. Timeout performing SETEX (5000ms), 
next: GET 2:RoutingService::GeocodingResult::943 alexandra street, te awamutu, te awamutu 3800, new zealand+PROD, 
inst: 0, 
qu: 0, 
qs: 3, 
aw: False, 
bw: SpinningDown, 
rs: ReadAsync, 
ws: Idle, 
in: 0, in-pipe: 0, out-pipe: 0, 
last-in: 2, cur-in: 0, 
sync-ops: 1904, async-ops: 1, 
serverEndpoint: , 
conn-sec: 11856.85, 
aoc: 0, mc: 1/1/0, 
mgr: 10 of 10 available, 
clientName: , 
IOCP: (Busy=0,Free=1000,Min=1,Max=1000), 
WORKER: (Busy=1,Free=32766,Min=2,Max=32767), 
POOL: (Threads=4,QueuedItems=0,CompletedItems=830046), 
v: 2.6.111.64013

The most recent ones (across different services): conn-sec: 11856.85, 11851.53, 1553.37, 11826, 15446, 15441, 4776.17, 4770.91, 8122.49, 8117.49, 8116.62, 4494.85, 861.21, 856.21, 856.03, 850.88

Our application is typically up for multiple days without restarting, so all of these look too short: even the longest, 15446 seconds, is only about 4.3 hours. At least some reconnections must be happening.

phil000 commented 11 months ago

We can no longer investigate this issue as we have removed Redis from our architecture.