StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

High CPU and high get duration with timeout errors (v 2.1.58) #1545

Closed: kadikoy closed this issue 2 years ago

kadikoy commented 4 years ago

Story and setup

We are trying to replace AppFabric with Redis. We have a switch that lets us enable Redis; as soon as we see problems, we switch back to AppFabric.

The application servers and the Redis cluster are in the same VPC on AWS. We have 4 shards on the cluster and each shard has one primary and one replica node. We have a connection pool for Redis on every application server with a capacity of 10.

When the load is low there is no problem; it's even faster than AppFabric. But we see high durations during peak times, so we believe there is a load level at which Redis starts giving problems. It also affects the CPU, by the way.

We couldn't find any problems on the Redis cluster (nothing in the slowlog etc.; also see the graphs at the bottom), so we suspect the problem is on the client side.

Redis configuration

Dim configOptions As New ConfigurationOptions() With {
    .AbortOnConnectFail = False,
    .AllowAdmin = True,
    .ConnectTimeout = 3000,
    .DefaultDatabase = 0,
    .Password = Password,
    .Ssl = True,
    .SyncTimeout = 1500,
    .SocketManager = New SocketManager("CustomSocketManager", 40)
}

configOptions.EndPoints.Add(Host, Port)

' StackExchange.Redis creates an extra connection for pub/sub which we don't use.
configOptions.CommandMap = CommandMap.Create(New HashSet(Of String) From {"SUBSCRIBE"}, False)
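
(For context, each pooled connection is then created from these options roughly like this; a simplified sketch, not the exact pool code:)

    Dim muxer As ConnectionMultiplexer = ConnectionMultiplexer.Connect(configOptions)
    Dim redisDatabase As IDatabase = muxer.GetDatabase()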

We have two operations on our cache objects:

PUT
    Dim tran As ITransaction = redisDatabase.CreateTransaction()
    tran.AddCondition(Condition.HashEqual(CacheKey, "version", version))
    Dim hashEntry As HashEntry() = {
        New HashEntry("version", newVersion),
        New HashEntry("value", _redisSerializer.Serialize(CacheItem))
    }
    tran.HashSetAsync(CacheKey, hashEntry)
    tran.KeyExpireAsync(CacheKey, Expiration)
    Dim result As Boolean = tran.Execute(CommandFlags.DemandMaster)

GET
    Dim hashEntries As HashEntry() = redisDatabase.HashGetAll(CacheKey, CommandFlags.PreferReplica)
    Dim cacheItem As HashEntry = hashEntries.FirstOrDefault(Function(m) m.Name.ToString() = "value")
    CacheItem = _redisSerializer.Deserialize(Of T)(cacheItem.Value)

Errors

The last time, Redis was enabled for 8 minutes. We had two types of timeout errors:

The timeout was reached before the message could be written to the output buffer, and it was not sent, command=GET, timeout: 1500, outbound: 12KiB, inbound: 1KiB, next: GET xxx, inst: 1, qu: 0, qs: 1, aw: True, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: xxx.cache.amazonaws.com:6379, mc: 1/1/0, mgr: 37 of 40 available, clientName: xxx, PerfCounterHelperkeyHashSlot: 12414, IOCP: (Busy=4,Free=396,Min=200,Max=400), WORKER: (Busy=129,Free=271,Min=240,Max=400), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
InnerException.StackTrace ( InnerExceptionName=StackExchange.Redis.RedisTimeoutException ):
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags)
   at StackExchange.Redis.KeyspaceIsolation.DatabaseWrapper.StringGet(RedisKey key, CommandFlags flags)

This is only one of the errors. Please tell me which parameters you want to see (mgr, in, rs, busy worker threads, etc.); I'm ready to provide all 506 values to make analysis easier.

Timeout performing GET (1500ms), inst: 7, qu: 0, qs: 0, aw: False, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: xxx.cache.amazonaws.com:6379, mc: 1/1/0, mgr: 39 of 40 available, clientName: xxx, PerfCounterHelperkeyHashSlot: 264, IOCP: (Busy=2,Free=398,Min=200,Max=400), WORKER: (Busy=182,Free=218,Min=240,Max=400), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
InnerException.StackTrace ( InnerExceptionName=StackExchange.Redis.RedisTimeoutException ):
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags)
   at StackExchange.Redis.KeyspaceIsolation.DatabaseWrapper.StringGet(RedisKey key, CommandFlags flags)

Cloudwatch graphs (timespan = 1 minute)

[images: CloudWatch Host Metrics, 28 July; CloudWatch Redis Metrics, 28 July]

bklooste commented 4 years ago

Do you have enough threads?

        // Raise the thread pool minimums so bursts don't have to wait for the pool to grow.
        ThreadPool.SetMinThreads(200, 200);

Have you tried without transactions? I think with Redis they queue things up, and then you keep adding more.

Run the key expire as a normal command, not in a transaction (see the sketch below). If that solves your issue, you can then think about how to do it better.
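
Something like this, reusing the variables from your PUT snippet (a rough sketch; note it drops the version check, so it's only for isolating the performance problem):

    redisDatabase.HashSet(CacheKey, hashEntry)
    redisDatabase.KeyExpire(CacheKey, Expiration)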

Worth rereading https://stackexchange.github.io/StackExchange.Redis/Transactions.html

kadikoy commented 4 years ago

@bklooste Thanks for the reply,

Yes, we have enough threads. You can see it in the exceptions that I provided:

IOCP: (Busy=4,Free=396,Min=200,Max=400), WORKER: (Busy=129,Free=271,Min=240,Max=400)
IOCP: (Busy=2,Free=398,Min=200,Max=400), WORKER: (Busy=182,Free=218,Min=240,Max=400)

We still need the transaction even if I call KeyExpire separately. The condition is the main reason we use the transaction:

tran.AddCondition(Condition.HashEqual(CacheKey, "version", version))

We have logic that collects changes in memory and writes the data at the end of the process. If the data has already been changed in the meantime, we don't want to overwrite it. The "version" field helps us understand whether the data has changed.
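
To make the flow concrete, the whole round trip looks roughly like this (a simplified sketch; newVersion, serializedValue and Expiration are illustrative names, not our exact code):

    ' 1) GET: read the current value together with its version (PreferReplica, as above)
    Dim entries As HashEntry() = redisDatabase.HashGetAll(CacheKey, CommandFlags.PreferReplica)
    Dim version As RedisValue = entries.FirstOrDefault(Function(m) m.Name.ToString() = "version").Value

    ' ... changes are collected in memory during the process ...

    ' 2) PUT: only commit if nobody bumped the version in the meantime
    Dim tran As ITransaction = redisDatabase.CreateTransaction()
    tran.AddCondition(Condition.HashEqual(CacheKey, "version", version))
    tran.HashSetAsync(CacheKey, {New HashEntry("version", newVersion), New HashEntry("value", serializedValue)})
    tran.KeyExpireAsync(CacheKey, Expiration)
    Dim committed As Boolean = tran.Execute(CommandFlags.DemandMaster) ' False means the data changed underneath us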

NickCraver commented 4 years ago

What type of hardware are you running on? 129-182 threads is a lot to be running synchronously. Do you happen to have graphs of the thread pool in general? Generally speaking, 1,500ms is a low timeout for a cloud environment and you may be going over it. When you're comparing these spikes to AppFabric, what timeout is in play there and how does it behave?

I ask these questions because often what we see is that these spikes aren't new to apps or workloads; recognizing the stalls and actually throwing a timeout is the new piece. And at 1500ms, that threshold is fairly low for most cloud conditions (keeping in mind it's a worst-case scenario, not the normal time).
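
If you want to rule the threshold itself out, relaxing it on the posted options is one line each (5000ms here is only an example, not a recommendation):

    configOptions.SyncTimeout = 5000    ' synchronous calls such as the HashGetAll above
    configOptions.AsyncTimeout = 5000   ' asynchronous calls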

kadikoy commented 4 years ago

Regarding the timeout, we also use 1500ms for AppFabric. AppFabric has timeout errors as well, but we are pleased with its performance in terms of duration.

I know it's the worst-case scenario, but think about the requests below 1500ms: they don't throw an exception, but they log very high durations, which means bad performance. With Redis we have both problems under high load: some requests throw exceptions, and some don't but are slow. We have ~28 cache GET requests per page, by the way. So even if we set a higher timeout, we'll still see high durations (which is our main problem compared to AppFabric).

Our web servers are C5.xlarge EC2 instances on AWS.

I'm not sure what you mean by "graphs of the thread pool in general". Here are some related graphs:

Physical and logical threads of the w3wp processes (clear difference with AppFabric): [image: # of threads]

Request counters (we have ~28 cache GET calls per page request): [image: requests]

Statistics from all errors (520 in total): [images: BusyWorker, qs, in]

NickCraver commented 4 years ago

Are you getting exceptions around that 1:40 mark? I'm curious about that rather huge thread pool increase there; it indicates a lot of instantaneous load of some sort, which is usually a recipe for a pipe-clogging issue.

kadikoy commented 4 years ago

Yes, all exceptions were logged in the ~8 minutes that we ran with Redis, between 13:40 and 13:49.

kadikoy commented 4 years ago
' StackExchange.Redis creates an extra connection for pub/sub which we don't use.
configOptions.CommandMap = CommandMap.Create(New HashSet(Of String) From {"SUBSCRIBE"}, False)

@NickCraver is this code correct? I assumed that I could block the "SUBSCRIBE" command because we don't use it. Do you suggest allowing it?

NickCraver commented 4 years ago

@kadikoy We still use subscriptions internally for notification of topology changes, such as which servers are primary, so note that just because you as a consumer don't use it doesn't mean it isn't used :)

kadikoy commented 4 years ago

@NickCraver ok then, I'll remove that line and allow the SUBSCRIBE command.

Regarding the sudden increase in the thread pool, it's still under investigation. We are trying to reproduce the problem with the same amount of load.

By the way, we started to see NoConnectionAvailable errors with FailureType=UnableToResolvePhysicalConnection after (probably) upgrading to 2.1.58. Then I realized we had always had these errors, but lately they were more visible. As people found out in #1120, the issue was introduced between 2.0.519 and 2.0.571; see the related comment: https://github.com/StackExchange/StackExchange.Redis/issues/1120#issuecomment-506011128 So I'm planning to downgrade to 2.0.519. Do you have any comment on this?

kadikoy commented 4 years ago

Another point of view on the thread pool problem:

We run the same application in several countries. We did a test in a smaller country; there were no errors and no bad performance. But we still decided to switch back to AppFabric to see how the thread pool behaves. It's very clear in this graph that we switched from Redis to AppFabric at 11:33:

[image: thread pool counters around the 11:33 switch]

Any idea why we have a higher "Queue Length / sec" with Redis?

gkorland commented 3 years ago

@kadikoy did you manage to overcome the issue?

kadikoy commented 3 years ago

> @kadikoy did you manage to overcome the issue?

@gkorland No, we are still trying to reproduce it in a test environment. It's too risky to let it run in prod.

Also, no luck with the NoConnectionAvailable errors after downgrading from the latest version to 2.0.519. I think we'll wait for the next release and keep trying to reproduce the issue in the meantime.

tjstinso commented 3 years ago

@kadikoy Hey, we're facing something along these lines after upgrading to 2.1.x. Are you by chance seeing lock contention here as well?

NickCraver commented 3 years ago

Can anyone hitting this please try the 2.2.50 release? @TimLovellSmith has some changes to the backlog queueing to reduce contention that should help scenarios here. It's not something we can readily reproduce, so we'd love to see if anyone is still hitting similar load spikes (or improved ones).

bklooste commented 3 years ago

We're using 2.1.58; our perf runs do about 2K operations per second sustained and we don't see this issue (we have gone higher locally). We moved off IIS as a host, though (generic host and Kestrel); it interferes too much with the threading model (at the very least there is the ThreadContext). Our remaining IIS services have lots of thread/timeout issues like this with Couchbase as well. It's possible AppFabric has some IIS-based optimizations. Is it possible to run your test from a console app to take IIS out of the picture? At least then you'd know whether there are IIS-related things AppFabric relies on that you could adjust.
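
If you do try the console route, something this small would do (a sketch with placeholders throughout, not your actual setup):

    Imports System.Diagnostics
    Imports StackExchange.Redis

    Module RedisConsoleTest
        Sub Main()
            ' connection string values are placeholders; use your cluster endpoint
            Dim muxer = ConnectionMultiplexer.Connect("your-redis-host:6379,ssl=true,password=...")
            Dim db = muxer.GetDatabase()
            Dim sw As New Stopwatch()
            For i = 1 To 10000
                sw.Restart()
                db.HashGetAll("some-cache-key", CommandFlags.PreferReplica)
                sw.Stop()
                ' log anything noticeably slower than a typical call
                If sw.ElapsedMilliseconds > 100 Then
                    Console.WriteLine($"Slow call #{i}: {sw.ElapsedMilliseconds} ms")
                End If
            Next
        End Sub
    End Module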

Also, is the server running Windows or Linux?

NickCraver commented 2 years ago

Not sure we can say much more here without more info... or even with. The main cause is an overloaded app with a ton of threads, likely stalled on a common area (a guess based on experience). There's not much the victim can do here (without contributing to the overall problem), but I will say: net6.0 improves this for those able to upgrade. The real fix is to look at your threads (in a dump) and see what they're all doing, then likely move things to async or remove whatever common blocking/contention is in play.
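
For reference, the sync GET from the original report in async form would look roughly like this (a sketch only; the whole call chain above it has to become async for this to help):

    ' assumes the same redisDatabase / _redisSerializer fields from the original snippets
    Private Async Function GetCacheItemAsync(Of T)(cacheKey As String) As Task(Of T)
        Dim entries As HashEntry() = Await redisDatabase.HashGetAllAsync(cacheKey, CommandFlags.PreferReplica)
        Dim valueEntry As HashEntry = entries.FirstOrDefault(Function(m) m.Name.ToString() = "value")
        Return _redisSerializer.Deserialize(Of T)(valueEntry.Value)
    End Function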

Going to close this out to tidy up given the above.

Piotr-Fr commented 2 months ago

High CPU: it regularly sits at 1-2%, but after updating to the newest version it goes to 58%.

Current version, before the update: Redis server v=7.2.5 sha=00000000:0 malloc=jemalloc-5.3.0 bits=64 build=d284576ab9ca3cc5 Ubuntu 22 LTS.

Had to roll back.