Closed: kadikoy closed this issue 2 years ago.
Do you have enough threads?
ThreadPool.SetMinThreads(200, 200);
Have you tried without transactions? I think with Redis they will queue things, and then you keep adding more.
Run the key expire as a normal command, not in a transaction. If it solves your issue, you can think about how to improve it.
Worth rereading https://stackexchange.github.io/StackExchange.Redis/Transactions.html
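As a sketch of that suggestion (the connection string, key name, and TTL below are placeholders, not from this thread), issuing the expiry as an ordinary command instead of inside a transaction might look like:

```csharp
using System;
using StackExchange.Redis;

// Sketch only: assumes a reachable Redis instance; key and TTL are hypothetical.
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = muxer.GetDatabase();

db.StringSet("cache:item", "payload");
// KeyExpire fired as a plain command, outside any transaction:
db.KeyExpire("cache:item", TimeSpan.FromMinutes(30));
```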
@bklooste Thanks for the reply,
Yes, we have enough threads. You can see it in the exceptions that I provided:
IOCP: (Busy=4,Free=396,Min=200,Max=400), WORKER: (Busy=129,Free=271,Min=240,Max=400)
IOCP: (Busy=2,Free=398,Min=200,Max=400), WORKER: (Busy=182,Free=218,Min=240,Max=400)
We still need the transaction even if I call KeyExpire separately. The condition is the main reason we use the transaction:
tran.AddCondition(Condition.HashEqual(CacheKey, "version", version))
We have logic that collects the changes in memory and writes the data at the end of the process. If the data has already been changed in the meantime, we don't want to overwrite it. The "version" field helps us detect whether the data has changed.
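As a sketch of that optimistic-concurrency pattern (only the `Condition.HashEqual` line is from this thread; the key, field names other than "version", and TTL are illustrative assumptions), the version-checked write might look like:

```csharp
using System;
using StackExchange.Redis;

// Sketch of the version-checked write described above; names are hypothetical.
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = muxer.GetDatabase();
string CacheKey = "cache:item";
long version = 42; // the version we read when we originally loaded the data

ITransaction tran = db.CreateTransaction();
// Only commit if nobody else bumped the version in the meantime:
tran.AddCondition(Condition.HashEqual(CacheKey, "version", version));
tran.HashSetAsync(CacheKey, new[] {
    new HashEntry("data", "new-payload"),
    new HashEntry("version", version + 1),
});
tran.KeyExpireAsync(CacheKey, TimeSpan.FromMinutes(30));

bool committed = tran.Execute(); // false => the data changed underneath us
```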
What type of hardware are you running on? 129-182 threads is a lot to be synchronously running. Do you happen to have graphs of the thread pool in general? Generally speaking, 1,500ms is a low timeout for a cloud environment and you may be going over it. When you're comparing these spikes to AppFabric, what timeout is in play there, and how does it behave?
I ask these questions because often what we see is that these spikes aren't new to apps or workloads...recognition of the stalls and actually throwing a timeout is the new piece. And at 1500ms, that threshold is fairly low for most cloud conditions (keeping in mind it's a worst case scenario, not the normal time).
About the timeout, we also use 1500ms for AppFabric. AppFabric has timeout errors as well, but we are pleased with the performance regarding duration.
I know it's the worst case scenario, but think about the requests below 1500ms. They don't throw an exception but log very high durations, which means bad performance. With Redis, we have both problems under high load: some requests throw exceptions; others don't, but they are slow. We have ~28 cache get requests per page, by the way. So even if we set a higher timeout, we'll still see high durations (which is our main problem compared to AppFabric).
Our web servers are C5.xlarge EC2 instances on AWS.
I'm not sure what you mean by "graphs of the thread pool in general". Here are some related graphs:
Physical and logical threads of the w3wp processes (clear difference with AppFabric)
Request counters (we have ~28 cache get call per page request)
Statistics from all errors (520 in total)
Are you getting exceptions around that 1:40 mark? I'm curious what that rather huge thread pool increase is there, it indicates a lot of instantaneous load of some sort, which is usually a recipe for a pipe-clogging issue.
Yes, all exceptions were logged during the ~8 minutes we ran with Redis, between 13:40 and 13:49.
' StackExchange.Redis creates an extra connection for pub/sub which we don't use.
configOptions.CommandMap = CommandMap.Create(New HashSet(Of String) From {"SUBSCRIBE"}, False)
@NickCraver is this code correct? I assumed that I could block the "SUBSCRIBE" command because we don't use it. Do you suggest allowing it?
@kadikoy We still use subscriptions internally for notification of topology changes, such as which servers are primary - so note that just because you as a consumer do not use it does not mean it isn't used :)
@NickCraver ok then, I'll remove that line and allow the SUBSCRIBE command.
Regarding the sudden increase in the thread pool, it's still under investigation. We are trying to reproduce the problem by generating the same amount of load.
By the way, we started to see NoConnectionAvailable errors with FailureType=UnableToResolvePhysicalConnection after (probably) upgrading to 2.1.58. Then I realized we had always had these errors, but lately they were more visible. As people found out in #1120, the issue was introduced between 2.0.519 and 2.0.571. See the related comment: https://github.com/StackExchange/StackExchange.Redis/issues/1120#issuecomment-506011128 So I'm planning to downgrade to 2.0.519. Do you have any comment on this?
Another point of view for the thread pool problem:
We run the same application in several countries. We did a test in a smaller country. There were no errors and no bad performance, but we still decided to switch back to AppFabric to see how the thread pool behaves. It's very clear in this graph that we switched from Redis to AppFabric at 11:33:
Any idea why we have higher "Queue Length / sec" with Redis?
@kadikoy did you manage to overcome the issue?
@gkorland No, we are still trying to reproduce on a test environment. It's risky to let it run on prod.
Also, no luck with the NoConnectionAvailable errors after downgrading from the latest version to 2.0.519. I think we'll wait for the next version and keep trying to reproduce the issue in the meantime.
@kadikoy Hey we're facing something along these lines after upgrading to a 2.1.x. Are you by chance seeing lock contention here as well?
Can anyone hitting this please try the 2.2.50 release? @TimLovellSmith has some changes to the backlog queueing to reduce contention that should help scenarios here. It's not something we can readily reproduce, so we'd love to see if anyone is still hitting similar load spikes (or improved ones).
We're using 2.1.58. Our perf runs do about 2K operations per second sustained and we don't see this issue (we have gone higher locally). We moved off IIS as a host, though (generic host and Kestrel); it interferes too much with the threading model (at the very least there is the ThreadContext). Our remaining IIS services have lots of thread/timeout issues like this on Couchbase as well. It is possible AppFabric has some IIS-based improvements. Is it possible to run your test from a console app to remove IIS? At least then you'd know whether there are IIS-related things AppFabric is using which you could adjust.
Also, is the server running Windows or Linux?
Not sure we can say much more here without more info...or even with. The main cause is an overloaded app with a ton of threads, likely stalled on a common area (a guess based on experience). There's not much the victim can do here (without contributing to the overall problem), but I will say: net6.0 improves this for those able to upgrade. The real fix, though, is to look at your threads (in a dump) and see what they're all doing, then likely move things to async or remove whatever common blocking/contention is in play.
Going to close this out to tidy up given the above.
High CPU: from the regular 1%-2%, after the update to the newest version it goes to 58%.
Current version, before the update: Redis server v=7.2.5 sha=00000000:0 malloc=jemalloc-5.3.0 bits=64 build=d284576ab9ca3cc5 Ubuntu 22 LTS.
Had to roll back.
Story and setup
We are trying to replace AppFabric with Redis. We have a switch with which we can enable Redis; as soon as we see problems, we switch back to AppFabric.
Servers of the application and the Redis cluster are on the same VPC on AWS. We have 4 shards on the cluster and each shard has one primary and one replica node. We have a connection pool for Redis on every application server with capacity of 10.
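The thread does not show the actual pool implementation, so purely as an illustration, a fixed-size pool of that shape (capacity 10, handing out multiplexers round-robin) could be sketched like this; every name here is hypothetical:

```csharp
using System;
using System.Linq;
using System.Threading;
using StackExchange.Redis;

// Hypothetical sketch of a fixed-size ConnectionMultiplexer pool.
public sealed class RedisPool
{
    private readonly ConnectionMultiplexer[] _connections;
    private int _next = -1;

    public RedisPool(string configuration, int capacity = 10)
    {
        // Eagerly open `capacity` multiplexers to the same endpoint(s).
        _connections = Enumerable.Range(0, capacity)
            .Select(_ => ConnectionMultiplexer.Connect(configuration))
            .ToArray();
    }

    public IDatabase GetDatabase()
    {
        // Round-robin across the pooled multiplexers.
        int i = (int)((uint)Interlocked.Increment(ref _next) % _connections.Length);
        return _connections[i].GetDatabase();
    }
}
```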
When the load is low there is no problem; it's even faster than AppFabric. But we see high durations during peak times, so we believe there's a load level at which Redis starts giving problems. It also affects the CPU, by the way.
We couldn't find any problems on the Redis cluster (no slowlog etc, also see graphs at the bottom), so we suspect the problem is on the client side.
Redis configuration
We have two types of cache objects:
Without version: We use simple StringSet and StringGet for these.
With version:
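For the first ("without version") type, the calls are presumably of this shape (key, value, and TTL below are placeholder assumptions, not taken from the issue):

```csharp
using System;
using StackExchange.Redis;

// Sketch of the "without version" path: plain StringSet/StringGet.
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = muxer.GetDatabase();

// Write with an expiry, read back by key:
db.StringSet("cache:simple-item", "serialized-payload", TimeSpan.FromMinutes(30));
RedisValue cached = db.StringGet("cache:simple-item");
```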
Errors
Last time Redis was enabled for 8 minutes. We had two types of timeout errors:
This is only one of the errors. Please tell me which parameter you want to see (mgr, in, rs, busy worker threads, etc.); I'm ready to provide all 506 values to make it easier to analyze.
Cloudwatch graphs (timespan = 1 minute)