krasin-ga closed this issue 3 months ago.
Now this is very interesting (pinging @NickCraver @philon-msft ): it looks like the intent is for ProcessBacklogAsync to execute on a dedicated thread, but the fact that we have an await in there means that the moment anything actually has to be awaited, we immediately jump to the thread pool, and we also have a big fat WaitOne(5000). This looks... unintentional. And bad, especially (as you say) for large clusters.
Great report @krasin-ga ! Will investigate further.
Yep, fully agree, we changed how this works several times back at the beginning but the thread pool hopping would happen at several points now. Needs love and potentially capping somehow in thread creation lest we create hundreds for hundreds of endpoints during a storm.
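The "capping" idea mentioned above could look roughly like the following. This is a hypothetical Python sketch (the library itself is C#); `MAX_BACKLOG_THREADS`, `try_start_backlog_thread`, and all other names are illustrative, not StackExchange.Redis code:

```python
import threading

# Hypothetical sketch: bound how many dedicated backlog threads can exist at
# once, so a storm across hundreds of endpoints cannot create hundreds of
# threads. All names and numbers here are assumptions for illustration.
MAX_BACKLOG_THREADS = 8
thread_slots = threading.BoundedSemaphore(MAX_BACKLOG_THREADS)

def try_start_backlog_thread(drain) -> bool:
    # Only start a dedicated thread if a slot is free; otherwise the caller
    # would fall back to some other strategy (e.g. a shared worker).
    if not thread_slots.acquire(blocking=False):
        return False

    def worker():
        try:
            drain()
        finally:
            thread_slots.release()  # free the slot once the backlog is idle

    threading.Thread(target=worker, daemon=True).start()
    return True

# Simulate a storm: 100 endpoints all wanting a drain thread at once.
gate = threading.Event()
results = [try_start_backlog_thread(gate.wait) for _ in range(100)]
started = sum(results)  # only MAX_BACKLOG_THREADS threads actually start
gate.set()              # let the workers finish
```

The rest of the callers would fail fast and fall back, rather than piling up threads during a storm.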
Thanks for the quick reply!
Please allow me to add my two cents.
I understand that this logic was added to circumvent issues related to using synchronous operations and thread pool starvation in general.
However, it seems that this is not a problem the Redis client library should solve. It would be more rational to focus on minimizing overhead within the library itself. At first glance, abandoning synchronous locks in favor of wait-free alternatives and using System.Threading.Channels for the backlog could simplify the code and improve performance in high-load scenarios. Even better would be to extract the backlog and its processing logic out of PhysicalBridge into its own interface, making it configurable for different scenarios.
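The channel-based shape being suggested can be sketched in Python, with asyncio.Queue standing in for System.Threading.Channels; this is only a rough analogue of the idea, and every name here is illustrative:

```python
import asyncio

# Rough analogue of the suggestion: replace lock + dedicated thread with a
# channel (asyncio.Queue stands in for System.Threading.Channels) and a
# single consumer task per connection. Names are illustrative only.
async def backlog_consumer(q: asyncio.Queue, write) -> None:
    while True:
        msg = await q.get()       # no blocking wait, no dedicated thread
        if msg is None:           # sentinel: connection closing
            return
        await write(msg)          # e.g. serialize to the pipe/socket

async def main() -> list:
    sent = []

    async def fake_write(msg):
        sent.append(msg)

    q: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(backlog_consumer(q, fake_write))

    for i in range(5):
        q.put_nowait(i)           # producers never block the calling thread
    q.put_nowait(None)
    await consumer
    return sent

result = asyncio.run(main())
```

The producer side is wait-free from the caller's point of view; backpressure, if wanted, would come from bounding the channel.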
We're tweaking how this works in #2667 to reflect the current dedicated thread behavior. With respect to the follow-up comments: I'm just going to assume you've only taken a glance because the rest of it's pretty uninformed with respect to how things in the library behave.
I'm going to write off the other comments as not having taken any deep view of things in the library or dealt with thread pool starvation in hundreds of apps as we have and continue to help others with and leave it at that.
As I understand the current architecture:
From the top view, PhysicalBridge is responsible for pushing messages to the backlog, from which they will be read by a dedicated thread and serialized to a pipe (from which the serialized messages will be pushed to the socket by a PipeScheduler) or completed by timeout. Dedicated threads are only used to deal with situations where the runtime is experiencing thread pool starvation. There is one PhysicalBridge instance per connection. As of now there is a way to configure the PipeScheduler, but there is no way to configure backlog processing behavior.
My points:
1) Having a thread per backlog/connection is far from optimal, especially when the Redis cluster is quite large. More threads mean more overhead from context switches, scheduling, etc.
2) If my application is experiencing thread pool starvation due to synchronous I/O or synchronous locking, this will only hide the real source of the issue and not help much (especially if I use SocketManager.ThreadPool), and my application will still perform very poorly.
3) The .NET thread pool has gotten significantly better at detecting starvation and will quickly create threads by itself.
4) It's an absolutely good idea to help people diagnose thread pool starvation, as in the message of RedisTimeoutException, or to prevent such issues with the help of some Roslyn analyzer, or to find possible sources of starvation with Ben.BlockingDetector, etc.
5) It will not make things worse if backlog behavior can be made configurable.
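Point 2 is easy to demonstrate in miniature. The sketch below (Python, purely illustrative) uses a fixed 2-thread pool to stand in for a starved thread pool; unlike the .NET pool from point 3, ThreadPoolExecutor never injects extra threads, which makes the stall easy to see:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustration: if application code blocks pool threads (sync-over-async,
# sync I/O), everything queued behind it stalls no matter what the Redis
# client does. The pool size and sleep duration are arbitrary.
pool = ThreadPoolExecutor(max_workers=2)

def blocking_work():
    time.sleep(0.5)   # e.g. a sync-over-async wait somewhere in app code

start = time.monotonic()
blockers = [pool.submit(blocking_work) for _ in range(2)]
quick = pool.submit(lambda: "pong")   # a cheap "Redis reply" handler
result = quick.result()
latency = time.monotonic() - start
# The cheap task waited ~0.5 s for a free thread despite doing no work:
# that queueing delay is what starvation looks like to callers.
pool.shutdown()
```

The .NET pool would eventually detect this and grow, which is the point being made in 3: the runtime already has a starvation response of its own.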
I wrote a quick benchmark that emulates backlog behavior and processes 10,000,000 items:
BenchmarkDotNet v0.13.12, Windows 11
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
[Host] : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
| Method | NumberOfConsumers | NumberOfProducers | Mean | Error | StdDev | Median |
|----------------|------------------ |------------------ |------------:|----------:|----------:|------------:|
| Tasks+Channels | 1 | 1 | 2,515.8 ms | 48.65 ms | 47.78 ms | 2,500.9 ms |
| Threads+CQ | 1 | 1 | 2,518.7 ms | 44.62 ms | 41.74 ms | 2,517.4 ms |
| Tasks+Channels | 1 | 8 | 2,774.9 ms | 22.63 ms | 21.17 ms | 2,770.6 ms |
| Threads+CQ | 1 | 8 | 3,265.1 ms | 88.79 ms | 261.79 ms | 3,432.6 ms |
| Tasks+Channels | 1 | 32 | 2,831.8 ms | 55.89 ms | 87.02 ms | 2,825.2 ms |
| Threads+CQ | 1 | 32 | 3,791.6 ms | 72.86 ms | 77.96 ms | 3,770.4 ms |
| Tasks+Channels | 1 | 128 | 3,117.2 ms | 59.07 ms | 68.03 ms | 3,127.5 ms |
| Threads+CQ | 1 | 128 | 3,986.0 ms | 77.86 ms | 98.47 ms | 3,987.2 ms |
| Tasks+Channels | 8 | 1 | 5,156.9 ms | 41.39 ms | 38.72 ms | 5,152.8 ms |
| Threads+CQ | 8 | 1 | 19,966.5 ms | 86.58 ms | 80.99 ms | 19,966.4 ms |
| Tasks+Channels | 8 | 8 | 864.5 ms | 16.60 ms | 18.45 ms | 859.2 ms |
| Threads+CQ | 8 | 8 | 1,586.8 ms | 31.37 ms | 55.76 ms | 1,601.4 ms |
| Tasks+Channels | 8 | 32 | 1,117.6 ms | 16.38 ms | 15.32 ms | 1,116.2 ms |
| Threads+CQ | 8 | 32 | 912.3 ms | 18.10 ms | 47.67 ms | 913.7 ms |
| Tasks+Channels | 8 | 128 | 1,206.8 ms | 23.42 ms | 30.45 ms | 1,198.5 ms |
| Threads+CQ | 8 | 128 | 983.0 ms | 18.51 ms | 19.80 ms | 988.7 ms |
| Tasks+Channels | 32 | 1 | 5,191.6 ms | 31.90 ms | 28.28 ms | 5,184.7 ms |
| Threads+CQ | 32 | 1 | 19,619.6 ms | 162.84 ms | 144.36 ms | 19,644.6 ms |
| Tasks+Channels | 32 | 8 | 977.1 ms | 8.87 ms | 8.29 ms | 978.7 ms |
| Threads+CQ | 32 | 8 | 1,825.8 ms | 35.98 ms | 38.50 ms | 1,838.9 ms |
| Tasks+Channels | 32 | 32 | 652.4 ms | 5.04 ms | 4.47 ms | 654.0 ms |
| Threads+CQ | 32 | 32 | 835.5 ms | 16.55 ms | 45.58 ms | 839.3 ms |
| Tasks+Channels | 32 | 128 | 661.7 ms | 13.03 ms | 19.10 ms | 659.8 ms |
| Threads+CQ | 32 | 128 | 839.7 ms | 16.69 ms | 36.29 ms | 844.8 ms |
| Tasks+Channels | 128 | 1 | 4,756.7 ms | 21.07 ms | 17.59 ms | 4,760.6 ms |
| Threads+CQ | 128 | 1 | 19,599.3 ms | 262.33 ms | 245.39 ms | 19,649.7 ms |
| Tasks+Channels | 128 | 8 | 995.9 ms | 9.45 ms | 8.84 ms | 997.0 ms |
| Threads+CQ | 128 | 8 | 2,869.4 ms | 56.35 ms | 100.17 ms | 2,864.1 ms |
| Tasks+Channels | 128 | 32 | 581.5 ms | 11.47 ms | 20.09 ms | 583.2 ms |
| Threads+CQ | 128 | 32 | 867.5 ms | 17.07 ms | 27.56 ms | 869.9 ms |
| Tasks+Channels | 128 | 128 | 618.2 ms | 6.56 ms | 6.14 ms | 618.5 ms |
| Threads+CQ | 128 | 128 | 912.5 ms | 19.55 ms | 57.66 ms | 921.6 ms |
Both methods could be optimized further, but I hope they are still close enough for illustrative purposes.
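For readers who want the shape of the two strategies being compared, here is a much-reduced Python sketch (the original benchmark was C#/BenchmarkDotNet moving 10,000,000 items; this moves 10,000 just to show the structure, not to reproduce the numbers, and "Threads+CQ" is approximated with a blocking queue):

```python
import asyncio
import queue
import threading

N = 10_000  # item count, reduced from the benchmark's 10M

def threads_plus_queue(consumers: int) -> int:
    # Analogue of "Threads+CQ": dedicated threads draining a shared queue.
    q: "queue.Queue" = queue.Queue()
    total = 0
    lock = threading.Lock()

    def consume():
        nonlocal total
        while True:
            item = q.get()
            if item is None:       # one sentinel per consumer thread
                return
            with lock:
                total += item

    threads = [threading.Thread(target=consume) for _ in range(consumers)]
    for t in threads:
        t.start()
    for _ in range(N):
        q.put(1)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return total

async def tasks_plus_channel(consumers: int) -> int:
    # Analogue of "Tasks+Channels": consumer tasks awaiting a channel.
    q: asyncio.Queue = asyncio.Queue()
    total = 0

    async def consume():
        nonlocal total
        while True:
            item = await q.get()
            if item is None:
                return
            total += item

    tasks = [asyncio.create_task(consume()) for _ in range(consumers)]
    for _ in range(N):
        q.put_nowait(1)
    for _ in range(consumers):
        q.put_nowait(None)
    await asyncio.gather(*tasks)
    return total

threads_total = threads_plus_queue(4)
tasks_total = asyncio.run(tasks_plus_channel(4))
```

Both variants move the same items; the difference the table measures is scheduling cost, which is exactly where thread-per-consumer loses at high consumer counts.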
PS: I'm not a native English speaker and don't use English for communication very often, so my selection of words may not be very accurate. Please forgive any unintended offense.
@krasin-ga I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not. The backlog is not normally written to, but it must be drained synchronously to help prevent spiral-of-death scenarios (because it's quite likely that writes or response processing cannot complete due to starvation). The normal path for async commands is fully async and uses Pipes.
The backlog is only used as a buffer when we can't write to the connection, e.g. we're disconnected. And as per the docs, you can turn this off if you want to fail immediately, indicating no connections are available. You're stating this can be optimized: hey sure, I'm all ears, but it needs to be done in a way that the backlog has a chance of draining when thread pool starvation is hit (which is a common scenario for applications losing connections unexpectedly, often having some sync-over-async elsewhere we don't control). That's why we create a thread to do this work and listen for incoming work for 5 more seconds, instead of repeatedly spinning up a thread while we're recovering from overwhelming the outbound socket buffer.
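The drain-then-linger behavior described here can be sketched as follows. This is an illustrative Python analogue, not library code; `LINGER` plays the role of the 5-second WaitOne and is shortened so the example runs quickly:

```python
import queue
import threading

# Sketch: spin up one dedicated thread to drain the backlog, and have it
# linger waiting for more items before exiting, instead of creating a
# fresh thread per burst. All names and timings are illustrative.
LINGER = 0.2  # seconds (the real library waits 5.0)

def start_drain_thread(backlog: "queue.Queue", handle) -> threading.Thread:
    def drain():
        while True:
            try:
                item = backlog.get(timeout=LINGER)  # linger for more work
            except queue.Empty:
                return                              # idle long enough: exit
            handle(item)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t

handled = []
backlog: "queue.Queue" = queue.Queue()
for i in range(3):
    backlog.put(i)
t = start_drain_thread(backlog, handled.append)
t.join()  # returns once the backlog has been empty for LINGER seconds
```

Because the thread does not block the pool, the drain keeps working even when every pool thread is stuck, which is the whole point of the design.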
As for the thread pool growing on its own: that is often a misconception once you get into the details.
I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not.
Yep, I certainly misunderstood that part. Thanks for the clarification!
Upd: But not by that much: if we can't take the mutex, we still end up enqueuing to the backlog, and that will happen quite often under load.
@krasin-ga Thanks again for the report! Version 2.7.33 with this fixed up is available on NuGet now :)
Nice, thanks for the quick fix!
@krasin-ga @NickCraver Do you guys know in which version this issue started happening?
@giuliano-macedo I'd assume this was a potential issue when consuming a lot of endpoints since 2.5.x, that's my best guess here.
@NickCraver Great! And thanks for the fix!
Here is the code that starts a dedicated backlog processing thread and provides some motivation in comments on the decision not to use the thread pool: https://github.com/StackExchange/StackExchange.Redis/blob/441f89afeff6d9a55ecf6e413ed5bd29b5aec19d/src/StackExchange.Redis/PhysicalBridge.cs#L901-L912
But after asynchronously awaiting ProcessBridgeBacklogAsync, the rest of the code (including subsequent loop iterations) will not only continue on a thread pool thread but also block it on a sync lock: https://github.com/StackExchange/StackExchange.Redis/blob/441f89afeff6d9a55ecf6e413ed5bd29b5aec19d/src/StackExchange.Redis/PhysicalBridge.cs#L988-L1012
In our case, we have several relatively large Redis clusters with tens to hundreds of nodes, and sometimes we experience situations where the majority of thread pool threads are blocked inside instances of PhysicalBridge:
So we get the best of both worlds: hundreds of expensive worker thread creations and thread pool starvation 🙂