krasin-ga closed this issue 3 months ago.
Now this is very interesting (pinging @NickCraver @philon-msft ): it looks like the intent is for ProcessBacklogAsync to execute on a dedicated thread, but the fact that we have an await in there means that the moment anything actually has to be awaited, we immediately jump to the thread pool, and we also have a big fat WaitOne(5000). This looks... unintentional. And bad, especially (as you say) for large clusters.
Great report @krasin-ga ! Will investigate further.
Yep, fully agree, we changed how this works several times back at the beginning but the thread pool hopping would happen at several points now. Needs love and potentially capping somehow in thread creation lest we create hundreds for hundreds of endpoints during a storm.
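The "capping" idea mentioned above could look roughly like the following. This is a hypothetical Python sketch (the library itself is C#); `MAX_BACKLOG_THREADS`, `try_start_backlog_thread`, and all other names are illustrative, not StackExchange.Redis code:

```python
import threading

# Hypothetical sketch: bound how many dedicated backlog threads can exist at
# once, so a storm across hundreds of endpoints cannot create hundreds of
# threads. All names and numbers here are assumptions for illustration.
MAX_BACKLOG_THREADS = 8
thread_slots = threading.BoundedSemaphore(MAX_BACKLOG_THREADS)

def try_start_backlog_thread(drain) -> bool:
    # Only start a dedicated thread if a slot is free; otherwise the caller
    # would fall back to some other strategy (e.g. a shared worker).
    if not thread_slots.acquire(blocking=False):
        return False

    def worker():
        try:
            drain()
        finally:
            thread_slots.release()  # free the slot once the backlog is idle

    threading.Thread(target=worker, daemon=True).start()
    return True

# Simulate a storm: 100 endpoints all wanting a drain thread at once.
gate = threading.Event()
results = [try_start_backlog_thread(gate.wait) for _ in range(100)]
started = sum(results)  # only MAX_BACKLOG_THREADS threads actually start
gate.set()              # let the workers finish
```

The rest of the callers would fail fast and fall back, rather than piling up threads during a storm.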
Thanks for the quick reply!
Please allow me to add my two cents.
I understand that this logic was added to circumvent issues related to using synchronous operations and thread pool starvation in general.
However, it seems that this is not a problem the Redis client library should solve. It would be more rational to focus on minimizing overhead within the library itself. At first glance, abandoning synchronous locks in favor of wait-free alternatives and using System.Threading.Channels for the backlog could simplify the code and improve performance in high-load scenarios. Even better would be to extract the backlog and its processing logic out of PhysicalBridge into its own interface, making it configurable for different scenarios.
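The channel-based shape being suggested can be sketched in Python, with asyncio.Queue standing in for System.Threading.Channels; this is only a rough analogue of the idea, and every name here is illustrative:

```python
import asyncio

# Rough analogue of the suggestion: replace lock + dedicated thread with a
# channel (asyncio.Queue stands in for System.Threading.Channels) and a
# single consumer task per connection. Names are illustrative only.
async def backlog_consumer(q: asyncio.Queue, write) -> None:
    while True:
        msg = await q.get()       # no blocking wait, no dedicated thread
        if msg is None:           # sentinel: connection closing
            return
        await write(msg)          # e.g. serialize to the pipe/socket

async def main() -> list:
    sent = []

    async def fake_write(msg):
        sent.append(msg)

    q: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(backlog_consumer(q, fake_write))

    for i in range(5):
        q.put_nowait(i)           # producers never block the calling thread
    q.put_nowait(None)
    await consumer
    return sent

result = asyncio.run(main())
```

The producer side is wait-free from the caller's point of view; backpressure, if wanted, would come from bounding the channel.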
We're tweaking how this works in #2667 to reflect the current dedicated thread behavior. With respect to the follow-up comments: I'm just going to assume you've only taken a glance because the rest of it's pretty uninformed with respect to how things in the library behave.
I'm going to write off the other comments as not having taken any deep view of things in the library or dealt with thread pool starvation in hundreds of apps as we have and continue to help others with and leave it at that.
As I understand the current architecture:
From the top view, PhysicalBridge is responsible for pushing messages to the backlog, from which they will be read by a dedicated thread and serialized to a pipe (from which the serialized messages will be pushed to the socket by a PipeScheduler) or completed by timeout. Dedicated threads are only used to deal with situations where the runtime is experiencing thread pool starvation. There is one PhysicalBridge instance per connection. As of now there is a way to configure the PipeScheduler, but there is no way to configure backlog processing behavior.
My points:
1) Having a thread per backlog/connection is far from optimal, especially when the Redis cluster is quite large. More threads mean more overhead from context switches, scheduling, etc.
2) If my application is experiencing thread pool starvation due to synchronous I/O or synchronous locking, this will only hide the real source of the issue and not help much (especially if I use SocketManager.ThreadPool), and my application will still perform very poorly.
3) The .NET thread pool has gotten significantly better at detecting starvation and will quickly create threads by itself.
4) It's an absolutely good idea to help people diagnose thread pool starvation, as in the message of RedisTimeoutException, or to prevent such issues with the help of some Roslyn analyzer, or to find possible sources of starvation with Ben.BlockingDetector, etc.
5) It will not make things worse if backlog behavior can be made configurable.
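Point 2 is easy to demonstrate in miniature. The sketch below (Python, purely illustrative) uses a fixed 2-thread pool to stand in for a starved thread pool; unlike the .NET pool from point 3, ThreadPoolExecutor never injects extra threads, which makes the stall easy to see:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustration: if application code blocks pool threads (sync-over-async,
# sync I/O), everything queued behind it stalls no matter what the Redis
# client does. The pool size and sleep duration are arbitrary.
pool = ThreadPoolExecutor(max_workers=2)

def blocking_work():
    time.sleep(0.5)   # e.g. a sync-over-async wait somewhere in app code

start = time.monotonic()
blockers = [pool.submit(blocking_work) for _ in range(2)]
quick = pool.submit(lambda: "pong")   # a cheap "Redis reply" handler
result = quick.result()
latency = time.monotonic() - start
# The cheap task waited ~0.5 s for a free thread despite doing no work:
# that queueing delay is what starvation looks like to callers.
pool.shutdown()
```

The .NET pool would eventually detect this and grow, which is the point being made in 3: the runtime already has a starvation response of its own.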
I wrote a quick benchmark that emulates backlog behavior and processes 10,000,000 items:
BenchmarkDotNet v0.13.12, Windows 11
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
[Host] : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
| Method | NumberOfConsumers | NumberOfProducers | Mean | Error | StdDev | Median |
|----------------|------------------ |------------------ |------------:|----------:|----------:|------------:|
| Tasks+Channels | 1 | 1 | 2,515.8 ms | 48.65 ms | 47.78 ms | 2,500.9 ms |
| Threads+CQ | 1 | 1 | 2,518.7 ms | 44.62 ms | 41.74 ms | 2,517.4 ms |
| Tasks+Channels | 1 | 8 | 2,774.9 ms | 22.63 ms | 21.17 ms | 2,770.6 ms |
| Threads+CQ | 1 | 8 | 3,265.1 ms | 88.79 ms | 261.79 ms | 3,432.6 ms |
| Tasks+Channels | 1 | 32 | 2,831.8 ms | 55.89 ms | 87.02 ms | 2,825.2 ms |
| Threads+CQ | 1 | 32 | 3,791.6 ms | 72.86 ms | 77.96 ms | 3,770.4 ms |
| Tasks+Channels | 1 | 128 | 3,117.2 ms | 59.07 ms | 68.03 ms | 3,127.5 ms |
| Threads+CQ | 1 | 128 | 3,986.0 ms | 77.86 ms | 98.47 ms | 3,987.2 ms |
| Tasks+Channels | 8 | 1 | 5,156.9 ms | 41.39 ms | 38.72 ms | 5,152.8 ms |
| Threads+CQ | 8 | 1 | 19,966.5 ms | 86.58 ms | 80.99 ms | 19,966.4 ms |
| Tasks+Channels | 8 | 8 | 864.5 ms | 16.60 ms | 18.45 ms | 859.2 ms |
| Threads+CQ | 8 | 8 | 1,586.8 ms | 31.37 ms | 55.76 ms | 1,601.4 ms |
| Tasks+Channels | 8 | 32 | 1,117.6 ms | 16.38 ms | 15.32 ms | 1,116.2 ms |
| Threads+CQ | 8 | 32 | 912.3 ms | 18.10 ms | 47.67 ms | 913.7 ms |
| Tasks+Channels | 8 | 128 | 1,206.8 ms | 23.42 ms | 30.45 ms | 1,198.5 ms |
| Threads+CQ | 8 | 128 | 983.0 ms | 18.51 ms | 19.80 ms | 988.7 ms |
| Tasks+Channels | 32 | 1 | 5,191.6 ms | 31.90 ms | 28.28 ms | 5,184.7 ms |
| Threads+CQ | 32 | 1 | 19,619.6 ms | 162.84 ms | 144.36 ms | 19,644.6 ms |
| Tasks+Channels | 32 | 8 | 977.1 ms | 8.87 ms | 8.29 ms | 978.7 ms |
| Threads+CQ | 32 | 8 | 1,825.8 ms | 35.98 ms | 38.50 ms | 1,838.9 ms |
| Tasks+Channels | 32 | 32 | 652.4 ms | 5.04 ms | 4.47 ms | 654.0 ms |
| Threads+CQ | 32 | 32 | 835.5 ms | 16.55 ms | 45.58 ms | 839.3 ms |
| Tasks+Channels | 32 | 128 | 661.7 ms | 13.03 ms | 19.10 ms | 659.8 ms |
| Threads+CQ | 32 | 128 | 839.7 ms | 16.69 ms | 36.29 ms | 844.8 ms |
| Tasks+Channels | 128 | 1 | 4,756.7 ms | 21.07 ms | 17.59 ms | 4,760.6 ms |
| Threads+CQ | 128 | 1 | 19,599.3 ms | 262.33 ms | 245.39 ms | 19,649.7 ms |
| Tasks+Channels | 128 | 8 | 995.9 ms | 9.45 ms | 8.84 ms | 997.0 ms |
| Threads+CQ | 128 | 8 | 2,869.4 ms | 56.35 ms | 100.17 ms | 2,864.1 ms |
| Tasks+Channels | 128 | 32 | 581.5 ms | 11.47 ms | 20.09 ms | 583.2 ms |
| Threads+CQ | 128 | 32 | 867.5 ms | 17.07 ms | 27.56 ms | 869.9 ms |
| Tasks+Channels | 128 | 128 | 618.2 ms | 6.56 ms | 6.14 ms | 618.5 ms |
| Threads+CQ | 128 | 128 | 912.5 ms | 19.55 ms | 57.66 ms | 921.6 ms |
Both methods could be optimized further, but I hope they are still close enough for illustrative purposes.
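For readers who want the shape of the two strategies being compared, here is a much-reduced Python sketch (the original benchmark was C#/BenchmarkDotNet moving 10,000,000 items; this moves 10,000 just to show the structure, not to reproduce the numbers, and "Threads+CQ" is approximated with a blocking queue):

```python
import asyncio
import queue
import threading

N = 10_000  # item count, reduced from the benchmark's 10M

def threads_plus_queue(consumers: int) -> int:
    # Analogue of "Threads+CQ": dedicated threads draining a shared queue.
    q: "queue.Queue" = queue.Queue()
    total = 0
    lock = threading.Lock()

    def consume():
        nonlocal total
        while True:
            item = q.get()
            if item is None:       # one sentinel per consumer thread
                return
            with lock:
                total += item

    threads = [threading.Thread(target=consume) for _ in range(consumers)]
    for t in threads:
        t.start()
    for _ in range(N):
        q.put(1)
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return total

async def tasks_plus_channel(consumers: int) -> int:
    # Analogue of "Tasks+Channels": consumer tasks awaiting a channel.
    q: asyncio.Queue = asyncio.Queue()
    total = 0

    async def consume():
        nonlocal total
        while True:
            item = await q.get()
            if item is None:
                return
            total += item

    tasks = [asyncio.create_task(consume()) for _ in range(consumers)]
    for _ in range(N):
        q.put_nowait(1)
    for _ in range(consumers):
        q.put_nowait(None)
    await asyncio.gather(*tasks)
    return total

threads_total = threads_plus_queue(4)
tasks_total = asyncio.run(tasks_plus_channel(4))
```

Both variants move the same items; the difference the table measures is scheduling cost, which is exactly where thread-per-consumer loses at high consumer counts.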
PS: I'm not a native English speaker and don't use English for communication very often, so my selection of words may not be very accurate. Please forgive any unintended offense.
@krasin-ga I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not. The backlog is not normally written to, but it must be drained synchronously to help prevent spiral-of-death scenarios (because it's quite likely that writes or response processing cannot complete due to starvation). The normal path for async commands is fully async and uses Pipes.
The backlog is only used as a buffer when we can't write to the connection, e.g. we're disconnected. And as per the docs, you can turn this off if you want to fail immediately, indicating no connections are available. You're stating this can be optimized: hey sure, I'm all ears, but it needs to be done in a way that the backlog has a chance of draining when thread pool starvation is hit (which is a common scenario for applications losing connections unexpectedly, often having some sync-over-async elsewhere we don't control). That's why we create a thread to do this work and listen for incoming work for 5 more seconds, instead of repeatedly spinning up a thread while we're recovering from overwhelming the outbound socket buffer.
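The drain-then-linger behavior described here can be sketched as follows. This is an illustrative Python analogue, not library code; `LINGER` plays the role of the 5-second WaitOne and is shortened so the example runs quickly:

```python
import queue
import threading

# Sketch: spin up one dedicated thread to drain the backlog, and have it
# linger waiting for more items before exiting, instead of creating a
# fresh thread per burst. All names and timings are illustrative.
LINGER = 0.2  # seconds (the real library waits 5.0)

def start_drain_thread(backlog: "queue.Queue", handle) -> threading.Thread:
    def drain():
        while True:
            try:
                item = backlog.get(timeout=LINGER)  # linger for more work
            except queue.Empty:
                return                              # idle long enough: exit
            handle(item)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t

handled = []
backlog: "queue.Queue" = queue.Queue()
for i in range(3):
    backlog.put(i)
t = start_drain_thread(backlog, handled.append)
t.join()  # returns once the backlog has been empty for LINGER seconds
```

Because the thread does not block the pool, the drain keeps working even when every pool thread is stuck, which is the whole point of the design.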
As for the thread pool growing on its own: that is often a misconception once you get into the details.
I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not.
Yep, I certainly misunderstood that part. Thanks for the clarification!
Upd: But not by that much: if we can't take the mutex, we still end up enqueuing to the backlog, and that will happen quite often under load.
@krasin-ga Thanks again for the report! Version 2.7.33 with this fixed up is available on NuGet now :)
Nice, thanks for the quick fix!
@krasin-ga @NickCraver Do you guys know in which version this issue started happening?
@giuliano-macedo I'd assume this was a potential issue when consuming a lot of endpoints since 2.5.x, that's my best guess here.
@NickCraver Great! And thanks for the fix!
Here is the code that starts a dedicated backlog processing thread and provides some motivation in comments on the decision not to use the thread pool: https://github.com/StackExchange/StackExchange.Redis/blob/441f89afeff6d9a55ecf6e413ed5bd29b5aec19d/src/StackExchange.Redis/PhysicalBridge.cs#L901-L912
But after asynchronously awaiting ProcessBridgeBacklogAsync, the rest of the code (including subsequent loop iterations) will not only continue on a thread pool thread but also block it on a sync lock: https://github.com/StackExchange/StackExchange.Redis/blob/441f89afeff6d9a55ecf6e413ed5bd29b5aec19d/src/StackExchange.Redis/PhysicalBridge.cs#L988-L1012
In our case, we have several relatively large Redis clusters with tens to hundreds of nodes, and sometimes we experience situations where the majority of thread pool threads are blocked inside instances of PhysicalBridge:
So we get the best of both worlds: hundreds of expensive worker thread creations and thread pool starvation 🙂