asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

Fix - Under load and during topology changes, thread saturation can occur, causing a lockup #2139

Closed benbenwilde closed 2 weeks ago

benbenwilde commented 2 weeks ago

Description

Background:

This is an issue that has caused us some pain in our production environments, rendering our proto.actor cluster inoperable until we took action to reduce load on the system. In our environment we have pods coming online and going offline all the time, sometimes under heavy load, and this exposed an issue in the EndpointManager, where requests for new endpoints wait behind a lock while another thread disposes an endpoint.

The change:

This PR modifies EndpointManager so that it disposes endpoints outside of the lock, while any concurrent requests for that endpoint receive a blocked endpoint instead of waiting behind the lock. This way, while an endpoint is being cleaned up, new endpoints can still be added, and multiple endpoints can be disposed at the same time.

This change makes EndpointManager more robust by minimizing the time spent inside the lock, preventing potential thread saturation and lockup during topology changes. Since we don't want to send requests to an endpoint that is being disposed, those requests get a blocked endpoint instead. After the dispose is complete, the same conditions as before apply for blocking and unblocking the endpoint, namely the ShouldBlock flag on the event and the WaitAfterEndpointTerminationTimeSpan config.
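To illustrate the locking pattern described above, here is a minimal sketch. This is not the actual Proto.Remote.EndpointManager code; the type names (`IEndpoint`, `BlockedEndpoint`, `EndpointManagerSketch`) and fields are stand-ins chosen for the example. The point is the shape of the change: only bookkeeping happens while the lock is held, the potentially slow dispose runs outside of it, and concurrent callers for a disposing endpoint get a blocked endpoint instead of waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch only -- not the actual Proto.Remote.EndpointManager code.
public interface IEndpoint : IAsyncDisposable { }

public sealed class BlockedEndpoint : IEndpoint
{
    // A blocked endpoint does nothing; callers get it while the real one is torn down.
    public ValueTask DisposeAsync() => default;
}

public sealed class EndpointManagerSketch
{
    private readonly object _lock = new();
    private readonly Dictionary<string, IEndpoint> _endpoints = new();
    private readonly HashSet<string> _disposing = new();
    private readonly Func<string, IEndpoint> _factory;

    public EndpointManagerSketch(Func<string, IEndpoint> factory) => _factory = factory;

    public IEndpoint GetOrAddEndpoint(string address)
    {
        lock (_lock)
        {
            // While an endpoint is being torn down, hand out a blocked endpoint
            // immediately instead of making the caller wait behind the lock.
            if (_disposing.Contains(address)) return new BlockedEndpoint();

            if (!_endpoints.TryGetValue(address, out var endpoint))
            {
                endpoint = _factory(address);
                _endpoints[address] = endpoint;
            }
            return endpoint;
        }
    }

    public async Task RemoveEndpointAsync(string address)
    {
        IEndpoint? endpoint;
        lock (_lock)
        {
            // Only cheap bookkeeping happens while the lock is held.
            if (!_endpoints.Remove(address, out endpoint)) return;
            _disposing.Add(address);
        }

        try
        {
            // The potentially slow dispose runs outside the lock, so new endpoints
            // can still be added and several endpoints can be disposed concurrently.
            await endpoint!.DisposeAsync().ConfigureAwait(false);
        }
        finally
        {
            lock (_lock) { _disposing.Remove(address); }
        }
    }
}
```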

Testing:

I've added a project called EndpointManagerTest which reproduces the issue intermittently. After applying the change, the issue no longer occurs. Because results can vary significantly between environments and CPUs, a dockerfile has been provided as well so the test can be run more consistently with a limit of 1 CPU (results will still vary between machines). Without the fix, the issue only occurs intermittently (in my case roughly half the time), since it depends on race conditions and on how work ends up getting allocated by the threadpool.

Details:

In our production environment, we had a scenario where many proto.actor client pods would start up and begin sending messages to actors on a couple of proto.actor member pods. The actors would typically handle these easily, so the load itself was not an issue. But endpoints for all of these clients would have to be added when sending responses to partition identity or placement messages. If this occurred while some endpoint was being terminated, it could result in thread saturation and lockup: a thread would enter the EndpointManager lock to dispose an endpoint while many messages were being sent to new endpoints, all of which would wait behind the lock. These are all blocking waits, so each one holds the thread that was allocated to it and other work can't use it. The threadpool of course injects new threads according to its algorithm, but if they keep getting allocated to tasks that end up waiting behind the lock, then no work gets done, potentially for quite some time. Once the endpoint dispose work is finally given a thread and can complete, everything starts flowing again, but that doesn't always happen quickly when so many threads are stuck in blocking waits and there is so much competition for new threads as they are added.

In our production environment, another connected resource would disconnect because its health checks could not complete during the lockup, which would ultimately cause a restart of the pod. The pod would then enter an endless cycle of locking up and restarting.
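To make the starvation mechanism concrete, here is a small standalone console sketch. It is not Proto.Actor code and all names are illustrative: one work item holds a lock while doing slow "dispose" work, many other work items block on the same lock and each pins a thread-pool thread, and unrelated work (a periodic "health check" here) stalls until the pool slowly injects new threads.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

internal static class ThreadStarvationDemo
{
    private static readonly object Lock = new();

    private static void Main()
    {
        ThreadPool.SetMinThreads(2, 2); // keep the pool small, like a 1-CPU container

        // Simulates the endpoint dispose holding the lock for a while.
        Task.Run(() =>
        {
            lock (Lock) Thread.Sleep(5_000);
        });

        // Simulates many senders that all need the lock to resolve an endpoint.
        // Each dequeued item blocks on the lock and pins a thread-pool thread.
        for (var i = 0; i < 200; i++)
        {
            Task.Run(() =>
            {
                lock (Lock) { /* "get endpoint" */ }
            });
        }

        // Unrelated periodic work: observe how long each tick waits for a thread.
        var sw = Stopwatch.StartNew();
        for (var tick = 0; tick < 10; tick++)
        {
            var queued = sw.Elapsed;
            Task.Run(() => Console.WriteLine(
                    $"health check queued at {queued:g}, ran at {sw.Elapsed:g}"))
                .Wait();
            Thread.Sleep(500);
        }
    }
}
```

With the pool kept small, the first few "health check" ticks only run once the lock holder finishes or enough threads have been injected, which mirrors the health checks that could not complete in production during the lockup.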


benbenwilde commented 2 weeks ago

The tests that failed in the checks here passed when I ran them locally; maybe it just needs a retry.

benbenwilde commented 2 weeks ago

I found another issue while one of the tests was failing: it was timing out when shutting down the Kestrel host because, after sending the disconnect request, it could end up waiting forever for the end of the stream. With that fixed, we always get a clean shutdown regardless of what is happening on the other side.
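A generic sketch of the "don't wait forever for the other side" pattern this describes, assuming the fix is to bound the wait after sending the disconnect. The names below are illustrative, not the actual Proto.Remote endpoint code:

```csharp
using System;
using System.Threading.Tasks;

public static class GracefulDisconnectSketch
{
    public static async Task DisconnectAsync(
        Func<Task> sendDisconnect,     // hypothetical: writes the disconnect message
        Task remoteStreamCompleted,    // hypothetical: completes when the peer closes the stream
        TimeSpan waitForPeer)
    {
        await sendDisconnect().ConfigureAwait(false);

        // Wait for the end of the stream, but never longer than waitForPeer,
        // so shutdown always completes even if the peer never answers.
        var finished = await Task.WhenAny(remoteStreamCompleted, Task.Delay(waitForPeer))
                                 .ConfigureAwait(false);

        if (finished != remoteStreamCompleted)
        {
            // Peer didn't close the stream in time; proceed with shutdown anyway.
            Console.WriteLine("Remote stream did not complete in time, shutting down anyway.");
        }
    }
}
```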