asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

Pass cancellation token to grpc reader #2133

Closed by benbenwilde 2 months ago

benbenwilde commented 2 months ago

Description

Regardless of all of the following, this should be a good change on its own, since we are just passing a cancellation token somewhere you would expect it to be passed. Nonetheless, I've outlined everything below.
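For context, here is a minimal sketch of the kind of change involved (ReaderSketch, ReadLoopAsync, and Process are hypothetical names for illustration, not the actual protoactor-dotnet source):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;

// Illustrative sketch only; these names are stand-ins, not the
// actual protoactor-dotnet ServerConnector code.
static class ReaderSketch
{
    public static async Task ReadLoopAsync(
        IAsyncStreamReader<string> responseStream,
        CancellationToken cancellationToken)
    {
        // Grpc.Core's parameterless MoveNext() is an extension method
        // that passes CancellationToken.None, so a pending read ignores
        // the caller's token entirely. Passing the token means cancelling
        // it completes the pending read right away instead of leaving it
        // to fail on its own much later.
        while (await responseStream.MoveNext(cancellationToken))
        {
            Process(responseStream.Current);
        }
    }

    private static void Process(string message)
    {
        // Handle the received message.
    }
}
```

Before the change, the loop effectively called MoveNext(CancellationToken.None), so cancelling the connector's token could not interrupt a read that was already in flight.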

This fixes a difficult-to-reproduce issue.

It would occur when members were being added to or removed from the cluster. The GossipActor on an existing member (not a new one) would stop receiving messages and everything would time out for 15 min, which of course left various actors unresponsive or broken.

After applying this fix, I was no longer able to reproduce the issue with the method described below.

The easiest way for me to reproduce it was to continually kill a pod in k8s every 20s; each time it restarted, it would join the cluster as a new member (20s gave it enough time to start, and for the other members to log that it had joined the cluster). Of course, killing a pod like this without a graceful shutdown causes some temporary issues, but that is not what this is about.

My theory of how this occurred:

When a member is removed, it has to be removed from the EndpointManager on the other members. Each removal is done separately within a lock. Part of that removal involves stopping the ServerConnector for that endpoint: a cancellation token is canceled, and then the connector waits for its main method to return. Depending on the timing, that main method can end up waiting on the grpc call await call.ResponseStream.MoveNext(), and nothing in there observes the cancellation token (this PR changes that). Something in the grpc call would then time out after 15 min (every time I reproduced the issue, it lasted for 15 min and then recovered), producing a [ServerConnector] Error in reader RpcException error log. Only then would it finally finish and release the lock, allowing other processes to continue.

Meanwhile, on another thread, a SendGossipStateRequest is started but doesn't complete until after the 15 min. During this time we continually see Gossip loop failed error logs and many other errors as a result. SendGossipStateRequest does use reentry, but the handler for that message does not return until the GossipRequest is ultimately written to the Channel in the Endpoint class. You can follow the code that runs synchronously all the way down: RequestReenter => RequestAsync => RequestAsync => Send => SendUserMessage => SendUserMessage => RemoteProcess.SendUserMessage => Send => GetEndpoint, then SendMessage => TryWrite. GetEndpoint calls GetOrAddServerEndpoint, which will wait to enter the lock held by the removal (the one blocked on the ServerConnector) if the endpoint is not already there (or, in this case, has already been removed). There is logic for blocking endpoints for a time, but while the removal is stuck waiting for the ServerConnector to finish, the endpoint hasn't been added to the block list yet. So the gossip send waits 15 min for that lock, and only after that do things start flowing again and recover.

This PR causes the grpc call await call.ResponseStream.MoveNext() to return right away when the ServerConnector is stopped, by passing the cancellation token.
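To make the shape of the stall concrete, here is a minimal, self-contained model of the scenario (ServerConnectorModel, Demo, and EndpointLock are illustrative stand-ins, not protoactor-dotnet's actual types): stopping the connector cancels a token and then waits for the read loop, and that wait happens while the endpoint lock is held, so everything that needs the lock queues behind it for as long as the read takes to complete.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal model of the failure mode; all names here are illustrative,
// not the actual protoactor-dotnet types.
class ServerConnectorModel
{
    private readonly CancellationTokenSource _cts = new();
    private Task _readLoop = Task.CompletedTask;

    public void Start(Func<CancellationToken, Task<bool>> moveNext) =>
        _readLoop = Task.Run(async () =>
        {
            try
            {
                // With the fix, moveNext observes the token, so this await
                // completes as soon as Stop() cancels it. Without it, the
                // loop sits here until the call fails on its own (15 min
                // in the reproductions above).
                while (await moveNext(_cts.Token)) { }
            }
            catch (OperationCanceledException)
            {
                // Expected when Stop() cancels the token mid-read.
            }
        });

    public void Stop()
    {
        _cts.Cancel();
        _readLoop.Wait(); // Stop() is only as fast as the read loop.
    }
}

class Demo
{
    private static readonly object EndpointLock = new();

    static async Task Main()
    {
        var connector = new ServerConnectorModel();

        // Stand-in for MoveNext that honors the token: it parks forever
        // and only completes when the token is cancelled.
        connector.Start(async token =>
        {
            await Task.Delay(Timeout.Infinite, token);
            return true;
        });

        // Removal path: Stop() is called while holding the endpoint lock.
        var removal = Task.Run(() =>
        {
            lock (EndpointLock) connector.Stop();
        });
        await removal;

        // Gossip path: needs the same lock. With a token-aware read this
        // acquires promptly; with a token-blind read it would be stuck
        // behind the removal for the full timeout.
        lock (EndpointLock) { }
        Console.WriteLine("lock acquired promptly after removal");
    }
}
```

Swap the moveNext stand-in for one that ignores the token and the same model stalls: the removal holds the lock until the read completes, and the gossip path waits behind it, which is the 15 min outage described above.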

Purpose

This pull request is a: Bugfix


CLAassistant commented 2 months ago

CLA assistant check
All committers have signed the CLA.

rogeralsing commented 2 months ago

Thanks! Merging!