Open benbenwilde opened 4 months ago
The issue causing gossip timeouts should be fixed by https://github.com/asynkron/protoactor-dotnet/pull/2133, but this post brings up a separate issue: the shutdown process is not reliable. This is pretty bad because the member keeps running in a zombie state, throwing various errors and unable to do anything, since it was set to shut down but never finished. I'm currently working around this by listening for .Cluster().MemberList.Stopping, and when that triggers I give it 3 minutes to complete a clean shutdown; otherwise I stop the application anyway.
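For illustration, the workaround can be sketched roughly as below. This is a minimal sketch, not the code I actually run: `ShutdownWatchdog`, `RunAsync`, and all of its parameters are hypothetical names, and it assumes the stopping flag can be read as a plain `bool` through a delegate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Proto.Cluster.
public static class ShutdownWatchdog
{
    // Polls isStopping until it returns true, then waits up to gracePeriod
    // for shutdownCompleted. Calls forceStop only if the grace period runs
    // out first. Returns true if forceStop was invoked.
    public static async Task<bool> RunAsync(
        Func<bool> isStopping,
        Task shutdownCompleted,
        TimeSpan gracePeriod,
        Action forceStop,
        TimeSpan pollInterval,
        CancellationToken ct = default)
    {
        // Wait for the member to enter the stopping state.
        while (!isStopping())
        {
            await Task.Delay(pollInterval, ct);
        }

        // Race the clean shutdown against the grace period.
        var finished = await Task.WhenAny(shutdownCompleted, Task.Delay(gracePeriod, ct));
        if (finished == shutdownCompleted)
        {
            return false; // clean shutdown finished in time
        }

        // Do not force a stop just because the watchdog itself was cancelled.
        ct.ThrowIfCancellationRequested();

        forceStop();
        return true;
    }
}
```

In an application this might be wired up with `isStopping: () => system.Cluster().MemberList.Stopping`, a `gracePeriod` of `TimeSpan.FromMinutes(3)`, and `forceStop: () => Environment.Exit(1)`, where `system` and the task that signals a completed shutdown are assumptions about the host application.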
I am running the latest pre-release, 1.6.1-alpha.0.22.
Here is where it occurs in the Cluster class:
As you can see, an error there would stop the graceful shutdown in its tracks. The member is then blocked but never gets to shut down, so throughout the cluster I see tons of "we are blocked" or "they are blocked" messages. Furthermore, it never attempts the shutdown again because MemberList.Stopping is now set to true.
I'm not sure why the GossipActor is not able to respond, so that's another thing I need to look into, since the gossip loop is also timing out on the BlockGracefullyLeft part. Nonetheless, it seems it would be good if an issue with the gossip actor did not stop the member from being able to shut down in a situation like this. So I'm curious what other people's thoughts are, and I would be happy to submit some changes for this as well.
Basically, I'm thinking about adding try/catch blocks and possibly timeouts around each step, so that one failing step does not prevent us from attempting the rest of the shutdown.
Off the top of my head, something like:
Where AttemptTask looks something like:
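As a rough illustration of the idea (not the actual Cluster code), a helper along these lines wraps one shutdown step in a try/catch and a timeout. The signature, the `ShutdownSteps` class, and the `log` callback are assumptions made for this sketch.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of Proto.Cluster.
public static class ShutdownSteps
{
    // Runs one shutdown step. Returns true if the step completed within the
    // timeout, false if it threw or timed out. Failures of the step are
    // logged rather than rethrown, so the caller can continue with later steps.
    public static async Task<bool> AttemptTask(
        string stepName,
        Func<Task> step,
        TimeSpan timeout,
        Action<string> log)
    {
        try
        {
            var stepTask = step();

            // Race the step against the timeout.
            var finished = await Task.WhenAny(stepTask, Task.Delay(timeout));
            if (finished != stepTask)
            {
                log($"Shutdown step '{stepName}' timed out after {timeout}");
                return false;
            }

            // The step finished; awaiting it surfaces any exception it threw.
            await stepTask;
            return true;
        }
        catch (Exception ex)
        {
            log($"Shutdown step '{stepName}' failed: {ex.Message}");
            return false;
        }
    }
}
```

Each shutdown step would then be called through `AttemptTask` in sequence, one call per step, so a timeout or exception in the gossip step is logged and the remaining steps still run. Note that a step that times out is not cancelled by this sketch; it keeps running in the background unless the step itself accepts a cancellation token.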
Thanks.
Here's the earlier mentioned error for reference: