asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

ShutdownAsync will not complete when it encounters an error, leaving member in zombie state #2128

Open benbenwilde opened 4 months ago

benbenwilde commented 4 months ago

I am running the latest pre-release 1.6.1-alpha.0.22

Here is where it occurs in the Cluster class:

public async Task ShutdownAsync(bool graceful = true, string reason = "")
{
    Logger.LogInformation("Stopping Cluster {Id}", System.Id);

    // Inform all members of the cluster that this node intends to leave. Also, let the MemberList know that this
    // node was the one that initiated the shutdown to prevent another shutdown from being called.
    Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
    MemberList.Stopping = true;
    await Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()).ConfigureAwait(false);

    ... continues with shutdown

As you can see, an error there stops the graceful shutdown in its tracks. The member is blocked but never gets to shut down, so throughout the cluster I see tons of "we are blocked" or "they are blocked" messages. Furthermore, it never attempts the shutdown again, because MemberList.Stopping has already been set to true.

I'm not sure why the GossipActor is not able to respond, so that's another thing I need to look into, since the gossip loop is also timing out on the BlockGracefullyLeft part.

Nonetheless, it would be good if a problem with the gossip actor did not prevent the member from shutting down in a situation like this. I'm curious what other people's thoughts are, and I would be happy to submit some changes for this as well.

I'm thinking about adding try/catches and possibly timeouts around each step, so that we can still attempt the rest of the shutdown.

Off the top of my head, something like:

await AttemptTask(Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()),
    TimeSpan.FromSeconds(1), "Setting GracefullyLeft").ConfigureAwait(false);

Where AttemptTask looks something like:

private async Task AttemptTask(Task task, TimeSpan timeout, string name)
{
    // if the task fails, even after we time out, still log the error
    _ = task.ContinueWith(t =>
    {
        if (!t.IsCompletedSuccessfully)
        {
            Logger.LogError(t.Exception, "Error during shutdown step [{stepName}]", name);
        }
    });

    // Task.WhenAny never throws, so a faulted task is only logged by the continuation above
    await Task.WhenAny(Task.Delay(timeout), task).ConfigureAwait(false);

    if (!task.IsCompleted)
    {
        // if the task isn't complete, we timed out
        Logger.LogError("Timeout during shutdown step [{stepName}] after {timeout}", name, timeout);
    }
}
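
For example, the start of ShutdownAsync could then look roughly like this (just a sketch of the idea, not the actual implementation; the one-second timeout is an arbitrary placeholder):

public async Task ShutdownAsync(bool graceful = true, string reason = "")
{
    Logger.LogInformation("Stopping Cluster {Id}", System.Id);

    Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
    MemberList.Stopping = true;

    // wrap each step so a single failure or hang can't abort the rest of the shutdown
    await AttemptTask(Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()),
        TimeSpan.FromSeconds(1), "Setting GracefullyLeft").ConfigureAwait(false);

    // ... remaining steps wrapped the same way
}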

Thanks.


Here's the error mentioned earlier, for reference:

RootContext Got exception waiting for RequestAsync response of SetGossipStateKey:SetGossipStateKey { Key = cluster:left, Value = { } } from nonhost/$gossip
System.TimeoutException
Request didn't receive any Response within the expected time.
StackTraceString: at Proto.Future.SharedFutureProcess.SharedFutureHandle.GetTask(CancellationToken cancellationToken)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
at Proto.RootLoggingContext.RequestAsync[T](PID target, Object message, CancellationToken cancellationToken)
benbenwilde commented 2 months ago

The issue causing the gossip timeouts should be fixed by https://github.com/asynkron/protoactor-dotnet/pull/2133, but this post brings up a separate issue: the shutdown process is not reliable. This is pretty bad, because the member continues to run but throws various errors and can't do anything (zombie state), since it was set to shut down but never finished. I'm currently working around this by listening for .Cluster().MemberList.Stopping; when that triggers, I give it 3 minutes to complete a clean shutdown, otherwise I stop the application anyway.
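
In case it helps anyone else, here is a rough sketch of that workaround, assuming a generic-host BackgroundService with IHostApplicationLifetime injected (the class name, polling interval, and 3-minute grace period are my own choices, not Proto.Actor API):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Proto;
using Proto.Cluster;

public sealed class ClusterShutdownWatchdog : BackgroundService
{
    private readonly ActorSystem _system;
    private readonly IHostApplicationLifetime _lifetime;
    private readonly ILogger<ClusterShutdownWatchdog> _logger;

    public ClusterShutdownWatchdog(ActorSystem system, IHostApplicationLifetime lifetime,
        ILogger<ClusterShutdownWatchdog> logger)
    {
        _system = system;
        _lifetime = lifetime;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            // poll until the cluster member starts shutting down
            while (!_system.Cluster().MemberList.Stopping)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
            }

            // give the clean shutdown 3 minutes; if the host stops normally in the meantime,
            // stoppingToken cancels this delay and we just exit
            await Task.Delay(TimeSpan.FromMinutes(3), stoppingToken);

            _logger.LogError("Cluster shutdown did not complete within 3 minutes, stopping the application anyway");
            _lifetime.StopApplication();
        }
        catch (OperationCanceledException)
        {
            // the host is stopping normally
        }
    }
}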