asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

ShutdownAsync will not complete when it encounters an error, leaving member in zombie state #2128

Open benbenwilde opened 4 months ago

benbenwilde commented 4 months ago

I am running the latest pre-release 1.6.1-alpha.0.22

Here is where it occurs in the Cluster class:

public async Task ShutdownAsync(bool graceful = true, string reason = "")
{
    Logger.LogInformation("Stopping Cluster {Id}", System.Id);

    // Inform all members of the cluster that this node intends to leave. Also, let the MemberList know that this
    // node was the one that initiated the shutdown to prevent another shutdown from being called.
    Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
    MemberList.Stopping = true;
    await Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()).ConfigureAwait(false);

    ... continues with shutdown

As you can see, an error there stops the graceful shutdown in its tracks. The member is blocked but never gets to shut down, so throughout the cluster I see tons of "we are blocked" or "they are blocked" messages. Furthermore, it never attempts the shutdown again, because MemberList.Stopping has already been set to true.

I'm not sure why the GossipActor is not able to respond, so that's another thing I need to look into, since the gossip loop is also timing out on the BlockGracefullyLeft part.

Nonetheless, it would be good if a problem with the gossip actor did not prevent the member from shutting down in a situation like this. I'm curious what other people's thoughts are, and I would be happy to submit some changes for this as well.

I'm thinking about adding try/catches and possibly timeouts around each step, so that we can still attempt the rest of the shutdown.

Off the top of my head, something like:

await AttemptTask(Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()),
    TimeSpan.FromSeconds(1), "Setting GracefullyLeft").ConfigureAwait(false);

Where AttemptTask looks something like:

private async Task AttemptTask(Task task, TimeSpan timeout, string name)
{
    // if the task fails, even after we time out, still log the error
    _ = task.ContinueWith(t =>
    {
        if (!t.IsCompletedSuccessfully)
        {
            Logger.LogError(t.Exception, "Error during shutdown step [{stepName}]", name);
        }
    });

    // Task.WhenAny never throws, so a faulted task is only logged by the continuation above
    await Task.WhenAny(Task.Delay(timeout), task).ConfigureAwait(false);

    if (!task.IsCompleted)
    {
        // if the task isn't complete, we timed out
        Logger.LogError("Timeout during shutdown step [{stepName}] after {timeout}", name, timeout);
    }
}
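
For example, the start of ShutdownAsync could then look roughly like this (just a sketch of the idea, not the actual implementation; the one-second timeout is an arbitrary placeholder):

public async Task ShutdownAsync(bool graceful = true, string reason = "")
{
    Logger.LogInformation("Stopping Cluster {Id}", System.Id);

    Logger.LogInformation("Setting GracefullyLeft gossip state for {Id}", System.Id);
    MemberList.Stopping = true;

    // wrap each step so a single failure or hang can't abort the rest of the shutdown
    await AttemptTask(Gossip.SetStateAsync(GossipKeys.GracefullyLeft, new Empty()),
        TimeSpan.FromSeconds(1), "Setting GracefullyLeft").ConfigureAwait(false);

    // ... remaining steps wrapped the same way
}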

Thanks.


Here's the error mentioned earlier, for reference:

RootContext Got exception waiting for RequestAsync response of SetGossipStateKey:SetGossipStateKey { Key = cluster:left, Value = { } } from nonhost/$gossip
System.TimeoutException
Request didn't receive any Response within the expected time.
StackTraceString: at Proto.Future.SharedFutureProcess.SharedFutureHandle.GetTask(CancellationToken cancellationToken)
at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
at Proto.RootLoggingContext.RequestAsync[T](PID target, Object message, CancellationToken cancellationToken)
benbenwilde commented 2 months ago

The issue causing the gossip timeouts should be fixed by https://github.com/asynkron/protoactor-dotnet/pull/2133, but this post brings up a separate issue: the shutdown process is not reliable. This is pretty bad, because the member continues to run but throws various errors and can't do anything (zombie state), since it was set to shut down but never finished. I'm currently working around this by listening for .Cluster().MemberList.Stopping; when that triggers, I give it 3 minutes to complete a clean shutdown, otherwise I stop the application anyway.
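
In case it helps anyone else, here is a rough sketch of that workaround, assuming a generic-host BackgroundService with IHostApplicationLifetime injected (the class name, polling interval, and 3-minute grace period are my own choices, not Proto.Actor API):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Proto;
using Proto.Cluster;

public sealed class ClusterShutdownWatchdog : BackgroundService
{
    private readonly ActorSystem _system;
    private readonly IHostApplicationLifetime _lifetime;
    private readonly ILogger<ClusterShutdownWatchdog> _logger;

    public ClusterShutdownWatchdog(ActorSystem system, IHostApplicationLifetime lifetime,
        ILogger<ClusterShutdownWatchdog> logger)
    {
        _system = system;
        _lifetime = lifetime;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            // poll until the cluster member starts shutting down
            while (!_system.Cluster().MemberList.Stopping)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
            }

            // give the clean shutdown 3 minutes; if the host stops normally in the meantime,
            // stoppingToken cancels this delay and we just exit
            await Task.Delay(TimeSpan.FromMinutes(3), stoppingToken);

            _logger.LogError("Cluster shutdown did not complete within 3 minutes, stopping the application anyway");
            _lifetime.StopApplication();
        }
        catch (OperationCanceledException)
        {
            // the host is stopping normally
        }
    }
}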