Graceful shutdown is not graceful

asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin

http://proto.actor

Apache License 2.0

Graceful shutdown is not graceful #2118

Open AqlaSolutions opened 7 months ago

AqlaSolutions commented 7 months ago

When Cluster.ShutdownAsync(true) is called, grain actors don't receive Stopping/Stopped messages. When PID.Stop is called during the ActorSystem shutdown process, the user message is created but never reaches the actor code. The shutdown cancellation token is already cancelled at that point, because the Stop method is called without being awaited. Maybe it would be better to use StopAsync for child actors here?

ActorContext:

Expected behavior: all grains and actors receive Stopping/Stopped events.

AqlaSolutions commented 7 months ago

Also is there a way to use Poison pill instead of Stop when shutting down?

rogeralsing commented 7 months ago

Graceful shutdown only relates to how the member leaves the cluster, meaning it will try to properly deregister from the cluster provider and gossip to other members that it is leaving.

That being said, it would be perfectly possible to make the IIdentityLookup also wait for all actors to stop. And maybe that is the conceptually correct thing to do here.

Open for discussion here

AqlaSolutions commented 7 months ago

According to the IntelliSense docs for Cluster.ShutdownAsync, the graceful parameter is meant to gracefully shut down all grains.

rogeralsing commented 6 months ago

This is now present in this merged PR https://github.com/asynkron/protoactor-dotnet/pull/2121

This is clearly an area that could use some more thought, e.g. should the graceful stop poison the actors, or just hard stop? cc @mhelleborg

mhelleborg commented 6 months ago

> This is now present in this merged PR #2121
>
> This is clearly an area that could use some more thought. e.g. should the graceful stop poison the actors, or just hard stop? cc @mhelleborg

I think both ways can make sense. Stop does give the actors the opportunity to save state, so it might be "graceful enough", while I can imagine situations where you would want to allow the actors to complete their current messages, although that could potentially be slow.

@rogeralsing We could give the caller the ability to choose which strategy to use, potentially with a hard deadline after which it does the hard stop?
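To make the idea concrete, here is a rough sketch of a caller-selected strategy with a hard deadline. It uses only the .NET base class library. StopStrategy, IStoppable, GracefulShutdown and FakeActor are hypothetical names invented for illustration and are not Proto.Actor types. It also assumes that a hard stop completes quickly and that a poison drains the queue first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical names throughout; none of these types exist in Proto.Actor.
public enum StopStrategy { Stop, Poison }

public interface IStoppable
{
    // Assumed semantics: stop as soon as the current message is done.
    Task StopAsync();
    // Assumed semantics: stop only after everything already queued is processed.
    Task PoisonAsync();
}

public static class GracefulShutdown
{
    // Returns true if the chosen strategy finished on its own,
    // false if the deadline passed and a hard stop was used instead.
    public static async Task<bool> StopAllAsync(
        IReadOnlyCollection<IStoppable> actors,
        StopStrategy strategy,
        TimeSpan hardDeadline)
    {
        if (strategy == StopStrategy.Stop)
        {
            await Task.WhenAll(actors.Select(a => a.StopAsync()));
            return true;
        }

        var poison = Task.WhenAll(actors.Select(a => a.PoisonAsync()));
        var deadline = Task.Delay(hardDeadline);

        if (await Task.WhenAny(poison, deadline) == poison)
        {
            await poison; // observe any exception thrown while draining
            return true;
        }

        // Deadline passed: fall back to the hard stop for all actors.
        await Task.WhenAll(actors.Select(a => a.StopAsync()));
        return false;
    }
}

// Test double: poisoning takes drainTime, a hard stop completes immediately.
public sealed class FakeActor : IStoppable
{
    private readonly TimeSpan _drainTime;
    public FakeActor(TimeSpan drainTime) => _drainTime = drainTime;
    public bool HardStopped { get; private set; }
    public Task PoisonAsync() => Task.Delay(_drainTime);
    public Task StopAsync() { HardStopped = true; return Task.CompletedTask; }
}
```

With this shape, Stop stays the cheap default, and Poison gives the actors time to drain without letting one slow actor block the member from leaving forever.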

AqlaSolutions commented 6 months ago

There could be intermediate state in the queue that is no longer stored in the sender and hasn't been processed by the receiving actor yet. In such a case it's necessary to poison. The real question is how to prevent new requests from being put into the queue, especially when another node doesn't know anything about the shutdown. What if some actors need to send requests to others in their Stopped handler? We also can't disconnect from the cluster, because the same grain instance may spawn on another node while the previous instance is still finishing its shutdown.
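For the local half of that problem, a minimal sketch of the behaviour I mean: once shutdown starts, the queue refuses new messages but still processes what was already accepted. DrainingMailbox is a hypothetical stand-in built on System.Threading.Channels, not a Proto.Actor mailbox. It does not address remote senders that are unaware of the shutdown, or a second grain instance spawning elsewhere.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for an actor mailbox; not a Proto.Actor type.
public sealed class DrainingMailbox
{
    private readonly Channel<string> _queue = Channel.CreateUnbounded<string>();
    private readonly Task _loop;

    // Messages handled so far, in order. Read it after ShutdownAsync completes.
    public List<string> Processed { get; } = new List<string>();

    public DrainingMailbox() => _loop = Task.Run(ProcessAsync);

    // Returns false once shutdown has begun, so the sender learns the message
    // was not accepted instead of losing it silently.
    public bool TryPost(string message) => _queue.Writer.TryWrite(message);

    // Poison-like semantics: refuse new messages, then finish what is queued.
    public Task ShutdownAsync()
    {
        _queue.Writer.TryComplete();
        return _loop;
    }

    private async Task ProcessAsync()
    {
        // ReadAllAsync yields the remaining items and ends after the writer completes.
        await foreach (var message in _queue.Reader.ReadAllAsync())
        {
            Processed.Add(message);
        }
    }
}
```

A sender that gets false back from TryPost could retry against wherever the grain is activated next, which is exactly the part that still needs cluster-level coordination.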