Current Implementation

The partition actor is responsible for knowing where the actors of one partition of gam actors are spawned, and it also handles their activation.
The partition is defined by a hash ring in which all members that support the kind of actor participate.
A request for a virtual actor is routed to the partition actor on the member whose hash is closest to the hash of the actor name.
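The closest-hash routing could be sketched roughly as follows. This is a minimal illustration, not the actual gam code: the hash function, the `ownerMember` name, and the "first member clockwise on the ring" distance metric are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hash returns the FNV-1a hash of a string; the real implementation
// may use a different hash function.
func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ownerMember picks the member whose hash is closest to the actor
// name's hash, wrapping around the ring (uint32 arithmetic wraps).
// Names and signature are illustrative, not the actual gam API.
func ownerMember(actorName string, members []string) string {
	if len(members) == 0 {
		return ""
	}
	target := hash(actorName)
	best := members[0]
	bestDist := hash(members[0]) - target
	for _, m := range members[1:] {
		if d := hash(m) - target; d < bestDist {
			best, bestDist = m, d
		}
	}
	return best
}

func main() {
	members := []string{"node1:8080", "node2:8080", "node3:8080"}
	// The same actor name always routes to the same member as long
	// as the member list is unchanged.
	fmt.Println(ownerMember("user-42", members))
}
```

The important property is determinism: every node computes the same owner for the same actor name, so no coordination is needed for lookups.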
The partition actor spawns a remote actor using a random remote host and the normal remote activator interface.
The partition actor caches the remote activation.
There is an additional cache implemented in pidCachePartitionActor, which is spawned on each node and caches the remote location of an actor so that the responsible partition does not have to be asked on every request.
This cache is only invalidated when the actor is terminated, which is detected by watching the remote pid.
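The per-node cache with watch-based invalidation could look roughly like this. It is a sketch: the `pidCache` type, the injected `askPartition` lookup, and the `watch`/`OnTerminated` callbacks are illustrative stand-ins for the actual Watch/Terminated machinery.

```go
package main

import "fmt"

// pidCache sketches the per-node cache: lookups hit the cache first
// and only ask the responsible partition on a miss; a watch on the
// remote pid removes the entry when the actor terminates.
type pidCache struct {
	pids  map[string]string
	watch func(pid string) // stand-in for watching the remote pid
}

// Get returns the cached pid, or asks the responsible partition once
// and caches the answer.
func (c *pidCache) Get(name string, askPartition func(string) string) string {
	if pid, ok := c.pids[name]; ok {
		return pid
	}
	pid := askPartition(name)
	c.pids[name] = pid
	c.watch(pid) // invalidation is driven by the Terminated signal
	return pid
}

// OnTerminated is what the Terminated handler would call.
func (c *pidCache) OnTerminated(name string) { delete(c.pids, name) }

func main() {
	calls := 0
	c := &pidCache{pids: map[string]string{}, watch: func(string) {}}
	ask := func(name string) string { calls++; return "pid-" + name }
	c.Get("a", ask)
	c.Get("a", ask) // second lookup is served from the cache
	fmt.Println(calls) // 1
}
```

Note that nothing here invalidates the cache when responsibility shifts, which is exactly the gap the "Local Spawning" section below addresses.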
When the cluster member list changes, the partition actor tries to hand over the actors that are no longer its responsibility to the new responsible partition:
Send a takeover message to the responsible partition actors
Forget its own state
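The handover on a member-list change could be sketched as below. The `Takeover` type, the `handover` function, and the injected `ownerOf` lookup are assumed names for illustration, not the actual gam messages.

```go
package main

import "fmt"

// Takeover carries the activations a partition hands over to the
// newly responsible member. The type is illustrative.
type Takeover struct {
	Pids map[string]string // actor name -> remote pid address
}

// handover splits the local activation map by new ownership: entries
// this member still owns are kept, the rest are grouped into one
// Takeover per new owner and forgotten locally. ownerOf stands in
// for the hash-ring lookup.
func handover(self string, activations map[string]string,
	ownerOf func(name string) string) map[string]*Takeover {

	out := map[string]*Takeover{}
	for name, pid := range activations {
		owner := ownerOf(name)
		if owner == self {
			continue // still our responsibility
		}
		t := out[owner]
		if t == nil {
			t = &Takeover{Pids: map[string]string{}}
			out[owner] = t
		}
		t.Pids[name] = pid
		delete(activations, name) // forget our state
	}
	return out
}

func main() {
	acts := map[string]string{"a": "pid-a", "b": "pid-b"}
	moved := handover("node1", acts, func(name string) string {
		if name == "b" {
			return "node2"
		}
		return "node1"
	})
	fmt.Println(len(acts), len(moved["node2"].Pids)) // 1 1
}
```

The asynchrony problem described under Weaknesses lives between these two steps: the ring already points at the new owner while the Takeover message is still in flight.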
Weaknesses:
The logical change of responsibility acts directly on member-list changes, while the takeover is asynchronous and delayed. A spawned remote actor is therefore at least temporarily not known to any partition actor. A virtual actor that receives new messages in this window (from new nodes with an empty local pid cache) will be respawned on a new random host. -> Duplicate virtual actors
When a partition actor leaves the cluster unexpectedly, all its remote handles are no longer known to any other partition actor. -> Lost tracking of remote actors, duplicate virtual actors
Possible Solutions
In the following I try to provide a collection of possible implementation directions. The goal is not to just implement one of them, but to discuss which approach fits best.
Local Spawning
As a quick fix I tried spawning virtual actors directly on the node whose partition actor is responsible:
Spawn locally
When partition responsibility shifts, terminate all actors that are now no longer in the responsibility of the local partition
Automatically invalidate the pidCachePartitionActor when the responsible node shifts
I put this to the test and it worked quite well in a scenario with persistent, Cassandra-backed actors, but it has two drawbacks:
There is no safeguard ensuring that the new actor is not spawned before the last event was handled
Actor moves are not desirable for every scenario.
So, to make this available to others, it would require at least a configurable strategy.
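The cleanup step of the local-spawning approach could be sketched like this. `onTopologyChange`, `ownerOf`, and `stop` are hypothetical stand-ins for the actual gam calls.

```go
package main

import "fmt"

// onTopologyChange sketches the "spawn locally" cleanup: every
// locally spawned virtual actor whose name no longer hashes to this
// member is stopped and removed, so the next request activates it on
// the newly responsible node.
func onTopologyChange(self string, local map[string]string,
	ownerOf func(string) string, stop func(pid string)) int {

	stopped := 0
	for name, pid := range local {
		if ownerOf(name) != self {
			stop(pid)           // terminate the actor...
			delete(local, name) // ...and drop it from the local registry
			stopped++
		}
	}
	return stopped
}

func main() {
	local := map[string]string{"a": "pid-a", "b": "pid-b"}
	n := onTopologyChange("node1", local,
		func(name string) string {
			if name == "a" {
				return "node1"
			}
			return "node2"
		},
		func(pid string) { fmt.Println("stopping", pid) })
	fmt.Println(n) // 1
}
```

The first drawback above is visible in the sketch: `stop` only requests termination, so nothing prevents the new owner from activating the actor before the old instance has drained its mailbox.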
Activation Strategy Configuration
To further improve the activation behavior, it would be possible to create an interface that provides different selectable activation strategies.
Benefits:
Allows configuration of local spawning
Drawbacks:
Still no controlled activation behavior
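Such an interface could look roughly like this. The `ActivationStrategy` interface and both implementations are hypothetical; gam would supply the member list and the partition's own address.

```go
package main

import "fmt"

// Member describes a cluster member that supports the actor kind.
type Member struct{ Address string }

// ActivationStrategy is a hypothetical pluggable interface: given the
// responsible partition's own address and the candidate members, it
// decides where a virtual actor is activated.
type ActivationStrategy interface {
	PickHost(partitionAddr string, members []Member) string
}

// LocalActivation always activates on the partition's own node
// (the "spawn locally" quick fix above).
type LocalActivation struct{}

func (LocalActivation) PickHost(partitionAddr string, _ []Member) string {
	return partitionAddr
}

// RandomActivation mirrors today's behavior: a random member. The
// random source is injected so the strategy stays testable.
type RandomActivation struct{ Rand func(n int) int }

func (r RandomActivation) PickHost(_ string, members []Member) string {
	return members[r.Rand(len(members))].Address
}

func main() {
	members := []Member{{"node1"}, {"node2"}}
	var s ActivationStrategy = LocalActivation{}
	fmt.Println(s.PickHost("node1", members)) // node1
}
```

As noted under Drawbacks, this only chooses *where* to activate; it does not make the activation itself any more controlled.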
Configurable partition actor storage
Looking at Orleans, there seems to be pluggable storage behind the lookup of remote actors. The option to persist the dictionary in an external store could be used to avoid losing state when members leave the cluster unexpectedly.
Benefits:
Reduces losses
Allows stalling new requests until the state is restored
Drawbacks:
Partition actor would have more complex logic.
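A minimal shape for such pluggable storage might be the following. The `PartitionStore` interface and `memoryStore` are assumptions, loosely inspired by Orleans' external lookup storage, not an existing gam API.

```go
package main

import "fmt"

// PartitionStore is a hypothetical pluggable store. A partition actor
// would write every activation through it and restore the dictionary
// via Load on (re)start, stalling new requests until Load returns.
type PartitionStore interface {
	Save(actorName, pid string) error
	Delete(actorName string) error
	Load() (map[string]string, error)
}

// memoryStore is an in-memory reference implementation; a real one
// could be backed by Cassandra, Redis, etc.
type memoryStore struct{ data map[string]string }

func newMemoryStore() *memoryStore {
	return &memoryStore{data: map[string]string{}}
}

func (s *memoryStore) Save(name, pid string) error {
	s.data[name] = pid
	return nil
}

func (s *memoryStore) Delete(name string) error {
	delete(s.data, name)
	return nil
}

func (s *memoryStore) Load() (map[string]string, error) {
	out := make(map[string]string, len(s.data))
	for k, v := range s.data {
		out[k] = v
	}
	return out, nil
}

func main() {
	var st PartitionStore = newMemoryStore()
	st.Save("user-1", "pid-1")
	m, _ := st.Load()
	fmt.Println(m["user-1"]) // pid-1
}
```

The added complexity mentioned under Drawbacks comes from the write-through and restore paths, which the partition actor would have to interleave with normal message handling.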
Interface in front of full partition actor behavior
Instead of separate configuration for the activation strategy and the dictionary, an interface in front of the partition actor would allow experimenting with completely different behavior without breaking changes for current API users.
Staged Cluster Join
When a cluster member joins, its state should be "Joining"; when it leaves, "Leaving".
During these stages an active, controlled handover is performed that guarantees consensus regarding actor responsibilities.
During the handover, all activation requests for the members closest in the hash ring have to wait or be denied.
Under this heading I would collect all the different variants of reaching such a consensus. We could look either in the direction of Akka or of Orleans.
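The member lifecycle described above could be modeled as a small state machine, regardless of which consensus variant is chosen. This is a sketch; the state names beyond "Joining"/"Leaving", the `Member` type, and the methods are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// MemberState models a staged lifecycle: Joining -> Active ->
// Leaving -> Left. Activations are only served while Active; during
// handover they have to wait or be denied.
type MemberState int

const (
	Joining MemberState = iota
	Active
	Leaving
	Left
)

var errNotActive = errors.New("handover in progress, activation denied")

// Member is a minimal sketch of the staged transitions, not an
// actual gam type.
type Member struct{ state MemberState }

// CompleteHandover moves Joining->Active and Leaving->Left once the
// responsibility consensus has been reached.
func (m *Member) CompleteHandover() {
	switch m.state {
	case Joining:
		m.state = Active
	case Leaving:
		m.state = Left
	}
}

func (m *Member) BeginLeave() { m.state = Leaving }

// Activate only succeeds in the Active state; here it just fabricates
// a pid string to keep the example self-contained.
func (m *Member) Activate(name string) (string, error) {
	if m.state != Active {
		return "", errNotActive
	}
	return "pid-" + name, nil
}

func main() {
	m := &Member{state: Joining}
	if _, err := m.Activate("user-1"); err != nil {
		fmt.Println(err) // denied while joining
	}
	m.CompleteHandover()
	pid, _ := m.Activate("user-1")
	fmt.Println(pid)
}
```

The consensus protocol itself (whether Akka-style gossip with vector clocks or an Orleans-style membership table) would decide *when* CompleteHandover may fire; the state machine only enforces that no activation slips through before that.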