akka / akka-meta

This repository is dedicated to high-level feature discussions and for persisting design decisions.
Apache License 2.0

Akka Typed: Simplified Actor Lifecycle #21

Open rkuhn opened 8 years ago

rkuhn commented 8 years ago

While implementing the new akka.typed.internal.ActorCell based on akka.actor.ActorCell I wondered about the complexity of the internal mechanics in relation to the user-visible feature set. A significant part of the cognitive load for understanding the implementation is caused by the fact that the Actor and all its sub-hierarchy are suspended while waiting for a failure verdict from the supervisor. When that verdict comes back, we either Resume/Stop (the simple cases) or Restart (the most complex case). Restarting has the goal and benefit of keeping the mailbox around and the ActorRef stable across the act of handling a failure. Resuming has some built-in complications due to the ability to fail during actor creation and recover from it. Stopping is rather straightforward since it is a one-way street with very simple semantics.

Proposal 1 (mostly a mental stepping stone)

Remove the ability to recover from failures during creation: instead of escalating an ActorInitializationException the actor terminates unconditionally. The supervisor should be informed about this abnormal termination by way of a FailedTerminated signal that does not allow any built-in reaction; a retry would have to be done in the logic handling this signal.

This would simplify internal book-keeping somewhat and remove some complex code paths that are needed for getting the right information into the right places (so that Resume can be turned into Create for example).

Proposal 2

A more radical proposal occurred to me when contemplating where the complexity of the ActorCell stems from: it is complicated both by hierarchical supervision and by the ability to restart with a stable ActorRef. Both of these deviations from the Actor Model (and from other implementations) are desirable; I am not questioning their existence. But with the new way of composing behaviors in Akka Typed I do think that we can do something about the complexity.

My proposal is to remove the suspension logic and asynchronous restartability from the ActorCell. Restarting can be modeled more efficiently by a behavior decorator that decides and recreates synchronously.

The consequence would be that only termination is signaled to the supervisor, including the information about whether it was of normal or abnormal origin. This will simplify the ActorCell tremendously, taking away the multitude of suspension races that define its current design. The suspension counter would be turned into an isTerminating flag that inhibits the processing of messages while waiting for the sub-hierarchy to shut down.

But what about fault-handling delegation?

The most important notion behind hierarchical supervision is that the handling of failures is delegated to a supervisor instead of burdening it onto the client (as is done by virtually every other framework). We modeled this very directly by message-passing since version 2.0-M1 for a very simple reason: remote deployment.

As discussed in #18 this feature should be removed, opening up other possibilities. The supervisor is of course free to enrich any child actor it creates with a behavior decorator that catches exceptions and reacts appropriately (e.g. by using the nested behavior factory to perform a full restart of the actor). If the supervisor wants to keep track of such restarts, normal messages can be sent from the decorator as notifications—since the primary responsibility for keeping the child actor running now lies with the child’s wrapped behavior, these notifications can be delivered by default using at-most-once semantics without sacrificing safety or liveness.
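The decorator idea above can be sketched in plain Scala. This is only an illustration under stated assumptions: `Behavior`, `Restarter`, `restarter` and `Summer` are hypothetical names invented here and are not the Akka Typed API. The sketch shows a wrapper that catches an exception from the wrapped behavior, calls a notification hook, and recreates the behavior from its factory within the same processing step, so no suspension is involved.

```scala
object RestartSketch {
  // Hypothetical minimal behavior model; NOT the Akka Typed API.
  trait Behavior[T] {
    def receive(msg: T): Behavior[T] // returns the behavior for the next message
  }

  // Decorator: runs the wrapped behavior and, on a non-fatal exception,
  // rebuilds it from the factory synchronously.
  final class Restarter[T](
      factory: () => Behavior[T],
      onRestart: Throwable => Unit, // best-effort notification hook (assumption)
      current: Behavior[T]
  ) extends Behavior[T] {
    def receive(msg: T): Behavior[T] = {
      val next =
        try current.receive(msg)
        catch {
          case scala.util.control.NonFatal(e) =>
            onRestart(e)
            factory() // fresh state; the failing message is dropped
        }
      new Restarter[T](factory, onRestart, next)
    }
  }

  def restarter[T](factory: () => Behavior[T])(onRestart: Throwable => Unit): Behavior[T] =
    new Restarter[T](factory, onRestart, factory())

  // Example child: keeps a running total and fails on negative input.
  final class Summer(total: Int, sink: Int => Unit) extends Behavior[Int] {
    def receive(msg: Int): Behavior[Int] = {
      require(msg >= 0, "negative input")
      val updated = total + msg
      sink(updated)
      new Summer(updated, sink)
    }
  }
}
```

In this sketch the supervisor never sees the failure unless `onRestart` tells it, which matches the at-most-once notification described above; whether the real decorator would drop or retry the failing message is left open here.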

Another consideration is that composed behaviors within a single actor can also make use of the behavior decorators for specialized failure handling.

Interaction with DeathWatch

Non-parents who watch an actor should only be notified of that actor’s termination, regardless of the reason. Parents on the other hand should have the ability to react to abnormal termination differently than to normal termination. This raises the question of how to expose this difference in a way that is consistent with general DeathWatch.

One way would be to create a new feature that does not interact with DeathWatch, meaning that spawning a child actor would generate parent notifications and watching that child actor would in addition generate the Terminated signal. The question would then be at which point it should be allowed to recreate a child actor with the same name as the failed one.

Another way would be to add a flag to the Terminated signal that would only ever be true within a supervisor after its child actor has failed. Reusing the signal would mean that failures are only communicated if the child actor has been watched—just like for normal termination. This would also play nicely with the DeathPact logic in that a one-off actor that is created without caring about its result will also not require any code to avoid the escalation of failures.
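As a rough sketch of this second option, assuming a boolean field on the signal: the types below (`ChildRef`, `Terminated`, `Reaction`) and the functions `signalFor` and `onTerminated` are hypothetical and only illustrate the shape of the idea, not the actual Akka Typed signal.

```scala
object TerminatedSketch {
  // Hypothetical types for illustration; not the actual Akka Typed signal.
  final case class ChildRef(name: String)
  final case class Terminated(ref: ChildRef, failed: Boolean)

  sealed trait Reaction
  case object Ignore extends Reaction
  final case class Respawn(name: String) extends Reaction

  // The flag is only ever true when the watcher is the parent of a failed child;
  // every other watcher sees a plain termination.
  def signalFor(child: ChildRef, childFailed: Boolean, watcherIsParent: Boolean): Terminated =
    Terminated(child, failed = childFailed && watcherIsParent)

  // A parent's handler: abnormal termination triggers a reaction,
  // normal termination is ignored in this example.
  def onTerminated(signal: Terminated): Reaction = signal match {
    case Terminated(ChildRef(name), true) => Respawn(name)
    case Terminated(_, false)             => Ignore
  }
}
```

A parent that never watches the child receives no signal at all in this model, which is what lets a fire-and-forget child fail without any extra code in the parent.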

Other Benefits

One of the more troublesome questions with hierarchical failure handling is that with Akka Actor it feels a bit like we just moved the catch-block into the supervisor, including the burden of having to know about the child actor’s failure modes. The point of the let-it-crash pattern is precisely to avoid this kind of coupling: it should be enough to have the supervisor realize that an action is necessary without offering further details. The proposed change would express this shift of mindset quite nicely, separating the handled failures clearly from the unhandled ones.

More technically, this change would remove the horrendous hack of Failed.decide(verdict), allowing the notification to become immutable once again.

And the biggest one: SupervisorHierarchySpec will finally become understandable to mere mortals (including my current self).

The Plan

This whole discussion and proposal is of course isolated to the new implementation of Akka Typed; it has no bearing on the untyped actor implementation. One thing it does affect, though, is the ability to mix typed and untyped actors within the same ActorSystem: it will likely turn out to be impractical to implement this interoperability feature, which would mean that systems migrating from untyped to typed mode will have two ActorSystems. Sending messages from one to the other is of course still possible.

@akka/akka-team What do you think?

hseeberger commented 8 years ago

Great proposal (the second one)! I like the simplification.

I can't see any substantial issues for any of the projects I'm contributing to.

Heiko

On 12 Jun 2016, at 13:43, Roland Kuhn wrote: (quoted copy of the original post above)

patriknw commented 8 years ago

Sounds good to take advantage of the fact that children can only be local.

Another requested feature, which might be related, is to allow for backoff restarts with bounded/dropping mailbox.

Not being able to mix typed and untyped doesn't sound so good. It will for example take time until we have all untyped features in typed, e.g. Cluster, Persistence, Sharding, ...

ktoso commented 8 years ago

Heh my mailbox was too overwhelmed to respond timely here – yeah I'd really like to investigate if we can build in backoff if we change supervision.

rkuhn commented 8 years ago

Backoff implies that there is a period during which the Actor is suspended—not running even though messages are in the mailbox. This is precisely what complicates the internals a great deal and what should not really be there—my opinion is that this also overloads the meaning of the mailbox with other purposes that are not strictly needed (in particular it goes against the Actor Model).

What we could do is to add local stashing for bounded amounts and periods to the behavior decorator that implements restarts, where one issue is how to drain the stash without monopolizing the executor thread.
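One possible shape for such a stash, as a sketch only: `BoundedStash` is a hypothetical type, the time bound is not modeled, and draining in fixed-size batches is just one assumed way to avoid monopolizing the thread (the caller would process one batch and then schedule the next drain step).

```scala
object StashSketch {
  import scala.collection.immutable.Queue

  // Hypothetical bounded buffer for messages that arrive while a restart is pending.
  final case class BoundedStash[T](
      capacity: Int,
      items: Queue[T] = Queue.empty[T],
      dropped: Int = 0
  ) {
    // Keep the message if there is room, otherwise drop it and count the drop.
    def stash(msg: T): BoundedStash[T] =
      if (items.size < capacity) copy(items = items.enqueue(msg))
      else copy(dropped = dropped + 1)

    // Hand out at most `batch` messages in arrival order; the rest stay stashed
    // so the caller can yield before draining further.
    def drain(batch: Int): (List[T], BoundedStash[T]) = {
      val (now, later) = items.splitAt(batch)
      (now.toList, copy(items = later))
    }
  }
}
```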

Maybe we should first answer what the purpose is of allowing backoff restarts with stable mailbox. Is this really something we want to offer? Whatever the reason is, it must deal with the thundering herd problem that is created by any form of prolonged suspension.

patriknw commented 8 years ago

I think it can drop all messages during backoff restart, if that makes it easier.

The difference from a stop, backoff, start sequence is that clients don't have to care about getting a new actor ref.

We have both variants today as a separate actor, but it was never perfect.

rkuhn commented 8 years ago

Right: we can fold those separate actors into the same actor by way of a Behavior decorator. We could even allow the user to specify a replacement function that computes an “out of office” reply during the backoff period.
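A minimal sketch of that folding, under loud assumptions: `Behavior` and `Backoff` below are hypothetical and not the Akka Typed API, time is passed in explicitly instead of using a scheduler, and the "out of office" function is modeled as a plain callback that receives each message arriving during the backoff period.

```scala
object BackoffSketch {
  // Hypothetical minimal model; NOT the Akka Typed API. The current time is an
  // explicit argument so the sketch stays deterministic.
  trait Behavior[T] {
    def receive(nowMillis: Long, msg: T): Behavior[T]
  }

  final class Backoff[T](
      factory: () => Behavior[T],
      delayMillis: Long,
      outOfOffice: T => Unit,          // user-supplied reply during backoff (assumption)
      state: Either[Long, Behavior[T]] // Left(resumeAt) while backing off
  ) extends Behavior[T] {
    private def next(s: Either[Long, Behavior[T]]): Behavior[T] =
      new Backoff[T](factory, delayMillis, outOfOffice, s)

    def receive(nowMillis: Long, msg: T): Behavior[T] = state match {
      case Left(resumeAt) if nowMillis < resumeAt =>
        outOfOffice(msg) // still backing off: answer on the actor's behalf
        this
      case Left(_) =>
        // Backoff period is over: recreate the behavior, then process the message.
        next(Right(factory())).receive(nowMillis, msg)
      case Right(inner) =>
        try next(Right(inner.receive(nowMillis, msg)))
        catch {
          case scala.util.control.NonFatal(_) => next(Left(nowMillis + delayMillis))
        }
    }
  }
}
```

Because the decorator is itself the actor's behavior, clients keep the same reference throughout; messages that arrive during the backoff are not queued but answered by the callback, which sidesteps the thundering herd concern raised earlier in the thread.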