Akka.Remote enhancement: disable or relax quarantines

Aaronontheweb commented 6 years ago

I'm taking @nvivo's suggestion from a long time ago and running with it here. In my humble opinion, the quarantining mechanism for Akka.Remote is more of an annoyance than a help, and users should have the option to disable it entirely if they so choose. The prevailing rationale I've picked up from talking to users who run into quarantining issues is: "it is intensely more annoying to have to restart an entire node in my cluster than it is to miss a single Terminated notification from a deathwatched actor."

I get the reason why the JVM does this: to guarantee a level of consistency across applications when system messages fail to get delivered. Makes sense. However, I think the burden that this design choice puts on users is more onerous than is necessary.

Therefore, I'd like to make an option available for users who prioritize availability above consistency to say "I don't care if we lose a system message - let that be my problem and don't quarantine anything."

How many of our users would actually use this if we implemented it? Please weigh in here.

rogeralsing commented 6 years ago

A good start would be to list all cases where this might affect Akka.NET, IMO. E.g. how would this affect cluster routers, or routers talking to remote nodes? Potentially this could end up with routers having many dead routees.

If the responsibility of handling this ends up on the devs, at least they need to know what the implications might be.

(We scrapped quarantines in PA, instead doing it more like services doing service discovery to other services... identity can be simulated by having name + suffix. primitive, yet simple to reason about)

Aaronontheweb commented 6 years ago

The cases that are explicitly affected:

Remote deathwatch is no longer guaranteed to work
Remotely deployed actor supervision / restarts are no longer guaranteed to work

This has no impact on things like cluster group routing, but it could cause major problems with systems like Cluster.Singleton, ClusterClient, DistributedPubSub, and Cluster.Sharding, all of which depend on Terminated notices for various things. The biggest issue, though, would be that end-user remote actor lifecycle monitoring would no longer be reliable either.

Removing quarantining altogether would get kind of hairy fast.

I'm actually starting to think that the relatively high occurrence of quarantines I've seen reported by some users (i.e. it happens when things go berserk right at system startup or when making major changes to a running cluster, such as deploying a new version of the service) might be a bug:

http://getakka.net/articles/configuration/akka.remote.html

Technically speaking, the quarantine should only kick in if there are system messages that haven't been delivered for five days by default. We're seeing systems do that much more quickly. In that case, maybe I should focus my attention on whether or not we're implementing the quarantine protocol correctly - since that would actually be a better solution to this problem than trying to work around it altogether.
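
For reference, here is a rough sketch of the akka.remote HOCON settings that, as far as I know, bear on system message delivery and quarantining. The values are illustrative and written from memory, not authoritative defaults, so check each one against the configuration page linked above before relying on it. Which of these, if any, corresponds to the "five days" figure is an assumption to verify.

```hocon
akka.remote {
  # Max number of unacknowledged system messages buffered per remote system.
  # If the buffer fills up, the remote system is treated as failed.
  system-message-buffer-size = 20000

  # How often unacknowledged system messages are resent.
  resend-interval = 2 s

  # How long to keep retrying delivery of system messages to a remote
  # system that has never been confirmed alive by this node.
  initial-system-message-delivery-timeout = 3 m

  # How long to wait before reattempting a connection to an address
  # that previously failed.
  retry-gate-closed-for = 5 s

  # How long quarantine markers are retained before being pruned.
  prune-quarantine-marker-after = 5 d
}
```

If quarantines trigger within seconds or minutes, the shorter timeouts and the buffer limit above look like more plausible places to investigate than the multi-day setting.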

akkadotnet / akka.net

Akka.Remote enhancement: disable or relax quarantines #3440