dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

[Docs] [Question] Resolving split-brain scenario #8242

Open peter-perot opened 1 year ago

peter-perot commented 1 year ago

Almost all code examples that involve stateful grains write data to the storage but avoid reading them back programmatically. The reason for this practice is that Orleans automatically reads the state on activation, and when there are no other "leaks" that change the state on the underlying database (e.g. as done by stored procedures), there is no reason to read the data back once it is cached in memory by the grain.
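For illustration, a minimal sketch of that usual pattern (the grain, state, and storage-provider names are made up, Orleans 7 style): the state is loaded implicitly when the grain activates, written on every update, and never read back explicitly.

```csharp
using Orleans;
using Orleans.Runtime;

[GenerateSerializer]
public sealed class CounterState
{
    [Id(0)] public int Value { get; set; }
}

public interface ICounterGrain : IGrainWithStringKey
{
    Task IncrementAsync();
    Task<int> GetValueAsync();
}

public sealed class CounterGrain : Grain, ICounterGrain
{
    private readonly IPersistentState<CounterState> _state;

    public CounterGrain(
        [PersistentState("counter", "Default")] IPersistentState<CounterState> state)
        => _state = state;

    public async Task IncrementAsync()
    {
        _state.State.Value++;
        await _state.WriteStateAsync(); // write-through; the state is never re-read
    }

    // Served purely from the in-memory copy loaded at activation time.
    public Task<int> GetValueAsync() => Task.FromResult(_state.State.Value);
}
```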

But now consider a split-brain scenario, where grain A reads the state at time T1, and a duplicate activation A' of the same grain (created due to the split-brain) handles an update call whose resulting state is written to storage at time T2 > T1.

Now Orleans detects the split-brain scenario and tries to resolve it. But how? I think we need more detailed documentation here on how Orleans achieves this.

Strategy 1: If Orleans only deactivates grain A', then grain A with a stale state stays in memory, and future read-requests will all return stale data.

Strategy 2: If Orleans deactivates both grains, then the next read-request will reinstantiate the grain and re-read the latest state, so we get the correct state in this case.

Q1: What is the strategy that is implemented in Orleans?

Q2: In case the first strategy is implemented, what are the recommended countermeasures to avoid staleness? Do we need to read the state from storage on each read-request? [Clarification: Interim staleness can be tolerated, but staleness forever (in case the state is not changed again) is not an option.]

There are two options for Q2 that come to mind:

Option 1: Re-read the state when a read-request comes in and the last refresh from the storage is older than a defined period.

Option 2: Re-read the state when a read-request comes in and the last refresh from the storage is older than a defined period, but don't suspend the current call until the re-read has finished. Instead return the current potentially stale state and trigger a refresh in the background (with a non-awaited task).
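A sketch of both options, assuming a simple hypothetical account grain; the refresh period and the single-flight guard are illustrative choices of mine, not something Orleans provides out of the box.

```csharp
using Orleans;
using Orleans.Runtime;

[GenerateSerializer]
public sealed class AccountState
{
    [Id(0)] public decimal Balance { get; set; }
}

public interface IAccountGrain : IGrainWithStringKey
{
    Task<decimal> GetBalanceAsync();     // Option 1: refresh before answering
    Task<decimal> GetBalanceFastAsync(); // Option 2: answer from cache, refresh in background
}

public sealed class AccountGrain : Grain, IAccountGrain
{
    // Illustrative staleness bound; tune to your tolerance.
    private static readonly TimeSpan RefreshPeriod = TimeSpan.FromMinutes(5);

    private readonly IPersistentState<AccountState> _state;
    private DateTime _lastRefreshUtc = DateTime.UtcNow; // state was read at activation
    private Task? _pendingRefresh;

    public AccountGrain(
        [PersistentState("account", "Default")] IPersistentState<AccountState> state)
        => _state = state;

    // Option 1: suspend the read until the state has been re-read from storage.
    public async Task<decimal> GetBalanceAsync()
    {
        if (DateTime.UtcNow - _lastRefreshUtc > RefreshPeriod)
        {
            await _state.ReadStateAsync();
            _lastRefreshUtc = DateTime.UtcNow;
        }
        return _state.State.Balance;
    }

    // Option 2: return the potentially stale cached value immediately and start
    // a refresh without awaiting it. The unawaited task still runs on the
    // activation's scheduler; the _pendingRefresh guard keeps it single-flight.
    public Task<decimal> GetBalanceFastAsync()
    {
        if (DateTime.UtcNow - _lastRefreshUtc > RefreshPeriod && _pendingRefresh is null)
        {
            _pendingRefresh = RefreshAsync();
        }
        return Task.FromResult(_state.State.Balance);
    }

    private async Task RefreshAsync()
    {
        try
        {
            await _state.ReadStateAsync();
            _lastRefreshUtc = DateTime.UtcNow;
        }
        catch
        {
            // Best-effort refresh: keep serving the cached value on failure.
        }
        finally
        {
            _pendingRefresh = null;
        }
    }
}
```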

nkosi23 commented 1 year ago

Indeed, more detailed documentation could be useful on this point. I have found this post describing the activation logic. I do not think things have changed a lot since then, so hopefully this is useful:

Let me give you an example. We have three nodes: B, C, and D. Grain[A-1] is activated on node C, and its activation information is saved on node B. Now a new node A joins the cluster; node A is the direct forward of B, so the directory information for Grain[A-1] now belongs to A. When the status of node A becomes active, node B begins to process the AddServer event and prepares to hand over the Grain[A-1] location data it holds to node A. But at this point node D issues a call to Grain[A-1] and looks up the location information on node A; because node B has not yet handed the data over to node A, node D finds no Grain[A-1] location information and, per the default policy, randomly selects a node for activation. Assuming node D is selected, D will now host a Grain[A-1] activation, and the cluster will have two Grain[A-1] activations.

I'd say that if you absolutely cannot afford stale data in a specific scenario, the best approach as of today may be to always read the state from the database and never rely on the cache. The difficulty here is that Orleans cannot reliably help us, for example by notifying us when a duplicate activation has been resolved: when the cluster is not stable, all nodes do not necessarily have the same view of the cluster, so the winning activation may not even know that another activation existed.

Maybe using a storage-based grain directory (instead of the default in-memory directory) would help with that; however, the notification API does not exist for now, so this may not be a solution right now.

Edit: Also, I just remembered that if you use the persistence framework provided by Orleans, you get ETag checks for free, which gives you the peace of mind of knowing that if the ETag you have in memory differs from the one in the database, the write will be rejected and throw an exception. If you do not or cannot use Orleans' built-in persistence framework, rolling out a similar approach by hand may be a solution (or using a database such as CouchDB, which automatically uses a revision sequence number similar to ETags).
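Roughly, with the built-in persistence facade it could look like this (reusing the hypothetical AccountState from the sketch above; InconsistentStateException in Orleans.Storage is what storage providers throw on an ETag mismatch):

```csharp
using Orleans;
using Orleans.Runtime;
using Orleans.Storage;

public interface IAccountWriteGrain : IGrainWithStringKey
{
    Task DepositAsync(decimal amount);
}

public sealed class AccountWriteGrain : Grain, IAccountWriteGrain
{
    private readonly IPersistentState<AccountState> _state;

    public AccountWriteGrain(
        [PersistentState("account", "Default")] IPersistentState<AccountState> state)
        => _state = state;

    public async Task DepositAsync(decimal amount)
    {
        _state.State.Balance += amount;
        try
        {
            // The storage provider compares _state.Etag with the ETag currently
            // stored; if another activation wrote in the meantime, the write is
            // rejected instead of blindly overwriting.
            await _state.WriteStateAsync();
        }
        catch (InconsistentStateException)
        {
            // Our cached copy was stale: reload it and surface the conflict.
            await _state.ReadStateAsync();
            throw;
        }
    }
}
```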

As far as reads are concerned, only you can determine when reading stale data would be unacceptable for your application. My experience is that strict, instant consistency is rarely needed in practice. Also keep in mind that you have the option of exposing two APIs, ReadPotentiallyStale() and CheckEtagAndReloadFromStoreIfNeededBeforeRead(), depending on the consumer, to avoid premature optimization.
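As a sketch of that two-API idea (the method names come from the sentence above; the grain key and return types are made up):

```csharp
using Orleans;

public interface IBalanceReader : IGrainWithStringKey
{
    // Fast path: serve whatever is cached in memory and accept possible staleness.
    Task<decimal> ReadPotentiallyStale();

    // Slow path: compare the cached ETag with storage, reload if it changed,
    // then return the now-current value.
    Task<decimal> CheckEtagAndReloadFromStoreIfNeededBeforeRead();
}
```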

Also, another useful piece of information is that an old documentation page says that the system will converge toward the single-activation guarantee within ~30 seconds, depending on configuration:

In what cases can a split brain (same grain activated in multiple silos at the same time) happen?

During normal operations the Orleans runtime guarantees that each grain will have at most one instance in the cluster. The only time this guarantee can be violated is when a silo crashes or gets killed without a proper shutdown. In that case, there is a ~30 second (based on configuration) window where a grain can potentially get temporarily instantiated in more than one silo. Convergence to a single instance per grain is guaranteed, and duplicate activations will be deactivated when this window closes.

Also, you can take a look at the Orleans paper for more detailed information; however, you don't need to understand it fully to be able to write your application code. You just need to consider the rare possibility of having two instances of an actor while writing your application. The persistence model guarantees that no writes to storage are blindly overwritten in such a case.

The relevant sections of the paper in question say:

3.2 Distributed Directory

Many distributed systems use deterministic placement to avoid maintaining an explicit directory of the location of each component, by consistent hashing or range-based partitioning. Orleans allows completely flexible placement, keeping the location of each actor in a distributed directory. This allows the runtime more freedom in managing system resources by placing and moving actors as the load on the system changes.

The Orleans directory is implemented as a one-hop distributed hash table (DHT) [17]. Each server in the cluster holds a partition of the directory, and actors are assigned to the partitions using consistent hashing. Each record in the directory maps an actor id to the location(s) of its activations. When a new activation is created, a registration request is sent to the appropriate directory partition. Similarly, when an activation is deactivated, a request is sent to the partition to unregister the activation. The single-activation constraint is enforced by the directory: if a registration request is received for a single-activation actor that already has an activation registered, the new registration is rejected, and the address of the existing activation is returned with the rejection.

Using a distributed directory for placement and routing implies an additional hop for every message, to find out the physical location of a target actor. Therefore, Orleans maintains a large local cache on every server with recently resolved actor-to-activation mappings. Each cache entry for a single-activation actor is about 80 bytes. This allows us to comfortably cache millions of entries on typical production servers. We have found in production that the cache has a very high hit ratio and is effective enough to eliminate almost completely the need for an extra hop on every message.

3.9 Eventual Consistency

In failure-free times, Orleans guarantees that an actor only has a single activation. However, when failures occur, this is only guaranteed eventually. Membership is in flux after a server has failed but before its failure has been communicated to all survivors. During this period, a register-activation request may be misrouted if the sender has a stale membership view. The target of the register request will reroute the request if it is not the proper owner of the directory partition in its view. However, it may be that two activations of the same actor are registered in two different directory partitions, resulting in two activations of a single-activation actor. In this case, once the membership has settled, one of the activations is dropped from the directory and a message is sent to its server to deactivate it. We made this tradeoff in favor of availability over consistency to ensure that applications can make progress even when membership is in flux. For most applications this “eventual single activation” semantics has been sufficient, as the situation is rare. If it is insufficient, the application can rely on external persistent storage to provide stronger data consistency. We have found that relying on recovery and reconciliation in this way is simpler, more robust, and performs better than trying to maintain absolute accuracy in the directory and strict coherence in the local directory caches.

peter-perot commented 1 year ago

The paper says that a split-brain situation is always resolved after a ~30 s period, but not how this is achieved: are all activations in the set of duplicates deactivated, or only n-1 of them?

Maybe a hook that notifies the "survivor" grain that a split-brain has occurred and been resolved could help (in case of the n-1 approach).

And it would be helpful to give some recipe for handling split-brain, because if we have a grain that writes state very rarely, reads state very often, and runs into a split-brain scenario when the state is written, the staleness could span days (until the next write happens).

nkosi23 commented 1 year ago

@peter-perot stale grains will be deactivated after being idle for some time, like any other grain (the default is 2 hours if I remember correctly). This is normally not a problem, since grain calls are routed to the right activation once nodes have converged toward a single view of the cluster. The only issue may be if you have set up a timer on the grain to persist its state at regular intervals. While timers cannot delay idleness-related deactivation, such a timer could corrupt your state unless you use ETag-like consistency checks in your storage layer.

The survivor grain may not even know about the other activation; this is the difficulty here. Also, it is not possible to provide general guidelines, as the acceptable tradeoffs depend on your specific application. What I have highlighted are the basic patterns to follow:

  • Never rely on in-memory state, which is quite extreme
  • Use ETag-like consistency checks, either using the built-in persistence framework or an external storage that has such a feature (or by rolling your own ETag mechanism for the relevant entities)

peter-perot commented 1 year ago

@nkosi23 wrote:

@peter-perot stale grains will be deactivated after being idle for some time, like any other grain (the default is 2 hours if I remember correctly). This is normally not a problem, since grain calls are routed to the right activation once nodes have converged toward a single view of the cluster. The only issue may be if you have set up a timer on the grain to persist its state at regular intervals. While timers cannot delay idleness-related deactivation, such a timer could corrupt your state unless you use ETag-like consistency checks in your storage layer.

I know, but the write rarely, read often scenario I described above can nevertheless lead to a state that is stale for days, especially when reading happens at least once per 2-hour period. In this case a stale grain is not deactivated and therefore not reloaded.

Concerning etags: I always use etags (aka concurrency tokens) for write operations to avoid overwriting data with stale data.

The survivor grain may not even know about the other activation; this is the difficulty here. Also, it is not possible to provide general guidelines, as the acceptable tradeoffs depend on your specific application. What I have highlighted are the basic patterns to follow:

  • Never rely on in-memory state, which is quite extreme
  • Use ETag-like consistency checks, either using the built-in persistence framework or an external storage that has such a feature (or by rolling your own ETag mechanism for the relevant entities)

One recipe to avoid staleness could be the following pattern: We provide three variants of read methods:

Optimization approach: In case we have a complex aggregate that is cached, we could compare the ETags of the aggregate root, i.e. the ETag that is in memory with the one in storage. When the ETags are not equal, the data is reloaded. This optimization, combined with the ReadCachedAndRefreshInBackground() method, could keep latency and staleness low in tandem, and the database is not flooded with complex query requests when data is written rarely, so the ETag comparison is sufficient most of the time as an up-to-date check.
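A sketch of that optimization; IOrderStore, GetEtagAsync and LoadAsync are hypothetical abstractions over the storage layer (not Orleans APIs), and the aggregate shape is made up:

```csharp
using Orleans;

// Hypothetical aggregate root; in practice this would be your complex aggregate.
public sealed class OrderAggregate
{
    public string Etag { get; set; } = "";
    // ... order lines, customer data, etc.
}

// Hypothetical storage abstraction standing in for your repository/DAL.
public interface IOrderStore
{
    Task<string> GetEtagAsync(string orderId);      // cheap: reads a single column
    Task<OrderAggregate> LoadAsync(string orderId); // expensive: loads the full aggregate
}

public interface IOrderGrain : IGrainWithStringKey
{
    Task<OrderAggregate?> ReadCachedAndRefreshInBackground();
}

public sealed class OrderGrain : Grain, IOrderGrain
{
    private readonly IOrderStore _store;
    private OrderAggregate? _cached;
    private Task? _pendingRefresh;

    public OrderGrain(IOrderStore store) => _store = store;

    public override async Task OnActivateAsync(CancellationToken cancellationToken)
    {
        _cached = await _store.LoadAsync(this.GetPrimaryKeyString());
        await base.OnActivateAsync(cancellationToken);
    }

    // Serve the cached aggregate immediately; in the background compare only the
    // root ETag and reload the full aggregate when it has changed.
    public Task<OrderAggregate?> ReadCachedAndRefreshInBackground()
    {
        _pendingRefresh ??= RefreshIfChangedAsync();
        return Task.FromResult(_cached);
    }

    private async Task RefreshIfChangedAsync()
    {
        try
        {
            var orderId = this.GetPrimaryKeyString();
            var storedEtag = await _store.GetEtagAsync(orderId);
            if (_cached is null || storedEtag != _cached.Etag)
            {
                _cached = await _store.LoadAsync(orderId); // reload only on a real change
            }
        }
        catch
        {
            // Best effort; keep the cached aggregate on failure.
        }
        finally
        {
            _pendingRefresh = null;
        }
    }
}
```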

nkosi23 commented 1 year ago

I know, but the write rarely, read often scenario I described above can nevertheless lead to a state that is stale for days, especially when reading happens at least once per 2-hour period. In this case a stale grain is not deactivated and therefore not reloaded.

Actually in this case the stale grain would not even receive the read calls: it would stop receiving any call within 30 seconds (all invocations will be routed to the winning grain instead) and therefore the stale grain would be deactivated after 2 hours in the default configuration.

peter-perot commented 1 year ago

@nkosi23 wrote:

I know, but the write rarely, read often scenario I described above can nevertheless lead to a state that is stale for days, especially when reading happens at least once per 2-hour period. In this case a stale grain is not deactivated and therefore not reloaded.

Actually in this case the stale grain would not even receive the read calls: it would stop receiving any call within 30 seconds (all invocations will be routed to the winning grain instead) and therefore the stale grain would be deactivated after 2 hours in the default configuration.

Unfortunately this is not the case. Maybe I have to clarify the notion of "stale" in this context: when I say stale, I'm not talking about the duplicate that is part of the split-brain scenario. Instead, I'm talking about a grain whose in-memory state does not reflect the persisted state. Here is the scenario:

1) Grain A activates and reads its state from storage at time T1.
2) Due to split-brain, a duplicate activation A' handles an update call and writes the new state to storage at time T2 > T1.
3) The split-brain is resolved in a way that leaves A as the surviving activation, so future calls are routed to A.
4) A receives read calls at least once per idle period, so it is never deactivated and never re-reads its state; since writes are rare, its in-memory state stays stale for days (until the next write happens).

Do you see the point? And the question is: What is the best recipe to avoid such a situation?

And it would be nice if the docs described how Orleans resolves a split-brain scenario, so we don't have to guess the strategy or reverse-engineer it from the source code.

ReubenBond commented 1 year ago

Duplicate activations can occur in a few cases, and some are unavoidable. For example, a grain may be activated on an already-evicted silo which simply has not yet learned that it's been evicted (imagine that it also thinks it owns the directory partition for that grain).

So, a silo which is not even a member of the cluster (according to the other silos and the latest version of the membership table) can activate a grain, load the grain's state from storage, and then modify it without the other activation of that grain (on a healthy node in the cluster) ever being aware.

That is assuming that the grain does not perform writes to ensure that it indeed has the latest state. If a grain performs a read/write on every operation, then there is no issue, since the DB can be used to coordinate.

So, if you need to guarantee that you have the latest state and you don't want to perform DB operations to guarantee it on each request, you likely need some form of leasing (caveat emptor). This is the idea behind #2428. It may also be worth reading my last comment on that issue: https://github.com/dotnet/orleans/issues/2428#issuecomment-1083123197

peter-perot commented 1 year ago

So, if you need to guarantee that you have the latest state and you don't want to perform DB operations to guarantee it on each request, you likely need some form of leasing (caveat emptor). This is the idea behind #2428. It may also be worth reading my last comment on that issue: #2428 (comment)

@ReubenBond, thanks for the link to the leasing strategy proposal. Interesting to read, but if I understand it correctly, the discussion is rather about an extension in the Orleans core than something a user of Orleans can implement on its own, right?

ReubenBond commented 1 year ago

Yes, you're right. The general strategy can also be implemented in user code.
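For example, a very rough sketch of what such a user-code lease could look like, in the spirit of #2428. ILeaseStore and its methods are invented here (they could be backed by a table with a conditional update); the point is only that an activation refuses to serve cached state once its lease can no longer be renewed, which bounds how long a stale duplicate can keep answering. It reuses the hypothetical AccountState from the earlier sketch.

```csharp
using Orleans;
using Orleans.Runtime;

// Hypothetical lease store; Orleans does not provide this interface.
public interface ILeaseStore
{
    // Succeeds only if no other live lease exists for grainId.
    Task<bool> TryAcquireAsync(string grainId, string owner, TimeSpan duration);
    // Succeeds only if `owner` currently holds the lease.
    Task<bool> RenewAsync(string grainId, string owner, TimeSpan duration);
}

public interface ILeasedAccountGrain : IGrainWithStringKey
{
    Task<decimal> GetBalanceAsync();
}

public sealed class LeasedAccountGrain : Grain, ILeasedAccountGrain
{
    private static readonly TimeSpan LeaseDuration = TimeSpan.FromSeconds(30);

    private readonly ILeaseStore _leases;
    private readonly IPersistentState<AccountState> _state;
    private readonly string _owner = Guid.NewGuid().ToString(); // identifies this activation
    private DateTime _leaseExpiryUtc; // MinValue => no lease yet

    public LeasedAccountGrain(
        ILeaseStore leases,
        [PersistentState("account", "Default")] IPersistentState<AccountState> state)
    {
        _leases = leases;
        _state = state;
    }

    public async Task<decimal> GetBalanceAsync()
    {
        await EnsureLeaseAsync();
        return _state.State.Balance; // safe to serve from memory while the lease is valid
    }

    private async Task EnsureLeaseAsync()
    {
        if (DateTime.UtcNow < _leaseExpiryUtc) return;

        var grainId = this.GetPrimaryKeyString();
        var acquired = await _leases.RenewAsync(grainId, _owner, LeaseDuration)
                    || await _leases.TryAcquireAsync(grainId, _owner, LeaseDuration);
        if (!acquired)
        {
            // Another activation holds the lease: stop serving rather than risk stale data.
            DeactivateOnIdle();
            throw new InvalidOperationException("Lost the state lease; please retry.");
        }

        // Lease (re)acquired: make sure the cached state is current before serving.
        await _state.ReadStateAsync();
        _leaseExpiryUtc = DateTime.UtcNow + LeaseDuration;
    }
}
```

Note that this trades one storage round-trip per lease period (not per request) for a bounded staleness window of LeaseDuration.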

slawomirpiotrowski commented 4 months ago

Three ideas that could be useful, but have limitations:

Idea 1:
1) During the actual read of a grain's state from persistence, write your silo ID to persistence and remember the time when you did it (per grain).
2) Before writing changes to the grain's persistence, read the membership table and make sure you are not declared dead there. Remember the time when you did this (per silo). If you have already done it after writing your silo ID to that grain's persistence, you can skip this step. Throw if you can't read the membership table (after a few retries) or if you're declared dead.
3) When writing changes to the grain's persistence, confirm that your silo ID is still written in the grain's persistence (throw if it isn't).

(A rough sketch of this approach follows below.)

It has major performance hits and disallows writes unless your membership table is available.
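A rough sketch of Idea 1; IMembershipReader and IFencedStateStore are hypothetical stand-ins for a custom clustering/persistence integration, while ILocalSiloDetails is the Orleans service exposing the local silo's identity. It reuses the hypothetical ICounterGrain/CounterState from the first sketch.

```csharp
using Orleans;
using Orleans.Runtime;

// Hypothetical: reads the membership table.
public interface IMembershipReader
{
    Task<bool> IsSiloAliveAsync(string siloId);
}

// Hypothetical: a state store that supports silo-id "fencing".
public interface IFencedStateStore<TState>
{
    // Reads the state and stamps readerSiloId as the current owner.
    Task<TState> ReadAndStampAsync(string grainId, string readerSiloId);

    // Writes only if expectedOwnerSiloId is still the stamped owner; throws otherwise.
    Task WriteIfOwnerAsync(string grainId, TState state, string expectedOwnerSiloId);
}

public sealed class FencedCounterGrain : Grain, ICounterGrain
{
    private readonly IMembershipReader _membership;
    private readonly IFencedStateStore<CounterState> _store;
    private readonly string _siloId;
    private CounterState _state = new();
    private bool _membershipCheckedSinceStamp;

    public FencedCounterGrain(
        IMembershipReader membership,
        IFencedStateStore<CounterState> store,
        ILocalSiloDetails silo) // identifies the hosting silo
    {
        _membership = membership;
        _store = store;
        _siloId = silo.SiloAddress.ToString();
    }

    public override async Task OnActivateAsync(CancellationToken cancellationToken)
    {
        // Step 1: read the state and stamp our silo id into persistence.
        _state = await _store.ReadAndStampAsync(this.GetPrimaryKeyString(), _siloId);
        _membershipCheckedSinceStamp = false;
        await base.OnActivateAsync(cancellationToken);
    }

    public async Task IncrementAsync()
    {
        // Step 2: before the first write after the stamp, check the membership
        // table and refuse to write if this silo has been declared dead.
        if (!_membershipCheckedSinceStamp)
        {
            if (!await _membership.IsSiloAliveAsync(_siloId))
            {
                throw new InvalidOperationException("Silo is declared dead; refusing to write.");
            }
            _membershipCheckedSinceStamp = true;
        }

        _state.Value++;

        // Step 3: the store rejects the write if another silo stamped the row
        // after our read (i.e. a duplicate activation took over).
        await _store.WriteIfOwnerAsync(this.GetPrimaryKeyString(), _state, _siloId);
    }

    public Task<int> GetValueAsync() => Task.FromResult(_state.Value);
}
```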

Idea 2:
1) Create a separate database user account for each silo.
2) Delete a silo's database user account when it is declared dead.

It's only possible with database providers that allow creating many users, and impossible with some managed databases. Database systems don't usually change the access rights of already authenticated sessions, but this approach can still guard against an arbitrarily long suspension of a VM running an Orleans silo: the connection would be dropped by the database server, so when the declared-dead silo tries to reconnect automatically (after its VM is resumed), it would fail.

So we would need to assume some time is required to propagate the revocation of access rights: the time it takes to write to the membership table and revoke the rights, the time it takes to propagate the revocation if the database system is distributed, the time it takes until the TCP connection of a suspended VM is dropped by the database system, and the time it takes for a resumed VM to re-execute its write transaction (probably assume it equals the database access timeout). Sum those times, add some slack, and treat the result as our "suspicious time" (an upper limit on how long a declared-dead silo could still write something from outside the cluster).

For each grain that reads its state during the "suspicious time", consider that grain's state suspicious, and automatically reload the state before the next request if the previous state was marked suspicious (or just compare ETags if reading the full state is expensive).
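A compact sketch of that check; the window length is an assumption, and lastSiloDeathUtc would have to come from your own clustering integration (e.g. when you observe a silo being declared dead):

```csharp
public static class SuspicionCheck
{
    // Sum of the propagation delays discussed above, plus some slack (illustrative value).
    public static readonly TimeSpan SuspiciousWindow = TimeSpan.FromSeconds(90);

    // State loaded within the window after a silo death is suspicious: a late
    // write from the declared-dead silo could still land after this read, so
    // reload (or ETag-check) it before serving the next request.
    public static bool IsStateSuspicious(DateTime stateLoadedUtc, DateTime lastSiloDeathUtc) =>
        stateLoadedUtc >= lastSiloDeathUtc &&
        stateLoadedUtc <= lastSiloDeathUtc + SuspiciousWindow;
}
```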

Idea 3: The same as idea 2, but without messing with access rights.

Instead, assume the VM running the Orleans silo is not allowed to be suspended or paused in a debugger. Right after any silo is declared dead, assume some fixed amount of time as our "suspicious time" (allowing some time for GC and so on).

I think most of these ideas could be implemented in user code, without changing Orleans code, by using custom persistence and custom clustering.

It would be easier if the application could subscribe to notifications about a silo being declared dead.