dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

[Proposal] Grain directory leases #9225

Open ReubenBond opened 1 week ago

ReubenBond commented 1 week ago

Fixes #2428 Fixes #5687 Fixes #8242

In #9103, we introduced a strongly consistent directory, leveraging the strong guarantees which Orleans' powerful membership provides, as discussed in #1323. This proposal is for a mechanism to go the last mile and offer strong single-activation guarantees by means of leases. The new grain directory is already strongly consistent, but strong single-activation guarantees rely on evicted silos ceasing operation when there is a potential for a grain to be activated elsewhere. Leases are the only practical way to implement this kind of guarantee (see this comment).

The proposal is to add an implicit leasing mechanism based on membership. Silos will use it to self-terminate or deactivate activations, and the directory will use it to prevent registrations. The proposed mechanism is this:

  1. Instead of evicting registrations from the directory when a silo is evicted, leave a tombstone entry indicating the latest possible membership version the silo was evicted in.
  2. Disallow deregistration of those tombstone entries until at least a certain time has passed since the silo was evicted. This involves keeping track of which membership updates have been seen locally, and when. Skipped updates are ok: the directory pessimistically chooses the newer update as the start time for lease expiration.
  3. If a silo does not manage to refresh its membership within the leasing period, it self-terminates.
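
The directory side of steps 1 and 2 can be sketched roughly as follows. This is only an illustration in Python, not Orleans code: the names `Tombstone`, `DirectoryPartition`, `on_membership_update`, and `try_register` are hypothetical, time is passed in explicitly as seconds, and the lease length is an arbitrary parameter.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Tombstone:
    silo: str
    # Latest possible membership version in which the silo was evicted.
    # Here that is the version in which this partition learned of the eviction.
    evicted_in_version: int


@dataclass
class DirectoryPartition:
    lease_seconds: float  # assumed lease length; the proposal does not fix a value
    # membership version -> local time at which this partition first saw it
    seen_versions: Dict[int, float] = field(default_factory=dict)
    # grain id -> hosting silo
    registrations: Dict[str, str] = field(default_factory=dict)
    # grain id -> tombstone left in place of an evicted silo's registration
    tombstones: Dict[str, Tombstone] = field(default_factory=dict)

    def on_membership_update(self, version: int, evicted: Set[str], now: float) -> None:
        # Record when this version was first seen locally.
        self.seen_versions.setdefault(version, now)
        # Step 1: replace registrations on evicted silos with tombstones.
        for grain, silo in list(self.registrations.items()):
            if silo in evicted:
                del self.registrations[grain]
                self.tombstones[grain] = Tombstone(silo, version)

    def try_register(self, grain: str, silo: str, now: float) -> Optional[str]:
        """Return the silo that holds the registration, or None if refused."""
        if grain in self.registrations:
            return self.registrations[grain]
        stone = self.tombstones.get(grain)
        if stone is not None:
            # Step 2: the lease clock starts when the eviction version was seen locally.
            lease_start = self.seen_versions[stone.evicted_in_version]
            if now - lease_start < self.lease_seconds:
                return None  # the evicted silo may still be running
            del self.tombstones[grain]
        self.registrations[grain] = silo
        return silo
```

If the partition sees version 4 and then version 6, an eviction that actually happened in the skipped version 5 is recorded against version 6, so the lease clock starts at the later, more pessimistic time.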

The valid leasing period must be calculated based on the membership refresh interval. Leases are extended whenever a new membership version is received by a silo.
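
The silo side (step 3 and the lease extension rule) might look something like the sketch below. Everything here is an assumption for illustration: `lease_duration`, its `missed_refreshes_allowed` multiplier, and the `SiloLease` class are hypothetical, and the proposal does not state the actual formula relating the lease period to the refresh interval.

```python
from dataclasses import dataclass, field


def lease_duration(refresh_interval: float, missed_refreshes_allowed: int = 3) -> float:
    # Hypothetical rule: tolerate a few missed membership refreshes.
    # The proposal only says the period must be derived from the refresh interval.
    return refresh_interval * missed_refreshes_allowed


@dataclass
class SiloLease:
    duration: float
    started_at: float
    last_version: int = -1
    expires_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.expires_at = self.started_at + self.duration

    def on_membership_version(self, version: int, now: float) -> bool:
        """Extend the lease when a newer membership version arrives."""
        # Assumption: only a strictly newer version extends the lease. Whether a
        # successful refresh with an unchanged version also counts is not
        # specified in the proposal.
        if version <= self.last_version:
            return False
        self.last_version = version
        self.expires_at = now + self.duration
        return True

    def must_self_terminate(self, now: float) -> bool:
        # Step 3: the silo stops itself once its lease has run out.
        return now >= self.expires_at
```

For the guarantee to hold, the silo's own lease would need to expire no later than the directory stops honoring the corresponding tombstone, so that the silo has stopped before the directory allows the grain to be registered elsewhere.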

nkosi23 commented 1 week ago

This is probably a layman's question, but would this proposal have meaningful negative implications on throughput if an expired lease has to be checked / confirmed before activating a new grain?

My understanding is that this would not have any negative impact for grains being already active since the lookup process would be mostly unaffected.

rkargMsft commented 1 week ago

Is the tradeoff for this that there's a stronger guarantee that there won't be duplicate activations during the lease period (and ideally no duplicates since the old silo will terminate itself if it can't renew its lease). But there's a longer period where an old, unreachable silo will still be seen to hold the lease so activations won't be placed elsewhere until that lease is given up?

ReubenBond commented 1 week ago

> would this proposal have meaningful negative implications on throughput if an expired lease has to be checked / confirmed before activating a new grain?

No, this does not impact performance. It slightly affects directory hand-off and crash recovery, because we no longer omit activations hosted on crashed silos, but the effect is not meaningful.

> My understanding is that this would not have any negative impact for grains being already active since the lookup process would be mostly unaffected.

That is correct. Leases are checked centrally, periodically, not at the per-grain level.

> But there's a longer period where an old, unreachable silo will still be seen to hold the lease so activations won't be placed elsewhere until that lease is given up?

Yes, that's right: this feature necessarily decreases availability of some subset of grains after a crash.

Specifically: grains known to be hosted on a crashed silo (i.e., registered in other partitions), and grains which were potentially hosted on the crashed silo (i.e., grains belonging to the directory ranges owned by the crashed silo which are not known to be hosted elsewhere).
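
The two affected categories can be illustrated with a small classification sketch. This is a hypothetical model, not the Orleans directory: it assumes grains map to integer hashes, that directory ranges are half-open `(start, end)` intervals, and that live silos can report which grains they currently host (`live_hosts`).

```python
from enum import Enum
from typing import Dict, List, Tuple


class Availability(Enum):
    AVAILABLE = "available"
    KNOWN_ON_CRASHED = "known to be hosted on the crashed silo"
    POSSIBLY_ON_CRASHED = "potentially hosted on the crashed silo"


def classify(
    grain: str,
    grain_hash: int,
    crashed_silo: str,
    surviving_registrations: Dict[str, str],  # grain -> silo, from surviving partitions
    crashed_ranges: List[Tuple[int, int]],    # ranges the crashed silo's partition owned
    live_hosts: Dict[str, str],               # grain -> live silo known to host it
) -> Availability:
    # Category 1: a surviving partition has the grain registered to the crashed silo.
    if surviving_registrations.get(grain) == crashed_silo:
        return Availability.KNOWN_ON_CRASHED
    # Category 2: the registration lived in the crashed silo's own ranges and no
    # live silo is known to host the grain, so it may have been on the crashed silo.
    in_crashed_range = any(start <= grain_hash < end for start, end in crashed_ranges)
    if in_crashed_range and grain not in live_hosts:
        return Availability.POSSIBLY_ON_CRASHED
    return Availability.AVAILABLE
```

Grains in either of the two non-available categories would have to wait out the lease period before they can be activated elsewhere; all other grains are unaffected.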