dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Having an Orleans grain that is always active for periodic short interval background processing #1419

Closed - Eldar1205 closed this issue 8 years ago

Eldar1205 commented 8 years ago

Hi everyone,

Please let me know if an issue isn't the right way to ask questions and discuss design.

I plan to use Orleans in a new Azure cloud service since its virtual actor model fits most of the scenarios that service should handle very well, but one scenario raises the following question: is there a way in Orleans to make sure N reminder grains of the same type are hosted on different silos, and/or to control which silo they are activated on based on custom placement logic?

Motivation: I want to use Orleans in an Azure cloud service that runs background processing and should be highly resilient to failures as well as to silo downtime during in-place upgrades. In my scenario there is a need for a grain that runs short, I/O-intensive background processing tasks every 10 seconds, and if its hosting silo goes down it should be re-activated on another silo in no more than 60 seconds.

During an Azure cloud service upgrade the instances are divided into 5 groups and one group is upgraded at a time, which means that during an upgrade 20% of my instances are always shut down; that makes it challenging to have an "always active" grain. I consulted with Sergey Bykov and we came up with several ideas. One of them was introducing "keep alive" grains whose purpose is to ping the "always active" grain to make sure it's active. For that to work during upgrades, there must always be at least one "keep alive" grain whose hosting silo isn't in the same upgrade group as the silo hosting the "always active" grain, so that the "keep alive" grain can get the "always active" grain re-activated on a silo that isn't shut down at that moment. The "keep alive" grains will be activated by long-interval reminders and will register a short-interval timer, so as long as at least one "keep alive" grain is alive it will activate the "always active" grain. I could register (0.2 x #instances + 1) "keep alive" grain reminders to make sure at least one is always alive during an upgrade, but I'd rather have a more sophisticated approach that creates fewer "keep alive" grains, placed on the correct silos.

centur commented 8 years ago

Do you have strict requirements on the time intervals? E.g., your 10 seconds: can it be 11 or 15 seconds, or is such a lapse unacceptable?

If you're interested in a reliable 'tick' rather than a precise schedule, I can give you a crazy idea for doing it reliably "outside" of Orleans. On a previous project we came up with an interesting concept of a "free pulsing schedule" based on Azure Queues (we thought through the concept but never implemented it): you enqueue a placeholder message with a given invisibility interval and never delete it (or delete it after certain logic, if needed), so the message itself becomes your timer. Your custom pulling agent sees the message pulse in the queue with at least 10 seconds (or whatever invisibility timeout you choose) between pulls. It isn't guaranteed to trigger at an exact 10-second frequency, but you get at least 10 seconds between pulls, plus your pulling agent's own interval.

You can use one queue for multiple scheduled routines by putting the message's invisibility interval inside the message itself, and you can add extra details for the puller to use when invoking your grain. You can also schedule something manually by enqueueing such a message externally.
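For concreteness, here is a minimal sketch of this pulsing-queue idea using the modern Azure.Storage.Queues SDK (which postdates this thread); the queue name, message body, and loop structure are illustrative assumptions, not anything the thread prescribes:

```csharp
// Hedged sketch: a "free pulsing schedule" on an Azure Storage queue.
// The placeholder message is never deleted, so it re-appears after each
// visibility timeout and acts as the next tick.
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class PulsingScheduler
{
    public static async Task PulseLoopAsync(string connectionString)
    {
        var queue = new QueueClient(connectionString, "pulse-schedule"); // illustrative name
        await queue.CreateIfNotExistsAsync();

        // Seed one placeholder message that first becomes visible after 10s.
        // (A real implementation would also handle the message's time-to-live,
        // which would otherwise eventually expire the never-deleted message.)
        await queue.SendMessageAsync("tick", visibilityTimeout: TimeSpan.FromSeconds(10));

        while (true)
        {
            // Receiving the message hides it for another 10 seconds, which is
            // what produces "at least 10 seconds between ticks".
            var response = await queue.ReceiveMessagesAsync(
                maxMessages: 1, visibilityTimeout: TimeSpan.FromSeconds(10));

            foreach (var msg in response.Value)
            {
                // Do the periodic work, e.g. ping the "always active" grain.
                Console.WriteLine($"Tick at {DateTimeOffset.UtcNow:O}");
                // Intentionally no DeleteMessageAsync: the message re-appears
                // after the visibility timeout elapses.
            }

            await Task.Delay(TimeSpan.FromSeconds(1)); // pulling agent's own interval
        }
    }
}
```
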

Effectively you're moving your "trigger signal" outside of your infrastructure (useful if you don't trust your own code to handle certain events reliably), and you get the ability to set up frequencies and some extra metadata.

Downsides: you're relying on the precision of this external service. Both the max and min frequency depend on the invisibility timeout range, and if Azure Queue lapses on certain guarantees (e.g. a message doesn't re-appear after 10 seconds but after 15), you'll miss your tick. Also, if your puller agent gets stuck and doesn't pull messages, your schedule stalls; once the pulling agent gets back to normal, you'll trigger a cascade of "catching up" ticks.

Eldar1205 commented 8 years ago

Thanks for the response!

I had a design brainstorming session with Sergey Bykov, and the idea of a "reminder queue" with stateless workers pulling from it came up as a way to implement a custom reminder service supporting faster persistent reminder ticks than the reminders Orleans provides. While this offers very interesting management/control features and reliability, I'm exploring options that avoid a queue, since it's another possible point of failure, another component to monitor, etc. It's possible a virtual actor abstraction isn't meant to provide this level of control over specific silo placement, but I think a good framework also provides advanced low-level management operations for infrastructure use cases, which could later be folded into the framework itself.

jdom commented 8 years ago

Even though grains are typically reactive, I think there are several different ways to implement this:

  1. You mentioned using reminders to keep a certain number of grains alive at all times, which would then ping the "always active" grain frequently.
  2. Another is to use a bootstrap provider to create 1 stateless worker when the silo starts up, and use grain timers inside it to ping the "always active" grain every 10 seconds. This way you'll always have 1 custom reminder grain per silo. Others, feel free to poke holes in this, as I'm not completely sure of the conditions under which a stateless worker grain gets deactivated.
  3. Beware, this might be even more hackish, but it might be worth trying if option 2 does not guarantee keeping the stateless worker alive: have the bootstrap provider start a never-ending async loop (but be sure to have the Init method return quickly; don't block that task indefinitely) that pings the always-active grain every 10 seconds. You will need to be careful to capture the context for the loop to use, otherwise you won't be able to make any grain calls (see the sketch after this list).
  4. An option closer to what @centur proposed (in terms of having some external HA service backing you): have several instances of these "always active" grains (as opposed to several instances of your custom reminder grains), and keep them alive through Orleans reminders. Do some kind of leader election so that all but one of these instances stay idle except for trying to grab a resource, for example trying to acquire a blob lease (see the Leader Election Pattern for an example of the blob lease approach). The one that acquired the lease performs the operation. If the operation is idempotent, then for additional resiliency, and to mitigate your "another point of failure" risk when blob storage is unavailable, have all of your instances perform the critical operation. Normally you'd have 1 grain working, but very infrequently you might have more.
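Here is a hedged sketch of option 3, written against the Orleans 1.x bootstrap provider API; IAlwaysActiveGrain and its Ping method are hypothetical stand-ins for the real "always active" grain:

```csharp
// Sketch only: a bootstrap provider that starts a fire-and-forget ping loop.
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

public interface IAlwaysActiveGrain : IGrainWithIntegerKey // hypothetical interface
{
    Task Ping();
}

public class KeepAliveBootstrapProvider : IBootstrapProvider
{
    public string Name { get; private set; }

    public Task Init(string name, IProviderRuntime providerRuntime,
                     IProviderConfiguration config)
    {
        Name = name;
        // Start the loop here, on the Orleans task scheduler, so the captured
        // context allows grain calls - but do NOT await it: Init must return quickly.
        PingLoop(providerRuntime.GrainFactory).Ignore();
        return TaskDone.Done;
    }

    private static async Task PingLoop(IGrainFactory grainFactory)
    {
        var grain = grainFactory.GetGrain<IAlwaysActiveGrain>(0);
        while (true)
        {
            try { await grain.Ping(); }
            catch (Exception) { /* log and keep looping */ }
            await Task.Delay(TimeSpan.FromSeconds(10));
        }
    }

    public Task Close() { return TaskDone.Done; }
}
```
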
Eldar1205 commented 8 years ago

Hi jdom.

  1. The required end result is a grain that is always active, with sub-minute downtime, achieved using Orleans' built-in abstractions (e.g. reminders, bootstrap providers, ...). I already have a solution with a component external to Orleans and would rather avoid it. I'll need to support up to tens of thousands of "always active" grains of the same type, differentiated by their IDs. That amount of reminders is something Orleans handles well with enough instances, so the pinging mechanism should be smart enough not to create millions of pings.
  2. The stateless worker can keep itself alive as long as it wants, or simply be configured in Orleans with an age limit of 1000 years. But how would that help? Which "always active" grains should the stateless worker ping? It should be grains on other silos, so that if those silos go down the current silo makes sure those grains are re-activated.
  3. There would be too many "always active" grains for one stateless worker per silo to deal with.
  4. If I have to introduce an additional self-managed storage component, I prefer the external HA service over blob leasing since it provides more flexibility. Additionally, with Orleans I don't need leader election with blobs - I simply invoke a cluster-level singleton grain (e.g. GrainFactory.GetGrain(0)).
jdom commented 8 years ago

Ok, the way I understood it at first was that you only really need very few "always active" grains, so my suggestions were geared towards that, sorry. Nevertheless, I'm sure we are misunderstanding each other, since the objections seem to be about something different. BTW, the intention with each number was to propose different solutions; they were not steps of a single solution.

> 1. I already have a solution with a component external to Orleans and would rather avoid it.

I was just re-stating your proposed solution here to contrast with the other approaches.

> There would be too many "always active" grains for one stateless worker per silo to deal with.

Yeah, proposals 2 and 3 might not scale well if you have several thousands of "always active" grains. They were effectively the same proposal, to materialize 1 of these custom reminders/keep-alives per silo, except that in 2 it was a stateless worker pinging, and in 3 it was a background process pinging.

> 4. If I have to introduce an additional self-managed storage component, I prefer the external HA service over blob leasing since it provides more flexibility. Additionally, with Orleans I don't need leader election with blobs - I simply invoke a cluster-level singleton grain (e.g. GrainFactory.GetGrain(0)).

Here's where I can't tell whether we are talking about the same thing. The highly available service I was talking about is precisely blob storage (I'm not sure which one you refer to). And yes, GrainFactory.GetGrain(0) will give you a cluster-level singleton, but what I was suggesting is to make sure there is always at least 1 activation of that critical function running at all times, by having multiple grains compete for a blob lease. Still, in light of this new scale information, it might not be the right approach.
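For illustration, a hedged sketch of the blob-lease competition, using the modern Azure.Storage.Blobs SDK (which postdates this thread); the blob and the 60-second duration are assumptions matching the sub-minute requirement:

```csharp
// Sketch only: each "always active" grain instance calls this periodically;
// whichever one holds the lease performs the work, the others stay idle.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

static class BlobLeaseElection
{
    public static async Task<bool> TryBecomeLeaderAsync(BlobClient leaseBlob)
    {
        var leaseClient = leaseBlob.GetBlobLeaseClient();
        try
        {
            // Blob leases can be 15-60 seconds (or infinite); 60 seconds
            // fits the sub-minute failover requirement discussed above.
            await leaseClient.AcquireAsync(TimeSpan.FromSeconds(60));
            return true; // we are the leader until the lease expires
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            return false; // another instance holds the lease; stay idle
        }
    }
}
```
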

Eldar1205 commented 8 years ago

After thinking about it more, I realized there might be a very simple solution for keeping grains "always active". Let's say I have a stateless worker per silo that is alive as long as the silo is running. If that worker could access the grain directory and know, given a grain id, whether it would be activated on the current silo, then that worker could be the one that keeps the "always active" grains active by simply activating them, without an external queue and even without TCP latency - the grains that should be activated on the current silo will be activated by the stateless worker local to that silo. When silos leave/join the cluster, their stateless workers shut down, eventually the grain directories on all live silos are updated with the new grain id distribution, and the stateless workers running on them will make sure the "always active" grains are re-balanced between them.

So the big question is: does Orleans expose an API to query, given a grain id (or ids), whether it would be activated on the current silo, without activating it? Note: I know the stateless worker could simply activate the grains and ask them whether they are on the same silo as it, but I'd prefer not to incur that potential TCP request when the grain would be activated on another silo.


sergeybykov commented 8 years ago

> So the big question is: does Orleans expose an API to query, given a grain id (or ids), whether it would be activated on the current silo, without activating it?

The grain directory API is internal. But I don't think you need it here. You could simply have a static concurrent dictionary within the silo app domain and have grains add/remove their IDs there from within their OnActivateAsync()/OnDeactivateAsync(). It's a violation of isolation, but you are already hacking around.
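A minimal sketch of that static-dictionary hack, assuming a hypothetical IAlwaysActiveGrain with an integer key (all names are illustrative):

```csharp
// Sketch only: grains record their own activations in a silo-local registry.
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Orleans;

public interface IAlwaysActiveGrain : IGrainWithIntegerKey { } // hypothetical interface

public static class LocalActivationRegistry
{
    // Grain IDs currently activated in this silo's app domain.
    public static readonly ConcurrentDictionary<long, bool> ActiveGrains =
        new ConcurrentDictionary<long, bool>();
}

public class AlwaysActiveGrain : Grain, IAlwaysActiveGrain
{
    public override Task OnActivateAsync()
    {
        LocalActivationRegistry.ActiveGrains.TryAdd(this.GetPrimaryKeyLong(), true);
        return base.OnActivateAsync();
    }

    public override Task OnDeactivateAsync()
    {
        bool removed;
        LocalActivationRegistry.ActiveGrains.TryRemove(this.GetPrimaryKeyLong(), out removed);
        return base.OnDeactivateAsync();
    }
}
```
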

nkosi23 commented 8 years ago

I love the idea of leveraging Reminders. I have a question: how well do Reminders/Timers scale?

  1. Is it fine to have reminders in the hundreds of thousands?
  2. I've read in the documentation that reminders are persistent, so here's my question: is it possible to attach a Reminder to a grain without activating it?

For 2, my use case is that grains are associated with a persistent storage record, and I'd like to set up grain-specific reminders. For example, let's imagine I have a grain type called Party. Each party has a start time and an end time. I'd like some processing to be done in the grain every minute while the party is taking place, and only while it's taking place. This is the grain-specific Reminder I was mentioning.

Now, the database will likely contain millions of archived Party records, and it would be inefficient to activate them all. So I am wondering whether, provided the storage layer exposes only active Party grains to Orleans, there is a built-in mechanism to get Orleans to attach the reminders at startup time.

I'm trying to figure out how I can leverage Orleans while remaining within the intended usage, and whether this falls within the intended usage without getting hacky :)

sergeybykov commented 8 years ago

  1. Should be fine.
  2. Not supported today. I guess one could create a hacky side tool that would call the reminder service directly and register reminders on behalf of grains.

> I'd like some processing to be done in the grain every minute while the party is taking place, and only while it's taking place. This is the grain-specific Reminder I was mentioning.

So the grain will get activated, right? And the grain could register a reminder when the party starts, and unregister it when the party ends? That would be the typical reminder scenario. Why do you need to register a reminder from outside of the grain, then?
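As a hedged sketch of that typical scenario (the grain interface and helper methods are hypothetical):

```csharp
// Sketch only: a Party grain that registers a per-minute reminder when the
// party starts and unregisters it once the party has ended.
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

public interface IPartyGrain : IGrainWithIntegerKey // hypothetical interface
{
    Task StartParty();
}

public class PartyGrain : Grain, IPartyGrain, IRemindable
{
    private const string TickReminder = "party-tick";

    public async Task StartParty()
    {
        // Reminders have a one-minute minimum period, which matches the
        // "every minute while the party runs" requirement.
        await RegisterOrUpdateReminder(TickReminder,
            dueTime: TimeSpan.FromMinutes(1), period: TimeSpan.FromMinutes(1));
    }

    public async Task ReceiveReminder(string reminderName, TickStatus status)
    {
        if (reminderName != TickReminder) return;

        await DoPerMinuteWork();

        if (PartyHasEnded())
        {
            var reminder = await GetReminder(TickReminder);
            if (reminder != null) await UnregisterReminder(reminder);
        }
    }

    private Task DoPerMinuteWork() { return Task.CompletedTask; } // hypothetical work
    private bool PartyHasEnded() { return false; } // hypothetical end-time check
}
```
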

nkosi23 commented 8 years ago

Many thanks for your answer.

> Why do you need to register a reminder from outside of the grain, then?

I had in mind the scenario where, for some reason, there would be preexisting "grains" in the database - I mean, records created prior to the introduction of Orleans in a project. But I'm realizing that this situation is very unlikely to happen in practice - and even if it happens, it'd be rather easy to explain the forward compatibility to users.

Ok so that should be good enough for me. Sounds very exciting.