dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Grain deactivation after the failure of node holding its partition #6732

Closed yevhen closed 2 years ago

yevhen commented 4 years ago

We boot a silo-local grain (a poller) from a startup task, and it needs to live forever (we configured GC to AlwaysOn). After some time, however, the runtime deactivates it. Looking at the logs, I can see:

Catalog is deactivating 10 activations due to a failure of silo S10.0.37.223:100:337005433/x0A8DCEBD, since it is a primary directory partition to these grain ids.

We run the cluster on spot instances, and evictions happen all the time. The problem is that Orleans assumes it may deactivate the grain because a later request will reactivate it, which is not true for this kind of grain.

Any ideas on how we can fix that?

tillr commented 4 years ago

Not sure if I understood correctly what you are trying to achieve, but deactivation of idle grains is a built-in mechanism of Orleans to free unused resources. However, the framework offers means to delay deactivation. Within a grain, you can use

this.DelayDeactivation(TimeSpan)

to delay the deactivation of a grain for the specified timespan. This will not prevent the grain from being deactivated on silo failure or shutdown, however. To overcome this, you can combine a Timer and a Reminder:

  1. At silo start-up, send a message to the grain you need to start up and keep alive (via a silo start-up task).
  2. Within the grain in question, delay deactivation of the grain using DelayDeactivation at start-up and optionally in timer callbacks. If the silo fails or an exception occurs within your grain, the grain will be deactivated and timers will cease to work. That's by design.
  3. To "survive" silo failures, you can register a reminder on grain activation. Reminders are persistent and will work even if your grain has been deactivated between reminder callbacks. This means that reminders will fire and re-activate the grain despite grain/silo failures. The reminder interval must be 2 minutes or greater, however. So what you could do is register a reminder which "pings" the grain every 2 minutes to keep it alive, or to re-activate it if a grain/silo failure has occurred in the meantime.

Using this approach, the maximum downtime of your grain would be about 2 minutes, assuming the worst case scenario where the grain/silo fails right after start-up. If this is not acceptable in your case, you might need an "external" task or process which periodically sends messages to your grain, with the downside of increased noise/traffic.
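The three steps above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: it assumes the Orleans 3.x API surface (parameterless `OnActivateAsync`) and a silo with a reminder service configured. `IPollerGrain`, `PollerGrain`, `Ping`, `PollerStartup`, the reminder name, and all intervals are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Orleans;
using Orleans.Hosting;
using Orleans.Runtime;

// Hypothetical grain interface; the thread does not show the real one.
public interface IPollerGrain : IGrainWithIntegerKey
{
    Task Ping();
}

public class PollerGrain : Grain, IPollerGrain, IRemindable
{
    private const string KeepAliveReminder = "keep-alive"; // hypothetical name
    private IDisposable _timer;

    public override async Task OnActivateAsync()
    {
        // Step 2: push idle collection far into the future.
        DelayDeactivation(TimeSpan.FromDays(1));

        // In-memory timer for the actual work. It is lost whenever
        // this activation is destroyed.
        _timer = RegisterTimer(_ => PollAsync(), null,
            TimeSpan.Zero, TimeSpan.FromSeconds(10));

        // Step 3: persistent reminder that re-activates the grain
        // after a grain or silo failure.
        await RegisterOrUpdateReminder(KeepAliveReminder,
            TimeSpan.FromMinutes(2), TimeSpan.FromMinutes(2));

        await base.OnActivateAsync();
    }

    // Any call activates the grain if it is not already active.
    public Task Ping() => Task.CompletedTask;

    // Delivering the reminder tick is what re-activates the grain,
    // so there is nothing else to do here.
    public Task ReceiveReminder(string reminderName, TickStatus status)
        => Task.CompletedTask;

    private Task PollAsync()
    {
        // Optionally extend the delay from the timer callback too.
        DelayDeactivation(TimeSpan.FromDays(1));
        return Task.CompletedTask; // the real polling work would go here
    }
}

public static class PollerStartup
{
    // Step 1: activate the grain once when the silo starts.
    public static ISiloBuilder AddPollerStartup(this ISiloBuilder builder) =>
        builder.AddStartupTask(async (services, cancellation) =>
        {
            var grains = services.GetRequiredService<IGrainFactory>();
            await grains.GetGrain<IPollerGrain>(0).Ping();
        });
}
```

Note that nothing in this sketch pins the grain to a particular silo. After a failure, the reminder may re-activate it wherever the placement strategy decides.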

yevhen commented 4 years ago

It's not about idleness. This grain has an infinite idle timeout. It's about a DHT partition failure.

benjaminpetit commented 4 years ago

Are you sure that in your scenario you need a grain for this? Can you share more about what this grain is doing?

Otherwise, as @tillr said, you should use a reminder to wake up the grain.

yevhen commented 4 years ago

I need this grain to be local to a silo (aka Silongton). A reminder may spawn it on a random silo.

Anyway, I solved this with a GrainService which periodically activates this grain. But this is something to be aware of. A few times I saw SiloGrain from OrleansDashboard dying for similar reasons (cc: @richorama)

UPD: Seems like SiloGrain is not bound to a particular silo 😃
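A GrainService that periodically activates a grain might look roughly like the sketch below. It is not the code used in this thread. It assumes the Orleans 3.x `GrainService` constructor and `Start`/`Stop` overrides. `IPollerGrain`, `Ping`, `PollerKeepAliveService`, the `"poller"` key, and the 30-second interval are hypothetical, and how the grain is kept local to the silo (key scheme or placement strategy) is not shown in the thread and is left out here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Orleans;
using Orleans.Core;
using Orleans.Runtime;
using Orleans.Services;

// Hypothetical grain interface; the thread does not show the real one.
public interface IPollerGrain : IGrainWithStringKey
{
    Task Ping();
}

public interface IPollerKeepAliveService : IGrainService { }

// One instance of a grain service runs on every silo that registers it,
// so each silo can keep its own poller alive.
public class PollerKeepAliveService : GrainService, IPollerKeepAliveService
{
    private static readonly TimeSpan Interval = TimeSpan.FromSeconds(30);

    private readonly IGrainFactory _grainFactory;
    private readonly ILogger _logger;
    private readonly CancellationTokenSource _stopping = new CancellationTokenSource();
    private Task _loop = Task.CompletedTask;

    public PollerKeepAliveService(
        IGrainIdentity id, Silo silo, ILoggerFactory loggerFactory,
        IGrainFactory grainFactory)
        : base(id, silo, loggerFactory)
    {
        _grainFactory = grainFactory;
        _logger = loggerFactory.CreateLogger<PollerKeepAliveService>();
    }

    public override async Task Start()
    {
        await base.Start();
        // Run the ping loop in the background for the lifetime of the silo.
        _loop = Task.Run(() => PingLoopAsync(_stopping.Token));
    }

    public override async Task Stop()
    {
        _stopping.Cancel();
        await _loop;
        await base.Stop();
    }

    private async Task PingLoopAsync(CancellationToken cancellation)
    {
        // Hypothetical key; a real silo-local grain needs a per-silo key
        // or a placement strategy that keeps it on this silo.
        var poller = _grainFactory.GetGrain<IPollerGrain>("poller");

        while (!cancellation.IsCancellationRequested)
        {
            try
            {
                // Any call re-activates the grain if it was deactivated.
                await poller.Ping();
            }
            catch (Exception ex)
            {
                _logger.LogWarning(ex, "Keep-alive ping failed; will retry.");
            }

            try
            {
                await Task.Delay(Interval, cancellation);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}
```

In Orleans 3.x a service like this would typically be registered on the silo builder with `AddGrainService<PollerKeepAliveService>()`. The worst-case gap after a deactivation is then roughly one ping interval.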

benjaminpetit commented 4 years ago

From what I understand, I think a GrainService is more suitable for your use case.

The message and behavior that you saw are perfectly normal and expected. If you want Grain activations that are resilient to the shutdown or failure of other Silos, you should look at the other Grain Directory implementations available: http://dotnet.github.io/orleans/Documentation/clusters_and_clients/grain_directory.html

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.