dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans

Silo should forward request when asked to shutdown #2287

Open benjaminpetit opened 7 years ago

benjaminpetit commented 7 years ago

If I'm not mistaken, when a silo shuts down properly (as opposed to being stopped), it should:

  1. Finish the requests already being processed on locally activated grains (requests whose processing started before shutdown was called), and then
  2. Forward the remaining unprocessed requests to another silo so they can be completed by someone else

It seems that step 2 is not done properly. In this test https://github.com/benjaminpetit/orleans/commit/53631d95df931e6c416170d92bd6b4caeb0b1f45 the await on promisesAfterShutdown throws a TimeoutException.

Am I missing something?
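
For reference, the linked test has roughly the following shape. This is a simplified sketch, not the actual test from the commit: IBusyGrain, DoLongWork and the fixture members used here (GrainFactory, HostedCluster, SecondarySilos) are illustrative assumptions about the test framework of that time.

// Simplified sketch of the repro, not the actual test from the linked commit.
[Fact]
public async Task Requests_Issued_During_Shutdown_Should_Be_Forwarded()
{
    var grain = this.GrainFactory.GetGrain<IBusyGrain>(Guid.NewGuid());

    // Step 1: a long-running call keeps the (non-reentrant) activation busy
    // when the shutdown starts; it should be allowed to finish.
    Task inFlight = grain.DoLongWork(TimeSpan.FromSeconds(10));

    // Begin a graceful shutdown (shutdown, not kill) of the silo hosting the
    // activation while the first call is still executing.
    Task shutdown = Task.Run(() =>
        this.HostedCluster.StopSilo(this.HostedCluster.SecondarySilos.First()));

    // Step 2: requests issued after shutdown has started should be forwarded
    // to another silo and complete there...
    Task promisesAfterShutdown = grain.DoLongWork(TimeSpan.FromSeconds(1));

    await inFlight;
    await shutdown;
    await promisesAfterShutdown; // ...but today this throws TimeoutException.
}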

gabikliot commented 7 years ago

Let's check in this test, properly marked as throwing an exception if it indeed fails, and then I will take a deeper look. From the brief look I took now, it sounds like you are correct: the 2nd call should have succeeded. (HostedCluster.StopSilo is actually doing a shutdown -> confusing. The grain is non-reentrant.)

benjaminpetit commented 7 years ago

I moved the test to another branch, removed the functional category tag, and sent this PR: #2292

gabikliot commented 7 years ago

In https://github.com/dotnet/orleans/pull/2292#issuecomment-253379088 @benjaminpetit wrote:

I am not 100% sure, but I think we may be shutting down the MessageCenter too soon. When I remove the call to SafeExecute(messageCenter.BlockApplicationMessages) from Silo.Terminate(bool gracefully), the test seems to pass every time.

The test might also fail because of SiloUnavailableException. Hopefully, this will be fixed soon :)

gabikliot commented 7 years ago

Good catch, this is indeed most probably the cause. The message is queued for rerouting, but now there is a race between that rerouting and the message center shutting down.

The fix is far from trivial. You don't want to just remove messageCenter.BlockApplicationMessages, because it has a purpose of its own. Even if you did, it would only shorten the race, not eliminate it, since rerouting may require a directory lookup (it almost certainly will, since the grain no longer has an activation). That means we need to wait until the rerouting (which involves async activities, such as the directory lookup) has fully finished. To do that properly we would need to expose the rerouted messages from SafeExecute(() => catalog.DeactivateAllActivations().WaitWithThrow(stopTimeout)); and wait on a Task associated with their rerouting. Quite an invasive change, across multiple layers.
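
To make that direction a bit more concrete, here is a minimal, self-contained sketch of the pattern being described: collect a Task for each message queued for rerouting and drain them all before the message center closes its gates. The class and member names below are invented for illustration; they are not existing Orleans APIs.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustration only: track the async rerouting work triggered by deactivation so
// that shutdown can wait for it instead of racing with the message center.
class RerouteDrain
{
    private readonly ConcurrentBag<Task> rerouteTasks = new ConcurrentBag<Task>();

    // Called wherever a message is queued for rerouting during deactivation;
    // the Task completes once the reroute (including any directory lookup) is done.
    public void Track(Task rerouteCompleted) => rerouteTasks.Add(rerouteCompleted);

    // Called from the silo shutdown path, before blocking application messages.
    public async Task WaitForReroutesAsync(TimeSpan timeout)
    {
        Task all = Task.WhenAll(rerouteTasks);
        if (await Task.WhenAny(all, Task.Delay(timeout)) != all)
            throw new TimeoutException("Rerouted messages did not drain in time.");
        await all; // propagate any rerouting failures
    }
}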

Conceptually, I think this is the right thing to do: be more strict in the silo about the different silo-internal shutdown stages and move between them in lockstep (I am not suggesting distributed lockstep, only lockstep within the silo w.r.t. its internal activities). But that will not be easy. Let me know if you want to embark on that journey and we can brainstorm how exactly.

jdom commented 7 years ago

Related to this talk about the current (implicit) shutdown stages: some time ago @jason-bragg and I were discussing (and prototyping) having different explicit startup stages, as well as different shutdown stages, that each component could hook into. As you mention here, it is sometimes not enough to just start or stop the big components in a binary way (e.g. all bootstrap providers start after X has started); they need to go in lockstep with other components.

jason-bragg commented 7 years ago

For clarification, @jdom and I had discussed some of these issues before, but no serious prototyping was done in Orleans proper. When discussing what the .NET Core-compliant Orleans 2.0 might look like, I created a sandbox to play in and prototype. In this sandbox I modeled things on ASP.NET and explored some ideas, one of which is the startup/shutdown sequence. This work can be found in https://github.com/jason-bragg/Oxygen, specifically the Microsoft.Orleans.Lifecycle and Microsoft.Orleans.RingLifecycle namespaces, used in Orleans/Test/TinkerHostTests/TinkerHostLifecycleTests.cs.

This prototyping was not part of any official Orleans work; this was just me tinkering. I named it Oxygen because Orleans 2 => O2, which is molecular oxygen. I wanted to call it Ozone but that's O3, so I'll have to wait a while. :)
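
For anyone who doesn't want to dig through the sandbox repo, the core idea can be sketched in a few lines: components subscribe to numbered lifecycle stages, and the host starts the stages in ascending order and stops them in descending order, so startup and shutdown proceed in lockstep. This is only an illustration of the concept; it is neither the Oxygen prototype code nor an Orleans API.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Illustration only: a minimal staged lifecycle.
interface IStageObserver
{
    Task OnStart();
    Task OnStop();
}

class StagedLifecycle
{
    // Lower stage numbers start earlier and stop later.
    private readonly SortedDictionary<int, List<IStageObserver>> stages =
        new SortedDictionary<int, List<IStageObserver>>();

    public void Subscribe(int stage, IStageObserver observer)
    {
        if (!stages.TryGetValue(stage, out List<IStageObserver> observers))
            stages[stage] = observers = new List<IStageObserver>();
        observers.Add(observer);
    }

    public async Task Start()
    {
        foreach (var stage in stages) // ascending stage order
            await Task.WhenAll(stage.Value.Select(o => o.OnStart()));
    }

    public async Task Stop()
    {
        foreach (var stage in stages.Reverse()) // descending stage order
            await Task.WhenAll(stage.Value.Select(o => o.OnStop()));
    }
}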

benjaminpetit commented 7 years ago

Yes, the fix doesn't seem easy, but I guess we need to do it :) I am not sure, though, whether we should design/implement the "shutdown stages" now or fix this bug first (even if that will be a big change too).

Removing messageCenter.BlockApplicationMessages would still leave some race conditions, but is it a required step for graceful shutdown in the first place? The documentation of the method says (maybe the description is wrong):

/// Indicates that application messages should be blocked from being sent or received.
/// This method is used by the "fast stop" process.

(Again, I know that just removing it is not enough - I'm just trying to understand why we do it.)

gabikliot commented 7 years ago

but I guess we need to do it

Why do we "need to" do it? Is there a concrete scenario where this causes trouble, or is it just a general desire to close all the cracks?

jdom commented 7 years ago

Several teams are looking at autoscaling, where the graceful shutdown scenario will occur very often; that's why we are prioritizing making this path a little more resilient. We want to fulfill the guarantees we thought we already had: absolutely no message gets lost, and no call takes the full timeout timespan to fail, only to surface a non-(automatically-)retryable exception (since from that exception there is no guarantee that the message wasn't processed). That is acceptable when there are exceptional network issues, but it shouldn't be acceptable in a non-exceptional scenario such as scaling in, patching, or other planned shutdowns.

gabikliot commented 7 years ago

There are 2 reasons for BlockApplicationMessages: 1) stop receiving new application messages from other silos (messages from clients are blocked differently, by stopping the gateway via StopAcceptingClientMessages); 2) stop sending out new application messages.

In theory, we can keep number 1 and remove number 2. I see no harm at all in keeping number 1; it should not impact our ability to do the resend.

Now the question is why we need number 2 at all. First, as we said, removing it is by itself not enough to fix your problem. But still, why do we need it at all? Frankly, I think that back when we added it I was not thinking about the reroute case, so it seemed correct to close the gates as soon as possible. I think I am OK with changing it in the following way: always BlockApplicationMessagesReceive, and only if (!gracefully) also BlockApplicationMessagesSend.

In the non-graceful case, the activations are still there and may still want to send messages (in the graceful case we have already deactivated them all), so we want to block their sends.
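
A sketch of what that change to Silo.Terminate could look like. BlockApplicationMessagesReceive and BlockApplicationMessagesSend are the hypothetical split of today's single BlockApplicationMessages; neither exists yet.

// Hypothetical sketch only: split BlockApplicationMessages into its two effects
// and block sends only in the non-graceful case.
private void Terminate(bool gracefully)
{
    // 1) Always stop receiving new application messages from other silos
    //    (client messages are already blocked via StopAcceptingClientMessages).
    SafeExecute(messageCenter.BlockApplicationMessagesReceive);

    // 2) Block outgoing application messages only on a non-graceful stop: during
    //    a graceful shutdown the outgoing path must stay open so that queued
    //    messages can still be rerouted to other silos.
    if (!gracefully)
        SafeExecute(messageCenter.BlockApplicationMessagesSend);

    // ... rest of the shutdown sequence ...
}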

gabikliot commented 7 years ago

@jdom, have you guys thought about how to test this? Here is an idea: take our existing nightly reliability tests, make sure we are shutting down the silos rather than killing them, and then change the test client to expect 100% message success. Currently it allows message loss during the first X minutes after a silo stops (X being a minute or two, I think).

gabikliot commented 7 years ago

There was no answer here, so I assume this is not critical. But if you guys do need to improve the reliability of the shutdown process (try to make it 100% bulletproof), we can talk about that. It should be doable, but it is not a small work item.

jdom commented 7 years ago

Hi Gabi, sorry for the lack of reply, but yes, this is very important, and @benjaminpetit was trying to get some tests for #2298 without any luck. Nevertheless, I do agree with the suggestion to add coverage to our reliability tests too. Of course the merged PR wouldn't pass those tests, as it didn't tackle the overall problem of graceful shutdown; it just made the process a little more robust during both graceful and ungraceful shutdown.

gabikliot commented 7 years ago

You can create a small version of the reliability test as a unit test, where you just send requests continuously to 3 silos (just 100 grains), then shut down one silo and check that no message was lost. It will probably fail, but you will have a repro and can then start investigating.
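
A rough sketch of what such a test could look like, assuming an xUnit fixture with the usual GrainFactory/HostedCluster members and a hypothetical IEchoGrain. It is meant as a starting point for a repro, not a finished test.

// Rough sketch: 100 grains receive constant traffic across 3 silos while one
// silo is gracefully shut down; with proper forwarding no request should fail.
[Fact]
public async Task Graceful_Shutdown_Should_Not_Lose_Messages()
{
    var grains = Enumerable.Range(0, 100)
        .Select(i => this.GrainFactory.GetGrain<IEchoGrain>(i))
        .ToList();

    int failures = 0;
    var cts = new CancellationTokenSource();

    // Keep sending requests to all 100 grains while the cluster topology changes.
    var senders = grains.Select(grain => Task.Run(async () =>
    {
        while (!cts.IsCancellationRequested)
        {
            try { await grain.Echo("ping"); }
            catch { Interlocked.Increment(ref failures); }
        }
    })).ToList();

    // Gracefully shut down one of the three silos mid-traffic (shutdown, not kill).
    this.HostedCluster.StopSilo(this.HostedCluster.SecondarySilos.First());

    await Task.Delay(TimeSpan.FromSeconds(30));
    cts.Cancel();
    await Task.WhenAll(senders);

    Assert.Equal(0, failures);
}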