Open benjaminpetit opened 7 years ago
Let's check in this test, properly marked as throwing an exception if it indeed fails, and then I will take a deeper look. From the brief look I took now, it sounds like you are correct: the 2nd call should have succeeded (HostedCluster.StopSilo is actually doing a shutdown -> confusing. The grain is non-reentrant.)
I moved the test to another branch, removed the functional category tag, and sent this PR #2292
In https://github.com/dotnet/orleans/pull/2292#issuecomment-253379088 @benjaminpetit wrote:
I am not 100% sure, but I think we may shut down the MessageCenter too soon. When I remove the call to SafeExecute(messageCenter.BlockApplicationMessages) from Silo.Terminate(bool gracefully), the test seems to pass every time.
The test might also fail because of SiloUnavailableException. Hopefully, this will be fixed soon :)
Good catch, this is indeed most probably the cause. The msg is queued for rerouting, but now there is a race between this rerouting and the message center shutting down.
The fix is far from trivial. You don't want to just remove messageCenter.BlockApplicationMessages, because it has a purpose of its own. Even if you did, it would only narrow the race, not eliminate it, since as part of rerouting we may need to do a directory lookup (almost surely will, since the grain does not have an activation). That means we need to wait until the rerouting (which involves async activities, such as directory lookup) has fully finished. To do that properly we would need to expose those rerouted messages from SafeExecute(() => catalog.DeactivateAllActivations().WaitWithThrow(stopTimeout)); and wait on a Task associated with their rerouting. Quite an invasive change, across multiple layers.
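The "wait on a Task associated with their rerouting" idea could be sketched roughly as follows. This is a hypothetical illustration, not the actual Orleans internals: `RerouteTracker` and its methods are invented names for the purpose of the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch: track each message queued for rerouting with a Task,
// so that silo shutdown can await all in-flight reroutes (including their
// async directory lookups) before the MessageCenter blocks outbound sends.
public class RerouteTracker
{
    private readonly ConcurrentDictionary<object, Task> inFlight = new();

    // Called when a message is queued for rerouting; 'reroute' completes
    // once the directory lookup and resend have finished.
    public void Track(object message, Task reroute)
    {
        inFlight[message] = reroute;
        reroute.ContinueWith(_ => inFlight.TryRemove(message, out _));
    }

    // Awaited during graceful shutdown, bounded by the stop timeout.
    public Task WhenReroutesComplete(TimeSpan timeout)
    {
        var pending = Task.WhenAll(inFlight.Values);
        return Task.WhenAny(pending, Task.Delay(timeout));
    }
}
```

The invasive part, as noted above, is plumbing the reroute Tasks out of the deactivation path and across the layers between the catalog and the message center.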
Conceptually, I think this is the right thing to do: to be more strict in the silo about the different silo-internal shutdown stages and move between them in lock step (I am not suggesting distributed lock step, only lock step within the silo w.r.t. silo-internal activities). But that will not be easy. Let me know if you want to embark on that journey and we can brainstorm how exactly.
Related to this talk about the current (implicit) shutdown stages: some time ago @jason-bragg and I were talking about (and prototyping) something for having different explicit startup stages, as well as different shutdown stages, where each component could hook up to one or more of those. Because, as you mention here, sometimes it's not enough to just binary-start or -stop the different big components (e.g. all bootstrap providers start after X has started); they need to go in lockstep with other components.
For clarification, @jdom and I had discussed some of these issues before, but no serious prototyping was done in Orleans proper. When discussing what the .Net Core compliant Orleans 2.0 might look like I created a sandbox to play in and prototype. In this sandbox I modeled off of ASP.net and explored some ideas, one of which is startup/shutdown sequence. This work can be found in https://github.com/jason-bragg/Oxygen. Specifically the Microsoft.Orleans.Lifecycle and Microsoft.Orleans.RingLifecycle namespaces. Used in Orleans/Test/TinkerHostTests/TinkerHostLifecycleTests.cs.
This prototyping was not part of any official Orleans work, this was just me tinkering. I named it Oxygen because Orleans 2 => O2, which is a molecular oxygen molecule. I wanted to call it Ozone but that's O3, so I'll have to wait a while. :)
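The staged-lifecycle idea being discussed could look roughly like this. A minimal sketch only, with invented names — it is not the Oxygen prototype's API nor the eventual Orleans 2.0 lifecycle interfaces:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch: components subscribe to a numbered stage.
// Start runs stages in ascending order, Stop in descending order,
// and each stage fully completes before the next one begins —
// the "lock step within the silo" mentioned above.
public interface ILifecycleObserver
{
    Task OnStart();
    Task OnStop();
}

public class StagedLifecycle
{
    private readonly SortedDictionary<int, List<ILifecycleObserver>> stages = new();

    public void Subscribe(int stage, ILifecycleObserver observer)
    {
        if (!stages.TryGetValue(stage, out var list))
            stages[stage] = list = new List<ILifecycleObserver>();
        list.Add(observer);
    }

    public async Task Start()
    {
        foreach (var stage in stages.Values)                    // ascending order
            await Task.WhenAll(stage.Select(o => o.OnStart())); // stage completes before next
    }

    public async Task Stop()
    {
        foreach (var stage in stages.Values.Reverse())          // descending order
            await Task.WhenAll(stage.Select(o => o.OnStop()));
    }
}
```

Under such a scheme, e.g., "bootstrap providers start after X" becomes "subscribe the bootstrap providers at a higher stage number than X".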
Yes, the fix doesn't seem easy, but I guess we need to do it :) But I am not sure whether we should design/implement the "shutdown stages" now, or first fix this bug (even if that will be a big change too).
Removing messageCenter.BlockApplicationMessages does not prevent some race conditions, but is it a required step for graceful shutdown? Because the documentation of the method says (maybe the description is wrong):
/// Indicates that application messages should be blocked from being sent or received.
/// This method is used by the "fast stop" process.
(Again, I know, just removing it is not enough; I'm just trying to understand why we do it.)
but I guess we need to do it
Why do we "need to" do it? Is there some concrete scenario when this causes troubles, or just a general desire to close all cracks?
Several teams are looking at autoscaling, where the graceful shutdown scenario will occur very often; that's why we are prioritizing making this path a little more resilient. We want to fulfill the guarantees we thought we had: absolutely no message gets lost, no call takes the full timeout timespan to fail, and no call fails with a non-(automatically-)retryable exception (since with such an exception there is no guarantee that the message wasn't processed). This is OK when there are exceptional network issues, but it shouldn't be OK in a non-exceptional scenario such as scaling in, patching, or other planned shutdowns.
There are 2 reasons for BlockApplicationMessages: 1) to stop receiving new application messages from other silos (messages from clients are blocked differently, by stopping the GW via StopAcceptingClientMessages); 2) to stop sending out new application messages.
In theory, we can keep condition number 1 and remove number 2. I see no harm at all in keeping condition number 1; it should not impact our ability to do the resend.
Now the question is why we need number 2 at all. First, as we said, removing it by itself is not enough to fix your problem. But still, why do we need it at all? Frankly, I think back when we added it I was not thinking about the reroute case, so it seemed correct to close the gates as soon as possible. I think I am OK with changing it in the following way: BlockApplicationMessagesReceive; if (!gracefully) BlockApplicationMessagesSend.
In the non-graceful case, the activations are still there and may still want to send messages (in the graceful case we have already deactivated them all), so we want to block their sends.
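The proposed split could be sketched as follows. The method and field names here are illustrative (mirroring the BlockApplicationMessagesReceive/Send names suggested above), not the actual MessageCenter implementation:

```csharp
// Hypothetical sketch of the proposed change: always stop receiving
// application messages, but only stop sending them when the shutdown is
// ungraceful, so that reroutes/resends can still go out in the graceful case.
public class MessageCenterShutdownSketch
{
    private volatile bool blockReceive;
    private volatile bool blockSend;

    public void BlockApplicationMessages(bool gracefully)
    {
        // 1) Stop accepting new application messages from other silos.
        //    (Client messages are blocked separately, via the gateway.)
        blockReceive = true;

        // 2) In the ungraceful case activations are still alive and may try
        //    to send, so block them. In the graceful case they have all been
        //    deactivated already and pending reroutes still need to go out,
        //    so leave the send path open.
        if (!gracefully)
            blockSend = true;
    }
}
```

This preserves reason 1 unconditionally (which, per the discussion above, is harmless to keep) and makes reason 2 conditional on the shutdown mode.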
@jdom, have you guys thought about how to test this? Here is an idea: take our existing nightly reliability tests, make sure we are shutting the silos down and not killing them, and then change the test client to expect 100% message success. Currently it allows message loss during the first X minutes after a silo stops (X being a minute or 2, I think).
There was no answer here, so I assume this is not critical. But if you guys do need to improve the reliability of the shutdown process (try to make it 100% bulletproof), we can talk about that. It should be doable, but it's not a small work item.
Hi Gabi, sorry for the lack of reply, but yes, this is very important, and @benjaminpetit was trying to get some tests for #2298 without any luck. Nevertheless, I do agree with the suggestion of having some tests in our reliability tests too. Of course the merged PR wouldn't pass those tests, as it didn't tackle the overall problem of graceful shutdown; it just made things a little more robust during both graceful and ungraceful shutdown.
You can create a small version of the reliability test as a unit test where you just send requests all the time to 3 silos (just 100 grains), then shut one down and check that no message was lost. It will probably fail, but you will have a repro and can then start investigating.
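A rough sketch of such a repro follows. All the cluster and grain helpers here (`TestCluster.StartAsync`, `ShutdownSiloAsync`, `IEchoGrain`) are hypothetical stand-ins, not the actual test framework types:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: hammer ~100 grains on a 3-silo cluster, gracefully
// shut one silo down mid-run, and assert that no request ever failed.
public class GracefulShutdownRepro
{
    public async Task NoMessageLossOnGracefulShutdown()
    {
        var cluster = await TestCluster.StartAsync(siloCount: 3); // hypothetical helper
        var failures = 0;
        var cts = new CancellationTokenSource();

        // 100 client loops, each calling its own grain continuously.
        var clients = Enumerable.Range(0, 100).Select(async i =>
        {
            var grain = cluster.GrainFactory.GetGrain<IEchoGrain>(i); // hypothetical grain
            while (!cts.IsCancellationRequested)
            {
                try { await grain.Echo("ping"); }
                catch { Interlocked.Increment(ref failures); }
            }
        }).ToArray();

        await Task.Delay(TimeSpan.FromSeconds(5));           // steady-state traffic
        await cluster.ShutdownSiloAsync(cluster.Silos[1]);   // graceful shutdown, not a kill
        await Task.Delay(TimeSpan.FromSeconds(5));           // traffic after the shutdown

        cts.Cancel();
        await Task.WhenAll(clients);

        if (failures != 0)
            throw new Exception($"{failures} requests failed during graceful shutdown");
    }
}
```

As noted above, this is expected to fail today, which is the point: it gives a local repro of the message-loss window instead of waiting on the nightly reliability runs.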
If I'm not mistaken, when a silo properly shuts down (as opposed to being stopped) it should:
It seems that step 2 is not properly done. In this test https://github.com/benjaminpetit/orleans/commit/53631d95df931e6c416170d92bd6b4caeb0b1f45 the await promisesAfterShutdown throws a TimeoutException.
Am I missing something?