This is your own persistent stream provider, so can you tell what the code is doing in the shutdown function? Also, can you tell what part exactly is getting stuck?
The logic in the shutdown sequence is supposed to be the reverse of the startup logic. Thus, we first stop all application activities, then stop the providers, then stop all system activities, which is done later in the SafeKill function. The agent system target is considered a system activity. So the intent, at least, was that the current sequence is the right one. But maybe as part of application shutdown we also stop something that is required for the providers to stop. For example, if the provider tries to send a message to a grain as part of shutdown, it won't work, since we have already stopped the application grains.
Sergey, I think we do have a test with shutdown that also involves the stream provider shutdown being called. The Azure Queue provider and SMS provider do nothing in their shutdown calls, but the test should check that at least everything is called and executed correctly. Worth double checking and making sure we have such a test.
So I would suggest spending a bit of time to understand what dependency is shut down prematurely before changing the order. We may decide eventually that providers have to be shut down before the application grains, but I would like to understand whether this is the problem first.
You can run the sample streaming tests in the Orleans solution; the problem with the shutdown will occur there as well. My guess is that the PersistentStreamPullingManager is communicating with the PullingAgents via grain messages, and that won't work because the application grains are already stopped.
Hmmm, the pulling agent is a system target, so it should have worked. But on the other hand, it's what we call a low priority system target from the scheduler's perspective, so maybe there is some mix-up there and the scheduler stops scheduling its turns prematurely by mistake.
We definitely need a test for that, which will explicitly fail if this doesn't work properly.
Again, as I said, I'm not such an expert in Orleans inner workings, so my guess may be wrong. I will just say that after moving the stream provider shutdown to be number 3 in the Terminate function's shutdown list, the stream provider was shut down properly (I don't know about the rest of the providers or mechanisms of Orleans).
I am not saying you are wrong. I actually think it is very much possible, and I have an idea of what could be causing it.
But it would help a lot to have a specific test to check that behavior. Such a test should explicitly fail if the shutdown does not execute correctly. Just watching the progress in our other tests is not good enough to automatically test this. The provider would probably need to signal back to the test whether or not it was called.
One way this can be done is by writing a dummy stream provider that does nothing except call back into the test on some MarshalByRefObject. The provider can be defined in place in the test. Overall, I think it would be an easy test to write.
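For illustration, a minimal sketch of such a test provider, assuming the 1.x-era provider surface (Init/Close on the provider, Start on IStreamProviderImpl); the ShutdownSignal and SignalingStreamProvider names are made up, and the static flags stand in for the MarshalByRefObject callback used when the test silo runs in a separate AppDomain:

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;
using Orleans.Streams;

// Hypothetical signal object the test can inspect after the silo stops.
// With a test silo in a separate AppDomain this would be a MarshalByRefObject
// passed into the provider instead of static flags.
public static class ShutdownSignal
{
    public static bool InitCalled;
    public static bool CloseCalled;
}

// Dummy provider that does no streaming work at all; it only records whether
// the silo actually drove it through Init and Close. The member set below is
// assumed from the 1.x-era IProvider/IStreamProviderImpl surface and may need
// adjusting to the exact interfaces of the version under test.
public class SignalingStreamProvider : IStreamProviderImpl
{
    public string Name { get; private set; }

    public bool IsRewindable { get { return false; } }

    public Task Init(string name, IProviderRuntime providerRuntime, IProviderConfiguration config)
    {
        Name = name;
        ShutdownSignal.InitCalled = true;
        return TaskDone.Done;
    }

    public Task Start()
    {
        return TaskDone.Done;
    }

    public Task Close()
    {
        // This is the call the test asserts on after silo shutdown.
        ShutdownSignal.CloseCalled = true;
        return TaskDone.Done;
    }

    public IAsyncStream<T> GetStream<T>(Guid streamId, string streamNamespace)
    {
        throw new NotSupportedException("Test-only provider; it exposes no real streams.");
    }
}
```

The test would register this provider in the test silo configuration, run the silo shutdown, and fail if ShutdownSignal.CloseCalled is still false.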
Looked at the code now. I'm pretty sure the bug is indeed that when we stop application turns in the scheduler, we stop all non-system work item groups. The problem is that the pulling agent is considered non-system at this point, since it is a low priority system target. The fix would be, instead of looking at whether IsSystem is set on the work item group, to look at the scheduling context (the key in the concurrent dictionary in the Orleans task scheduler) and check whether it is a system target or an activation. Basically, we need to treat scheduling priority separately from whether the context is an application context or not.
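To illustrate the distinction (not the actual Orleans scheduler code; all names here are hypothetical stand-ins):

```csharp
// Illustration only: these are hypothetical stand-in types, not the real
// Orleans scheduler internals.
enum SchedulingContextType { Activation, SystemTarget }

class SchedulingContextSketch
{
    public SchedulingContextType ContextType;
    public bool IsLowPriority; // scheduling priority, orthogonal to what the context is
}

static class StopApplicationTurnsSketch
{
    // Buggy criterion: anything not scheduled as high-priority system work
    // counts as "application", so low-priority system targets such as the
    // pulling agents get their turns stopped too.
    public static bool ShouldStopByPriority(SchedulingContextSketch ctx)
    {
        return ctx.ContextType != SchedulingContextType.SystemTarget || ctx.IsLowPriority;
    }

    // Proposed criterion: decide by what the context is (grain activation vs.
    // system target), independent of its scheduling priority.
    public static bool ShouldStopByContextType(SchedulingContextSketch ctx)
    {
        return ctx.ContextType == SchedulingContextType.Activation;
    }
}
```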
Hi @gabikliot
I've observed this behavior with stock AQProvider. From 1.0.10.
I'm thinking, regardless of the scheduling priority, wouldn't it make more sense to move step 10 to between 2 and 3? If stream providers keep pumping messages, they'll continue triggering new activations of grains. So step 3 is kind of meaningless.
A similar issue with step 4 being after 3. If we don't stop the gateway, the silo keeps receiving new requests that trigger new activations.
In other words, wouldn't it make more sense to try to stop all inputs into the silo first: gateway, stream providers, reminders, before deactivating grains?
Makes sense to me. We stop our servers exactly like that.
I would say it's a pattern: shut the door first. :)
damn phone :)
I meant: close the door and shut the windows, before flushing the toilet :))
I agree with @sergeybykov and @yevhen regarding shutting down inputs first. I think we've encountered this because stream providers are unique in that they are the only 'active' providers we have. That is, they are the only providers that actively generate work rather than passively responding to application layer logic. Rather than treating this as a one-off, I propose we introduce the idea of active providers into the system, modeled somewhat off of the way controllable providers (IControllable) are supported.
Consider an interface: interface IActiveWorker { Task Start(); Task Stop(); }
During silo startup we start all providers that implement IActiveWorker. During silo shutdown, prior to turning off all grains, we stop all providers that implement IActiveWorker.
IActiveWorker would signal an optional behavior that could be added to any type of provider, allowing that provider to take part in the silo's lifecycle in a manner appropriate for inputs into the system.
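A hedged sketch of what that could look like; nothing here exists in Orleans today, and both the interface and the lifecycle helper are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical opt-in interface for "active" providers, as proposed above.
public interface IActiveWorker
{
    Task Start();
    Task Stop();
}

// Hypothetical silo-side helper: during startup, start every provider that
// opted in; during shutdown, stop them before deactivating application grains.
public static class ActiveWorkerLifecycle
{
    public static Task StartAll(IEnumerable<object> providers)
    {
        return Task.WhenAll(providers.OfType<IActiveWorker>().Select(p => p.Start()));
    }

    public static Task StopAll(IEnumerable<object> providers)
    {
        return Task.WhenAll(providers.OfType<IActiveWorker>().Select(p => p.Stop()));
    }
}
```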
Ok, my answers:
First, I think we need a test. Not having one was the reason we didn't notice this. Just observing, without a failing test, is good but not enough.
Second, on Jason's idea: we already have Init and Close on all providers.
Third, we do shut the door first, Yevgeny, with steps 4 and 5.
Fourth, if you stop the gateway before deactivation, you will break the scenario where a grain notifies its observer or closes a stream in deactivate. To remind you, Sergey, that scenario (exactly for Reed) was the original reason for deactivating grains during shutdown.
Fifth, Sergey, no, they won't reactivate grains, due to step 5. And even if step 5 is somehow not good, has a bug or whatever, we can harden it by raising a global flag in the catalog: "don't create". We might already have that, by moving into the shutdown membership (mbr) state; I think that tells the catalog not to activate anything anymore.
Last, based on all of the above, I wouldn't rush to "let's just change the order". The first impression, even mine, was to just change it, but then I kept thinking about it and recalled all the thinking that was already put into the current sequence. I'm not saying we shouldn't change it under any circumstances, but I suggest thinking it through thoroughly first.
Was able to reproduce the problem and validate that the root cause is as I suspected. The silo log has:
Dropping work item [Request WorkItem Name=RequestWorkItem:Id=75 global::Orleans.Streams.IPersistentStreamPullingAgent:Shutdown(), Ctx=null]: [LowPrioritySystemTarget: S127.0.0.1:22222:184921932PullingAgentSystemTarget/FF/ed38edaa@Sed38edaa] -> Request S127.0.0.1:22222:184921932PullingAgentsManagerSystemTarget/FE/97bd94a3@S97bd94a3->S127.0.0.1:22222:184921932PullingAgentSystemTarget/FF/ed38edaa@Sed38edaa #75: global::Orleans.Streams.IPersistentStreamPullingAgent:Shutdown() because applicaiton turns are stopped.
Fixed via #1015.
As for the order of shutdown operations, the current silo shutdown sequence is: a) stop application b) stop providers c) stop the system
Considerations for this order:
a) The storage provider should definitely be stopped after the application (grains) and not before, to allow deactivating grains to flush their state to storage.
b) Stream providers should also be stopped after the application (grains) and not before, to allow deactivating grains to send last pieces of state to streams and maybe signal stream completion. If we stop stream providers before grains are deactivated, grain logic that uses streams inside deactivate will throw, which basically means some operations are allowed and some are not at different stages of the grain's life, and that would break our abstractions.
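To make (b) concrete, a hedged sketch of a grain that completes a stream during deactivation (the grain, its interface, and the "sms" provider name are made up for illustration); this is the kind of code that would start throwing if stream providers were closed before grains are deactivated:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public interface ISessionGrain : IGrainWithGuidKey { }

// Hypothetical grain; "sms" stands in for whatever stream provider is configured.
public class SessionGrain : Grain, ISessionGrain
{
    private IAsyncStream<string> stream;

    public override async Task OnActivateAsync()
    {
        var provider = GetStreamProvider("sms");
        stream = provider.GetStream<string>(this.GetPrimaryKey(), "sessions");
        await base.OnActivateAsync();
    }

    public override async Task OnDeactivateAsync()
    {
        // Runs while the silo is deactivating all grains. If stream (or storage)
        // providers had already been closed at this point, these calls would
        // throw, which is why providers are stopped after the application grains.
        await stream.OnNextAsync("session closed");
        await stream.OnCompletedAsync();
        // A persistent grain would similarly call WriteStateAsync() here.
        await base.OnDeactivateAsync();
    }
}
```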
Based on the above, I don't believe we should "move step 10 to between 2 and 3".
The bug was fixed via #1015.
Hello all, I'm working on a new persistent stream provider for Orleans (Kafka based), and after updating Orleans to version 1.0.10 I noticed that during my tests (and, to be more precise, at the end of my tests) the shutdown of the Orleans testing silo takes a significant amount of time. After digging a bit inside I found out that Silo.Terminate is stuck for a long time on the function that closes the stream providers. After digging a bit more I discovered that the problem is in the PersistentStreamProvider shutdown. The PersistentStreamPullingManager is trying to tell the PullingAgent (in my test I have only one physical queue, so only one PullingAgent) to shut down, but it fails to do so, and eventually (after 30 seconds) the operation times out.

My suspicion is that because the termination of the providers happens relatively near the end of Silo.Terminate, the function has already terminated parts of Orleans that are necessary for the termination of the stream providers (specifically the PersistentStreamProviders). Changing the order of termination in the function did make the stream provider shutdown complete successfully, but unfortunately I don't have enough time or knowledge of Orleans inner workings to know what the correct order of termination in Orleans is. So I'm asking for your help :sweat_smile:
This is the current order of shutdown in Silo.Terminate (with the stream provider shutdown highlighted):

Thanks for the help