This is your own persistent stream provider, so can you tell what the code is doing in the shutdown function? Also, can you tell what part exactly is getting stuck?
The logic in the shutdown sequence is supposed to be the reverse of the startup logic. Thus, we first stop all application activities, then stop the providers, then stop all system activities, which is done later in the SafeKill function. The agent system target is considered a system activity. So the intent, at least, was that the current sequence is the right one. But maybe as part of application shutdown we also stop something that is required for the providers to stop. For example, if the provider tries to send a message to a grain as part of shutdown, it won't work, since we have already stopped the application grains.
Sergey, I think we do have a test with shutdown that also involves the stream provider shutdown being called. The Azure Queue provider and SMS provider do nothing in their shutdown calls, but the test should check that at least everything is called and executed correctly. Worth double checking and making sure we have such a test.
So I would suggest spending a bit of time to understand what dependency is shut down prematurely before changing the order. We may decide eventually that providers have to be shut down before the application grains, but I would like to understand whether this is the problem first.
You can run the sample streaming tests in the Orleans solution; the problem with the shutdown will occur there as well. My guess is that the PersistentStreamPullingManager is communicating with the PullingAgents via grain messages, and that won't work because the application grains are already stopped.
Hmmm, the pulling agent is a system target, so it should have worked. But on the other hand, it's what we call a low priority system target from the scheduler's perspective, so maybe there is some mix-up there and the scheduler stops scheduling its turns prematurely by mistake.
We definitely need a test for that, which will explicitly fail if this doesn't work properly.
Again, as I said, I'm not such an expert in Orleans inner workings, so my guess may be wrong. I will just say that after moving the stream provider shutdown to be number 3 in the Terminate function's shutdown list, the stream provider was shut down properly (I don't know about the rest of the providers or mechanisms of Orleans).
I am not saying you are wrong. I actually think it is very much possible, and I have an idea of what could be causing it.
But it would help a lot to have a specific test to check that behavior. Such a test should explicitly fail if the shutdown does not execute correctly. Just watching the progress in our other tests is not good enough to automatically test this. The provider would probably need to signal back to the test whether or not it was called.
One way this can be done is by writing a dummy stream provider that does nothing except call back into the test on some MarshalByRefObject. The provider can be defined in place in the test. Overall, I think it would be an easy test to write.
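For illustration, a minimal sketch of such a test provider, assuming the 1.x-era provider surface (Init/Close on the provider, Start on IStreamProviderImpl); the ShutdownSignal and SignalingStreamProvider names are made up, and the static flags stand in for the MarshalByRefObject callback used when the test silo runs in a separate AppDomain:

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;
using Orleans.Streams;

// Hypothetical signal object the test can inspect after the silo stops.
// With a test silo in a separate AppDomain this would be a MarshalByRefObject
// passed into the provider instead of static flags.
public static class ShutdownSignal
{
    public static bool InitCalled;
    public static bool CloseCalled;
}

// Dummy provider that does no streaming work at all; it only records whether
// the silo actually drove it through Init and Close. The member set below is
// assumed from the 1.x-era IProvider/IStreamProviderImpl surface and may need
// adjusting to the exact interfaces of the version under test.
public class SignalingStreamProvider : IStreamProviderImpl
{
    public string Name { get; private set; }

    public bool IsRewindable { get { return false; } }

    public Task Init(string name, IProviderRuntime providerRuntime, IProviderConfiguration config)
    {
        Name = name;
        ShutdownSignal.InitCalled = true;
        return TaskDone.Done;
    }

    public Task Start()
    {
        return TaskDone.Done;
    }

    public Task Close()
    {
        // This is the call the test asserts on after silo shutdown.
        ShutdownSignal.CloseCalled = true;
        return TaskDone.Done;
    }

    public IAsyncStream<T> GetStream<T>(Guid streamId, string streamNamespace)
    {
        throw new NotSupportedException("Test-only provider; it exposes no real streams.");
    }
}
```

The test would register this provider in the test silo configuration, run the silo shutdown, and fail if ShutdownSignal.CloseCalled is still false.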
Looked at the code now. I'm pretty sure the bug is indeed that when we stop application turns in the scheduler, we stop all non-system work item groups. The problem is that the pulling agent is considered non-system at this point, since it is a low priority system target. The fix would be, instead of looking at whether IsSystem is set on the work item group, to look at the scheduling context (the key in the concurrent dictionary in the Orleans task scheduler) and check whether it is a system target or an activation. Basically, we need to treat scheduling priority separately from whether the context is an application context or not.
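To illustrate the distinction (not the actual Orleans scheduler code; all names here are hypothetical stand-ins):

```csharp
// Illustration only: these are hypothetical stand-in types, not the real
// Orleans scheduler internals.
enum SchedulingContextType { Activation, SystemTarget }

class SchedulingContextSketch
{
    public SchedulingContextType ContextType;
    public bool IsLowPriority; // scheduling priority, orthogonal to what the context is
}

static class StopApplicationTurnsSketch
{
    // Buggy criterion: anything not scheduled as high-priority system work
    // counts as "application", so low-priority system targets such as the
    // pulling agents get their turns stopped too.
    public static bool ShouldStopByPriority(SchedulingContextSketch ctx)
    {
        return ctx.ContextType != SchedulingContextType.SystemTarget || ctx.IsLowPriority;
    }

    // Proposed criterion: decide by what the context is (grain activation vs.
    // system target), independent of its scheduling priority.
    public static bool ShouldStopByContextType(SchedulingContextSketch ctx)
    {
        return ctx.ContextType == SchedulingContextType.Activation;
    }
}
```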
Hi @gabikliot
I've observed this behavior with stock AQProvider. From 1.0.10.
I'm thinking, regardless of the scheduling priority, wouldn't it make more sense to move step 10 to between 2 and 3? If stream providers keep pumping messages, they'll continue triggering new activations of grains. So step 3 is kind of meaningless.
A similar issue with step 4 being after 3. If we don't stop the gateway, the silo keeps receiving new requests that trigger new activations.
In other words, wouldn't it make more sense to try to stop all inputs into the silo first: gateway, stream providers, reminders, before deactivating grains?
Makes sense to me. We stop our servers exactly like that.
I would say it's a pattern: shut the door first. :)
damn phone :)
I meant: close the door and shut the windows, before flushing the toilet :))
I agree with @sergeybykov and @yevhen regarding shutting down inputs first. I think we've encountered this because stream providers are unique in that they are the only 'active' providers we have. That is, they are the only providers that actively generate work rather than passively responding to application layer logic. Rather than treating this as a one-off, I propose we introduce the idea of active providers into the system, modeled somewhat off of the way controllable providers (IControllable) are supported.
Consider an interface: interface IActiveWorker { Task Start(); Task Stop(); }
During silo startup we start all providers that implement IActiveWorker. During silo shutdown, prior to turning off all grains, we stop all providers that implement IActiveWorker.
IActiveWorker would signal an optional behavior that could be added to any type of provider, allowing that provider to take part in the silo's lifecycle in a manner appropriate for inputs into the system.
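A hedged sketch of what that could look like; nothing here exists in Orleans today, and both the interface and the lifecycle helper are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical opt-in interface for "active" providers, as proposed above.
public interface IActiveWorker
{
    Task Start();
    Task Stop();
}

// Hypothetical silo-side helper: during startup, start every provider that
// opted in; during shutdown, stop them before deactivating application grains.
public static class ActiveWorkerLifecycle
{
    public static Task StartAll(IEnumerable<object> providers)
    {
        return Task.WhenAll(providers.OfType<IActiveWorker>().Select(p => p.Start()));
    }

    public static Task StopAll(IEnumerable<object> providers)
    {
        return Task.WhenAll(providers.OfType<IActiveWorker>().Select(p => p.Stop()));
    }
}
```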
Ok, my answers:
First, I think we need a test. Not having one was the reason we didn't notice this. Just observing, without a failing test, is good but not enough.
Second, on Jason's idea: we already have Init and Close on all providers.
Third, we do shut the door first, Yevgeny, with steps 4 and 5.
Fourth, if you stop the gateway before deactivation, you will break the scenario where a grain notifies its observer or closes a stream in deactivate. To remind you, Sergey, that scenario (exactly for Reed) was the original reason for deactivating grains during shutdown.
Fifth, Sergey, no, they won't reactivate grains, due to step 5. And even if step 5 is somehow not good, has a bug or whatever, we can harden it by raising a global flag in the catalog: "don't create". We might already have that, by moving into the shutdown membership (mbr) state; I think that tells the catalog not to activate anything anymore.
Last, based on all of the above, I wouldn't rush to "let's just change the order". The first impression, even mine, was to just change it, but then I kept thinking about it and recalled all the thinking that was already put into the current sequence. I'm not saying we shouldn't change it under any circumstances, but I suggest thinking it through thoroughly first.
Was able to reproduce the problem and validate that the root cause is as I suspected. The silo log has:
Dropping work item [Request WorkItem Name=RequestWorkItem:Id=75 global::Orleans.Streams.IPersistentStreamPullingAgent:Shutdown(), Ctx=null]: [LowPrioritySystemTarget: S127.0.0.1:22222:184921932PullingAgentSystemTarget/FF/ed38edaa@Sed38edaa] -> Request S127.0.0.1:22222:184921932PullingAgentsManagerSystemTarget/FE/97bd94a3@S97bd94a3->S127.0.0.1:22222:184921932PullingAgentSystemTarget/FF/ed38edaa@Sed38edaa #75: global::Orleans.Streams.IPersistentStreamPullingAgent:Shutdown() because applicaiton turns are stopped.
Fixed via #1015.
As for the order of shutdown operations, the current silo shutdown sequence is: a) stop application b) stop providers c) stop the system
Considerations for this order:
a) The storage provider should definitely be stopped after the application (grains) and not before, to allow deactivating grains to flush their state to storage.
b) Stream providers should also be stopped after the application (grains) and not before, to allow deactivating grains to send last pieces of state to streams and maybe signal stream completion. If we stop stream providers before grains are deactivated, grain logic that uses streams inside deactivate will throw, which basically means some operations are allowed and some are not at different stages of the grain's life, and that would break our abstractions.
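To make (b) concrete, a hedged sketch of a grain that completes a stream during deactivation (the grain, its interface, and the "sms" provider name are made up for illustration); this is the kind of code that would start throwing if stream providers were closed before grains are deactivated:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public interface ISessionGrain : IGrainWithGuidKey { }

// Hypothetical grain; "sms" stands in for whatever stream provider is configured.
public class SessionGrain : Grain, ISessionGrain
{
    private IAsyncStream<string> stream;

    public override async Task OnActivateAsync()
    {
        var provider = GetStreamProvider("sms");
        stream = provider.GetStream<string>(this.GetPrimaryKey(), "sessions");
        await base.OnActivateAsync();
    }

    public override async Task OnDeactivateAsync()
    {
        // Runs while the silo is deactivating all grains. If stream (or storage)
        // providers had already been closed at this point, these calls would
        // throw, which is why providers are stopped after the application grains.
        await stream.OnNextAsync("session closed");
        await stream.OnCompletedAsync();
        // A persistent grain would similarly call WriteStateAsync() here.
        await base.OnDeactivateAsync();
    }
}
```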
Based on the above, I don't believe we should "move step 10 to between 2 and 3".
The bug was fixed via #1015.
Hello all, I'm working on a new persistent stream provider for Orleans (Kafka based), and after updating Orleans to version 1.0.10 I noticed that during my tests (and, to be more precise, at the end of my tests) the shutdown of the Orleans testing silo takes a significant amount of time. After digging a bit inside I found out that Silo.Terminate is stuck for a long time on the function that closes the stream providers. After digging a bit more I discovered that the problem is in the PersistentStreamProvider shutdown. The PersistentStreamPullingManager is trying to tell the PullingAgent (in my test I have only one physical queue, so only one PullingAgent) to shut down, but it fails to do so, and eventually (after 30 seconds) the operation times out.

My suspicion is that because the termination of the providers happens relatively near the end of Silo.Terminate, the function has already terminated parts of Orleans that are necessary for the termination of the stream providers (specifically the PersistentStreamProviders). Changing the order of termination in the function did make the stream provider shutdown complete successfully, but unfortunately I don't have enough time or knowledge of Orleans inner workings to know what the correct order of termination in Orleans is. So I'm asking for your help :sweat_smile:
This is the current order of shutdown in Silo.Terminate (with the stream provider shutdown highlighted):

Thanks for the help