activation.type: launch - proxy_sample gets terminated if we manually kill out of process module

kiranpradeep commented 7 years ago

On Ubuntu 14.04 LTS, the samples/proxy_sample application crashes if we manually kill (kill -9 pid) the launched process (proxy_sample_remote). This happens only when the launch option (activation.type: launch) is used.

If we try with activation.type: none and kill the proxy_sample_remote process, the gateway process (proxy_sample) continues without impact.

zafields commented 7 years ago

Are you able to kill without -9?

kiranpradeep commented 7 years ago

@zafields Yes. I am able to kill without -9, and the same crash repeats. Copying the terminal output below:

username@machine_name:~/kiran/azure-iot-gateway-sdk/build$ ./samples/proxy_sample/proxy_sample ./launch_sample_lin.json 
gateway successfully created from JSON
gateway shall run until ENTER is pressed
Error: Time:Fri Jun 30 10:35:44 2017 File:/home/862537/kiran/azure-iot-gateway-sdk/proxy/outprocess/src/module_loaders/outprocess_module.c Func:Outprocess_Destroy Line:955 unable to send destroy control message [0x7fb89bd6afc0], continuing with module destroy
username@machine_name:~/kiran/azure-iot-gateway-sdk/build$

zafields commented 7 years ago

I will need to investigate, and I will get back to you.

zafields commented 7 years ago

I'm still looking into this... I'm setting up a repro environment. I've also been looking into #320 at the same time to see if there is some sort of relationship.

zafields commented 7 years ago

@darobs I've added you to this issue for a morning brainstorming session about the best resolution to this behavior. @kiranpradeep I have confirmed there is no relationship to #320 and should have this fixed before long.

zafields commented 7 years ago

@kiranpradeep After researching the behavior and discussing it with the team, we have determined that the behavior you are experiencing is by design.

When activation.type: "launch" is specified, it is assumed the gateway has full control of the life cycle of the module, exactly as if the module were running in the same process. Signaling an out-of-process module with an activation type of launch is treated the same as if an in-process module received a signal or encountered an error (e.g. a bad memory access). When activation.type: "none" is specified, you are assumed to be in control of the life cycle of the out-of-process module, and you can signal, kill, or exit the program as you wish.

I believe the real problem is the unhelpful error message you are receiving, and I have created issue #339 to track it.

We acknowledge there is a lot of room for richer functionality in this area, and we would be interested to hear your specific use case and needs to see if we can model your feedback into a future enhancement.

kiranpradeep commented 7 years ago

@zafields Thanks. Is there any advantage to the decision that activation.type: "launch" has to match in-proc module behavior? Why not let the user decide (via a param or callback) whether to abort or continue when an out-of-proc module fails?

Use case: I have multiple data acquisition modules (GPS, BLE, etc.) on the gateway. I didn't want a crash in one of the modules to bring down all of the modules, so I chose to run them out-of-proc. I felt it would be clean if a single gateway/controller started all the data acquisition modules. Maybe I am not using "launch" for the functionality it was intended for.

zafields commented 7 years ago

You are the consumer, so you ARE definitely using it correctly! 😄 In fact, you have identified a use case that we want to support, but we were having difficulty deciding how to expose this functionality to users. Once you start exposing this functionality it seems to cascade into an enormous set of parameters to cover all use cases.

I understand your scenario; you did a great job of describing it. However, I have a couple of clarifying questions to help us understand your needs. In this scenario, how would your gateway be impacted if a module dropped and you chose to continue rather than abort? Also, if you were notified that an out-of-proc module had died and you elected to have the gateway continue, what would your next steps be? Given your example, how does a gateway carry on without a module? How do you respond to this information? Would you want to restart the module? What tools do you need from us to make your gateway achieve its purpose/goal?

kiranpradeep commented 7 years ago

@zafields

1. No. I don't want iot-edge to restart a module. That would also make iot-edge some sort of big init system, which IMHO is not an iot-edge responsibility.
2. All data acquisition modules just feed data to the message bus, and the death of one module has no impact on the other data acquisition modules. But on notification of a dead out-of-proc module (say GPS), one of the modules, the iothub client (azure-iot-sdk-c), is supposed to raise an alarm event to Azure IoT Hub. This alarm is supposed to trigger a set of actions on the server, like remote troubleshooting. If iot-edge chooses to bring down all modules together, we don't stand a chance of raising an alarm. It is one of the remaining live modules (iothub-client) that facilitates a remote troubleshooting channel back into the gateway. Also, other data acquisition modules like BLE could still continue to transmit data to Azure via iothub-client.
3. I am happy with the tools provided by iot-edge and don't need much more. But azure-iot-sdk-c is harder, and they push most requests to enhancements, which I don't see happening any time soon.

kiranpradeep commented 7 years ago

@zafields On second thought, the request for notification on out-of-proc module death (point 2) was specific to my application's needs. I now think iot-edge is not a process monitoring system, so notification is not a responsibility of iot-edge. iot-edge already gives us a way to talk among ourselves (publish/receive), and with that we (lib users) could build our own mechanisms to see who is alive or dead.
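The "build our own mechanisms" idea can be sketched as a heartbeat tracker. This is only an illustration in Python, not iot-edge code: `LivenessTracker`, `on_heartbeat`, and `dead_modules` are hypothetical names, and it assumes each module periodically publishes a heartbeat message that some monitoring module receives.

```python
import time


class LivenessTracker:
    """Remembers when each module last sent a heartbeat (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def on_heartbeat(self, module_name):
        # Assumed to be called from the monitor's receive callback
        # whenever a heartbeat message arrives on the message bus.
        self._last_seen[module_name] = self._clock()

    def dead_modules(self):
        # A module is presumed dead once its last heartbeat is older
        # than the timeout. Modules never seen are not reported.
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    fake_now = [0.0]                             # fake clock for the demo
    tracker = LivenessTracker(timeout_s=5.0, clock=lambda: fake_now[0])
    tracker.on_heartbeat("gps")
    tracker.on_heartbeat("ble")
    fake_now[0] = 4.0
    tracker.on_heartbeat("ble")                  # ble keeps reporting
    fake_now[0] = 7.0                            # gps has been silent for 7 s
    print(tracker.dead_modules())
```

A module such as the iothub client could poll `dead_modules()` and raise its alarm event when the list is non-empty, which keeps monitoring in the application rather than in the gateway.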

But what we expect is: 1) Don't kill all of us because one of us was bad. Let us live and decide for ourselves what to do. If you shut us down on the very first out-of-proc module death, we cannot do anything. 2) "activation.type": "launch" should launch all good out-of-proc modules, ignoring any that cannot load within the "grace.period.ms" time. No restarting. No monitoring. Ignore dead or unresponsive modules.
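The two expectations above can be modeled with plain child processes. This is a rough sketch of the requested policy in Python, not of how the gateway behaves today: `launch_all` and `surviving` are hypothetical helpers, and the grace period is interpreted the way this comment describes it (time allowed for a module to come up).

```python
import subprocess
import sys
import time


def launch_all(commands, grace_period_s):
    """Start every command; keep only those still running after the grace period."""
    started = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    time.sleep(grace_period_s)
    alive = {}
    for name, proc in started.items():
        if proc.poll() is None:
            alive[name] = proc   # still running: keep it
        # Otherwise it exited during the grace period: ignore it, no restart.
    return alive


def surviving(alive):
    """Names of processes that are still running; the others are simply ignored."""
    return sorted(name for name, proc in alive.items() if proc.poll() is None)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    commands = {
        "gps": [sys.executable, "-c", "import sys; sys.exit(1)"],  # fails at startup
        "ble": sleeper,
        "hub": sleeper,
    }
    alive = launch_all(commands, grace_period_s=1.0)
    print(sorted(alive))         # modules that came up within the grace period

    alive["ble"].kill()          # simulate a manual kill of one module
    alive["ble"].wait()
    print(surviving(alive))      # the remaining modules are left untouched

    for proc in alive.values():  # clean up the demo processes
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

The point of the sketch is that one child's death is only observed, never propagated: the parent neither restarts the dead module nor tears down its siblings.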

zafields commented 7 years ago

@kiranpradeep Great insight! Let me run it around the yard and get back to you. Cheers!

Azure / iot-edge-v1

activation.type: launch - proxy_sample gets terminated if we manually kill out of process module #326