JKISoftware / JKI-State-Machine-Objects

Object-oriented framework for LabVIEW based on the JKI State Machine
BSD 3-Clause "New" or "Revised" License

Stop SMO Zombie Processes when Creator Call Chain Goes Idle #25

Closed jimkring closed 7 years ago

jimkring commented 7 years ago

Currently, an SMO will turn into a "zombie" and wait forever if the call chain of the creator goes idle -- this happens because LabVIEW garbage collects all DVRs, Event References, Refnums, etc. created in the constructor. The "Wait for Stop or Abort" Event Structure, shown below, will wait forever and never time out.

2016-10-03_18-21-59

The symptom of this is that if the user's main VI is aborted (e.g. a hard stop, or it exits without stopping/destroying the SMOs), then the SMOs will remain locked, because their Process VIs are still running. The only way to unlock the SMOs is to close the project or restart LabVIEW.

A solution would be for the process to respond to the invalid refnums and stop waiting.

2016-10-04_16-56-19

However, this solution of simply finishing the wait will not stop all the child processes. And, unfortunately, the AbortProcesses and StopProcesses events are now invalid (they've been garbage collected), so there's no good way to message into the child process loops.

Because we can't message into the child process loops (and the DVRs holding the SMO's data are gone too), there's not much point in trying to exit gracefully. So a reasonable solution is to kill the zombie by shooting it in the brain, i.e. calling the Stop function primitive.

2016-10-04_17-20-47

This works well (at least it's easier for users than closing their project or restarting LabVIEW).
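
For readers who prefer text to block diagrams, here is a rough analog of this polling approach in Python (all names are hypothetical, and a garbage-collected sentinel object stands in for the invalidated DVR, since Python lifetimes are reference-counted rather than tied to a call chain): the process loop periodically checks whether the creator's reference still exists and stops itself when it does not.

```python
import threading
import time
import weakref


class Sentinel:
    """Stands in for a reference owned by the creator's call chain."""


def process_loop(creator_ref):
    """Hypothetical SMO process: do work, but poll the creator's sentinel
    and stop on its own once the sentinel has been garbage collected."""
    while True:
        if creator_ref() is None:      # creator is gone -> stop ourselves
            print("creator gone, stopping process")
            return
        # ... handle one iteration of real work here ...
        time.sleep(0.1)                # polling interval


def creator():
    sentinel = Sentinel()              # dropped when this function returns
    worker = threading.Thread(target=process_loop,
                              args=(weakref.ref(sentinel),),
                              daemon=True)
    worker.start()
    time.sleep(0.5)                    # the creator then "goes idle"
    return worker                      # sentinel dies here


if __name__ == "__main__":
    w = creator()
    w.join(timeout=2)                  # worker notices and exits on its own
    print("worker still alive:", w.is_alive())
```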

One improvement that could be made to this solution, instead of polling the reference validity, would be to wait in an event-driven fashion. My recommendation would be to create a single-element queue in the SMO constructor and wait on this queue (for it to go invalid) in parallel with the "Wait for Stop or Abort" Event Structure.

Here's a possible event-driven solution (that does not require polling the DVR):

2016-10-05_09-51-04
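
A rough Python sketch of this event-driven variant (again with hypothetical names; a finalizer on a handle object stands in for the single-element queue going invalid): the process blocks with no timeout and is released either by an explicit stop or by the caller abandoning its handle, so no polling is needed.

```python
import threading
import time
import weakref


class Handle:
    """Held only by the caller; its destruction releases the process."""


class SMO:
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._process, daemon=True)

    def create(self):
        """Hypothetical constructor: hands the caller a handle and deliberately
        keeps no reference to it, so its lifetime is the caller's lifetime."""
        handle = Handle()
        weakref.finalize(handle, self._stop.set)   # fires when the handle dies
        self._thread.start()
        return handle

    def _process(self):
        self._stop.wait()              # no timeout, no polling
        print("process stopping")


if __name__ == "__main__":
    def caller():
        smo = SMO()
        handle = smo.create()
        # ... caller's work ...
        # handle goes out of scope here without an explicit stop

    caller()
    time.sleep(0.2)                    # give the process a moment to react
```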

I'm curious what others think about this.

drjdpowell commented 7 years ago

In “Messenger Library” I used the second solution: a separate watchdog that waits on a queue whose only purpose is to signal, by its destruction, that the calling code has gone idle. The watchdog then sends an “autoshutdown” message to the actor. This eliminates the issue of “zombie” processes. In fact, I usually don’t bother with explicit shutdown code for most actors.
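
Sketched very loosely in Python (hypothetical names; a finalizer on a lifetime token stands in for the destroyed queue, since Python queues are not invalidated when their creator goes idle), the watchdog's only job is to notice the caller has gone away and then post an ordinary “autoshutdown” message into the actor's mailbox, so the actor's normal shutdown handling still runs:

```python
import queue
import threading
import time
import weakref


class LifetimeToken:
    """Held by the calling code; its destruction means the caller went idle."""


def actor_loop(mailbox):
    """Normal message handling; "autoshutdown" goes through the same path as
    an explicit stop request, so the usual cleanup code runs."""
    while True:
        msg = mailbox.get()
        if msg in ("stop", "autoshutdown"):
            print(f"actor shutting down on {msg!r}")
            return
        # ... handle ordinary messages here ...


def watchdog(caller_gone, mailbox):
    """Separate watchdog whose only purpose is to wait for the caller to
    disappear and then ask the actor to shut itself down."""
    caller_gone.wait()
    mailbox.put("autoshutdown")


def start_actor():
    mailbox = queue.Queue()
    caller_gone = threading.Event()
    threading.Thread(target=actor_loop, args=(mailbox,), daemon=True).start()
    threading.Thread(target=watchdog, args=(caller_gone, mailbox),
                     daemon=True).start()
    token = LifetimeToken()
    weakref.finalize(token, caller_gone.set)   # fires when the token dies
    return mailbox, token


if __name__ == "__main__":
    def caller():
        mailbox, token = start_actor()
        mailbox.put("do some work")
        # caller returns without sending "stop"; token is dropped here

    caller()
    time.sleep(0.2)    # watchdog fires and the actor exits cleanly
```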

BTW, why is your SMO’s DVR created in, and thus its lifetime tied to, the calling code? If it is created in the SMO (and passed back to the caller), then you wouldn’t have to “shoot it in the brain”. It could shut itself down cleanly.

jimkring commented 7 years ago

Hi @drjdpowell - yes, I like your suggestion of having the references created in the actor; we were just discussing that this morning. At this point it's a bit of an architectural change, but I think it's doable. We would need to tweak how/where the start/stop operations and states are handled.

jesseroow commented 7 years ago

I would like to vote this up - I was going to create an issue if no one had already.

Perhaps I am a bad developer, but in practice I end up using the abort button when developing. A key part of the value of using a pre-built framework like SMO is having good cleanup and watchdog capabilities built in. Also, as long as zombies can exist, the framework has the potential to hang executables on exit.

francois-normandin commented 7 years ago

I like the idea of having the process create the references... in theory. For the particular case of SMOs, there is nothing that prevents the developer from using the object without starting the process. In such a use case, the references would still need to be created outside of the main process, so it seems an overcomplication of the framework to support this in the base class.

For example, one thing that the framework allows me to do is to create public methods that will be synchronous if the process is not started (aka wrap the protected/private method), and asynchronous if the process is started (aka send a message to the process to perform the action). This makes creating unit tests very easy, as one can test all the functionality without having to launch a process. It also makes it simple to create a library that makes use of the SMO framework for dependency management without having to design an object as an active object. Requiring a process to be launched as part of the creation of the object makes the object active before the developer chooses it to be. IMO, it overconstrains the framework by enforcing a choice that a developer might want to do without. Granted, the behavior is desirable in most use cases, but my point is that it is not desirable in 100% of the use cases.
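
As a rough Python illustration of that synchronous/asynchronous wrapping (class and method names are made up for the sketch): the public method calls the protected implementation directly when no process is running, and posts a message to the process otherwise.

```python
import queue
import threading


class Smo:
    """Hypothetical SMO-style class: usable by value with no process,
    or as an active object once start() has been called."""

    def __init__(self):
        self._mailbox = None           # no process until start() is called
        self._count = 0

    # protected implementation, shared by both paths
    def _increment(self, amount):
        self._count += amount
        return self._count

    # public method: synchronous without a process, asynchronous with one
    def increment(self, amount):
        if self._mailbox is None:
            return self._increment(amount)         # synchronous call
        self._mailbox.put(("increment", amount))   # message to the process
        return None

    def start(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._process)
        self._thread.start()

    def stop(self):
        self._mailbox.put(("stop", None))
        self._thread.join()

    def _process(self):
        while True:
            name, arg = self._mailbox.get()
            if name == "stop":
                return
            if name == "increment":
                print("count is now", self._increment(arg))


if __name__ == "__main__":
    smo = Smo()
    assert smo.increment(2) == 2       # synchronous: nothing launched

    smo.start()
    smo.increment(3)                   # asynchronous: handled by the process
    smo.stop()
```

The synchronous path is what makes unit testing cheap here: the first call above never has to launch or tear down a process.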

For that reason, I'd recommend that we go with Jim's suggestion of using a queue, and that terminating the process be the way to get rid of zombie VIs. In addition, I'd add a hook in that loop to allow a user to override a method that can be used to log or notify the user that this happened.

Having the top-level app close before all its parts are closed is the real issue here: it may happen during development and it is an annoyance to have to close the project, so it needs to be handled somehow. But having runaway processes in the final application is a problem that should be addressed by making sure the parts are destroyed before the top-level app goes out of memory, not simply by discarding them when it happens.

Graceful shutdown can be achieved by using a byValue design for the inner process workings, or by using hooks in the onStart() override to request that critical references be created by the process on startup. The onProcessStarted() override can then be used to make sure those critical references were all created successfully by the process. I'd prefer to see examples of how to design such applications, so the community can develop best practices, rather than having the base class enforce and constrain the user to a certain workflow.
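
One possible shape of that hook-based approach, sketched in Python (the hook names are borrowed from the comment above; everything else is hypothetical): on_start() runs inside the process and creates the critical references there, on_process_started() verifies they came up, and on_stop() releases them, so the process owns its references end to end.

```python
import queue
import threading


class BaseProcess:
    """Hypothetical base class: the hooks run inside the process, so anything
    created in on_start() lives and dies with the process itself."""

    def start(self):
        self._started = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()
        self._started.wait()            # caller resumes once startup was checked

    def _run(self):
        self.on_start()                 # create critical references here
        ok = self.on_process_started()  # verify they were created successfully
        self._started.set()
        if ok:
            self.main_loop()
        self.on_stop()                  # release everything the process created

    # default hooks, meant to be overridden
    def on_start(self): ...
    def on_process_started(self): return True
    def on_stop(self): ...
    def main_loop(self): ...


class Logger(BaseProcess):
    def on_start(self):
        self.mailbox = queue.Queue()    # "critical reference" owned by the process

    def on_process_started(self):
        return self.mailbox is not None

    def main_loop(self):
        while (msg := self.mailbox.get()) != "stop":
            print("log:", msg)

    def on_stop(self):
        self.mailbox = None


if __name__ == "__main__":
    logger = Logger()
    logger.start()
    logger.mailbox.put("hello")
    logger.mailbox.put("stop")
    logger._thread.join()
```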

My general philosophy about SMOs: the base classes are agnostic to byValue/byReference designs, active objects or not. They are, at all times, a hybrid of both and should never impose a choice on the developer who extends the base classes. If there is a way to add features without impacting the developer's choice in that matter, then by all means; otherwise those features should live in more specific implementations (or templates) and not in the base package.

francois-normandin commented 7 years ago

Addition of the ProcessKill watchdog will be available in the next build.

image

drjdpowell commented 7 years ago

I haven’t looked at your dependency stuff; I thought this was only an “active object”/“actor” framework.

It also makes it simple to create a library that make use of the SMO framework for dependency management without having to design an object as an active object.

Shouldn’t your top-level parent class be “Dependency”, with “SMO” as a child, then? With "Dependency:Create Dependency” and “SMO:Start” methods?

This makes creating unit tests very easy as one can test all the functionality without having to launch a process.

If one has “Request-Reply” (reply to message, that is) then it is easy to use synchronous “Send message and wait for reply” to write tests on the running process, while still receiving the replies asynchronously in the application. If your process launches and communicates with subprocesses, then you need a running process to properly test. Easy testing is one of the advantages of true “Request-Reply” over “Request-Response” (which seems to be the more commonly used pattern in LabVIEW frameworks).
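
A small Python sketch of that distinction (hypothetical names): send() returns a reply token for every request, so a unit test can block on it synchronously while the application receives the same replies asynchronously via a callback.

```python
import queue
import threading
from concurrent.futures import Future


class EchoActor:
    """Hypothetical actor with true Request-Reply: every message carries a
    Future that the process completes with its reply."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._process)
        self._thread.start()

    def send(self, request) -> Future:
        reply = Future()
        self._mailbox.put((request, reply))
        return reply

    def stop(self):
        self.send("stop").result()
        self._thread.join()

    def _process(self):
        while True:
            request, reply = self._mailbox.get()
            if request == "stop":
                reply.set_result("stopped")
                return
            reply.set_result(f"echo: {request}")


if __name__ == "__main__":
    actor = EchoActor()

    # In a test: send and wait for the reply synchronously.
    assert actor.send("ping").result(timeout=1.0) == "echo: ping"

    # In an application: attach a callback and keep going asynchronously.
    actor.send("pong").add_done_callback(lambda f: print(f.result()))

    actor.stop()
```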

francois-normandin commented 7 years ago

Shouldn’t your top-level parent class be “Dependency”, with “SMO” as a child, then? With "Dependency:Create Dependency” and “SMO:Start” methods?

I understand what you mean and I've used this pattern in the past as well. Semantically speaking, SMOs are not always dependencies of other SMOs, so it might be confusing for beginners to understand the relationships, even though it would be abstracted. SMOs can aggregate other SMOs easily enough that the extra layer of inheritance is not mandatory for self-consistency.

I thought this was only an “active object”/“actor” framework.

It is one of many things :-) I personally use this SMO base class as the parent for almost everything, from low-level drivers (byValue, no process) to systems, servers, UIs, LUTs, etc. SMOs can be integrated into other actor frameworks without much overhead as well, so each SMO can be run headless (i.e. not be a dependency of another SMO, although they would be a dependency of whatever owns their lifetime, of course).

Easy testing is one of the advantages of true “Request-Reply” over “Request-Response” (which seems to be the more commonly used pattern in LabVIEW frameworks).

I agree. So far I've resisted the urge to add those mechanisms as part of the base class. Do I always need an asynchronous messaging pattern? Obviously not. Does it hurt to make the base class larger to make this pattern available if I need it? Probably not, in most cases. And that's where I'm not sure... What if it matters 1% of the time? Will I branch the repo, rename the base class, and strip out all the irrelevant pieces I don't want to use? That seems senseless. What if instead one extends the SMO base class and creates a Request-Reply-ready class to inherit from? That's the way I've gone so far, but I'm open to being persuaded otherwise. Perhaps another thread :-)