Closed cljohnso closed 8 years ago
What would be preferable to all of this messing around with shell-specific (both version and lineage) behavior is to determine a better way for galvan to force kill the entire process tree as that was the original reason for the initial script modification. It is what we do on Windows since we had no other option.
While that has its own oddities (depending on how the shell script is invoked, it might be insufficient to force all sub-processes to terminate), it will be consistent with previous versions and with the behavior we see (or at least how we approach it) on Windows.
Given that Windows already needs this (which is why we also need the ability to look up the inferior process PID) in galvan, the corresponding Unix approach would be to invoke `kill` directly: `kill -- -<PID>` (note the extra "-" when compared to `kill <PID>`).
This would then allow us to revert the shell script to its original shape.
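As a rough sketch of what the negative-PID form does (not galvan's actual code), the snippet below signals a whole process group at once. It assumes Bash with job control switched on via `set -m`, which is only one way to put a background job into its own group; the `sh -c` launcher and the `sleep` child are hypothetical stand-ins for the script and the server.

```shell
#!/usr/bin/env bash
# Assumption: with job control on, each background job gets its own process
# group, and the group id equals the pid of the job's first process.
set -m

# Hypothetical stand-in for a launcher script that starts a long-lived child.
sh -c 'sleep 30 & wait' &
leader=$!
sleep 1   # give the launcher time to start its child

# The leading "-" addresses the whole group; "--" stops option parsing so the
# negative number is not mistaken for a signal option.
kill -s TERM -- "-$leader"

wait "$leader"
status=$?
echo "launcher exit status: $status"
```

Whether a process started from the JVM ends up in a group of its own is a separate question, so this only illustrates the `kill` side of the idea.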
Coming up with a portable way to do this, without some special native, is pretty tricky. The shells on different platforms appear to implement `kill` differently and other tools, like `ps`, also differ.
I still prefer reverting the `start-tc-server.sh` script to a simpler form and moving the complexity into galvan, though, so I am continuing to dig into this. I am looking into using JNA (ipc-eventbus already uses it, under galvan) to interact with `killpg` while invoking the script in a way which may allow us to treat it as its own group. So far, this isn't working: most attempts to start the sub-process in its own group (`script`, for example) do the right thing when testing in a shell but not when run under the VM.
My current thinking is that we could find a portable invocation of `ps` which would allow us to get `PID`, `PPID`, and `command` (most likely we could just invoke `ps -o pid,ppid,command` since that seems to avoid any extensions - in isolation, it works on both Linux and OS X). Each galvan ServerProcess could then inject an additional, unique parameter at the end of the `start-tc-server.sh` command (since this appears not to cause any problems and is passed through to `TCServerMain`). We could then use it as a unique eye-catcher when parsing `ps` output for the invocation, in order to identify the specific `PID` to signal. If the additional argument is questionable, we could use the server name, but only if we could further ensure that it would be unique across concurrent tests on the same machine.
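The eye-catcher lookup could look roughly like the sketch below. The marker format, the `sh -c` loop standing in for the server, and the added `-A` flag (to list all processes rather than only those on the current terminal) are assumptions for illustration; some `ps` implementations may also truncate the command column, which is not handled here.

```shell
#!/usr/bin/env bash
# Hypothetical eye-catcher; galvan would generate one per ServerProcess.
marker="galvan-eyecatcher-$$-$RANDOM"

# Stand-in for the server: a shell that ignores its extra arguments and idles.
sh -c 'while :; do sleep 1; done' server-stand-in "$marker" &
server=$!
sleep 1   # let the stand-in start

# Scrape the PID out of ps output. The marker is handed to awk through the
# environment so that awk's own command line does not contain it.
found=$(ps -A -o pid,ppid,command |
  MARKER="$marker" awk 'index($0, ENVIRON["MARKER"]) { print $1; exit }')

kill -s TERM "$found"
wait "$server"
status=$?
echo "found pid $found (started $server), exit status $status"
```

Passing the marker as a trailing argument keeps the lookup independent of how many intermediate shells sit between galvan and the process that finally carries the argument.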
These are all ugly work-arounds but they remove all of the complexity from the `start-tc-server.sh` script, isolating testing framework concerns purely within galvan.
I had luck with the approach mentioned in the last comment: injecting an eye-catcher into the `TCServerMain` arguments, scraping it out of `ps` output, and then calling `kill` directly. This uses a version of `start-tc-server.sh` which reverts all the changes made to run the server in the background and relay signals with traps, so it takes us back to 4.x expectations, which is much more reliable. The only changes left in the script are related to some white-space handling support, added more recently.
I will do some more in-depth testing and post a PR for core and galvan, tomorrow morning.
The code in `start-tc-server.sh` relies on the shell `wait` command to coordinate exit of the script with server termination. Unfortunately, a `TERM` signal to the `start-tc-server.sh` process causes the `wait` to terminate early (while properly propagating `TERM` to the background server). (From the Bash documentation ... "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.") So, although server process termination is initiated, it is not complete when the `wait` completes and, in addition, the `wait` exit code does not reflect the exit code of the server process.
The sample Bash code below can be used to demonstrate both the interrupted `wait` and one possible way of correcting the issue. The sample solution depends on proper operation of `kill -0 $PID` -- while this is part of the POSIX standard for *NIX, there may be some systems on which it does not operate properly (see "The Zero Signal" in http://www.linux.org/threads/kill-signals-and-commands-revised.8096/). A solution relying on `kill -0` will need to be tested on each platform we support. It may be possible to avoid the use of `kill -0` by retrying `wait` until a 127 ("not an active child") is returned.
The following is the test script output from a sample run with the TERM handler enabled in the server (background) process -- as it would be in our Java server case:
Note the `exitValue` from the following TERM signal processing in `tryWait` -- 143. This is 128 + `SIGTERM` (15). The real background process exit code isn't observed for several more seconds and after a second `wait` from `tryWait` -- 7.
The following sample output is from the scripts with the TERM handler in the server (background) process disabled. This is included to demonstrate that the `wait` exit code is the same when indicating a SIGTERM-interrupted `wait` or `wait` picking up the exit code of a process without a TERM handler.
TryWaitDriver
TryWait
tryWaitBackground
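A condensed sketch of the interrupted `wait` and the `kill -0` retry described above follows. It is not the TryWait scripts themselves: the function names, the timings, and the use of 7 as the server's exit code are assumptions chosen to mirror the description.

```shell
#!/usr/bin/env bash
# Stand-in for the Java server: on TERM it needs a moment to shut down,
# then exits with 7.
server() {
  trap 'sleep 1; exit 7' TERM
  while :; do sleep 1; done
}

# Stand-in for start-tc-server.sh: relay TERM to the child, then keep
# waiting until the child is really gone.
launcher() {
  server &
  child=$!
  trap 'kill -s TERM "$child"' TERM
  wait "$child"; first=$?        # returns early (128 + signal) when TERM arrives
  status=$first
  while kill -0 "$child" 2>/dev/null; do
    wait "$child"; status=$?     # picks up the child's real exit code
  done
  echo "first wait: $first, final: $status"
  exit "$status"
}

# Driver: start the launcher, send it TERM, and collect its exit code.
out=$(mktemp)
launcher > "$out" &
launcher_pid=$!
sleep 1
kill -s TERM "$launcher_pid"
wait "$launcher_pid"; result=$?
report=$(cat "$out"); rm -f "$out"
echo "$report (launcher exit: $result)"
```

The retry loop leans on `kill -0` reporting whether the child still exists, so the same per-platform testing caveat applies to this sketch.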