Terracotta-OSS / terracotta-core

The core components of the Terracotta Server
Apache License 2.0
64 stars 58 forks source link

start-tc-server.sh does not accurately indicate server termination #237

Closed cljohnso closed 8 years ago

cljohnso commented 8 years ago

The code in start-tc-server.sh relies on the shell wait command to coordinate exit of the script with server termination. Unfortunately, a TERM signal to the start-tc-server.sh process causes the wait to terminate early (while properly propagating TERM to the background server). (From the Bash documentation ... "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.") So, although server process termination is initiated, it is not complete when the wait completes and, in addition, the wait exit code does not reflect the exit code of the server process.

The sample Bash code below can be used to demonstrate both the interrupted wait and one possible way of correcting the issue. The sample solution depends on proper operation of kill -0 $PID -- while this is part of the POSIX standard *NIX, there may be some systems for which this does not operation properly (see "The Zero Signal" in http://www.linux.org/threads/kill-signals-and-commands-revised.8096/). A solution relying on kill -0 will need to be tested on each platform we support. It may be possible to avoid the use of kill -0 by relying on retrying wait until a 127 ("not an active child") is returned.


The following is the test script output from a sample run with the TERM handler enabled in the server (background) process -- as it would be in our Java server case:

$ HANDLE_TERM=true ./tryWaitDriver
0 ./tryWaitDriver starting ./tryWait
0 ./tryWaitDriver pausing 5 seconds before sending TERM to ./tryWait PID=58584
0 ./tryWait starting ... PID=58584
0 ./tryWait Started 58586; rc=0
0 ./tryWait Entering wait 58586
0 ./tryWaitBackground started; PID=58586, PPID=58584
0 ./tryWaitBackground entering sleep 30
5 ./tryWaitDriver sending TERM to ./tryWait PID=58584
5 ./tryWaitDriver waiting for ./tryWait PID=58584 to complete
5 ./tryWait Processing TERM signal; sending TERM to 58586
5 ./tryWait Exited wait 58586; exitValue=143            <== wait interrupted by SIGTERM
5 ./tryWait Entering wait 58586                         <== re-driving wait after interruption (process not done)
30 ./tryWaitBackground received TERM
30 ./tryWaitBackground exiting with rc=7
30 ./tryWait Exited wait 58586; exitValue=7             <== wait completed by background process exit
30 ./tryWait Terminated server; rc=7
30 ./tryWait Exiting; rc=7                              <== Exiting with *real* exit code
30 ./tryWaitDriver wait complete; rc=7

Note the exitValue from the following TERM signal processing in tryWait -- 143. This is 128 + SIGTERM (15). The real background process exit code isn't observed for several more seconds and after a second wait from tryWait -- 7.


The following sample output is from the scripts with the TERM handler in the server (background) process disabled. This is included to demonstrate that the wait exit code is the same when indicating a SIGTERM-interrupted wait or wait picking up the exit code of a process without a TERM handler.

$ HANDLE_TERM=false ./tryWaitDriver
0 ./tryWaitDriver starting ./tryWait
0 ./tryWaitDriver pausing 5 seconds before sending TERM to ./tryWait PID=58648
0 ./tryWait starting ... PID=58648
0 ./tryWait Started 58650; rc=0
0 ./tryWait Entering wait 58650
0 ./tryWaitBackground started; PID=58650, PPID=58648
0 ./tryWaitBackground entering sleep 30
5 ./tryWaitDriver sending TERM to ./tryWait PID=58648
5 ./tryWaitDriver waiting for ./tryWait PID=58648 to complete
5 ./tryWait Processing TERM signal; sending TERM to 58650
5 ./tryWait Exited wait 58650; exitValue=143            <== wait interrupted by SIGTERM
5 ./tryWait Entering wait 58650                         <== re-driving wait after interruption (process not done)
5 ./tryWait Exited wait 58650; exitValue=143            <== wait completed by background process exit
5 ./tryWait Terminated server; rc=143
5 ./tryWait Exiting; rc=143
5 ./tryWaitDriver wait complete; rc=143

TryWaitDriver

#!/bin/bash
# Tests signal handling in BASH.  This script is the driver sscript for the
# ./tryWait and ./tryWaitBackground scripts used to demonstrate the effect of
# SIGTERM on a bash wait command.
#
# Run as 
#   ./tryWaitDriver
# or
#   HANDLE_TERM=true ./tryWaitDriver
# enables the TERM signal handler in ./tryWaitBackground.
# Run as
#   HANDLE_TERM=false ./tryWaitDriver
# omits the TERM single handler in ./tryWaitBackground.
#
# When run with the TERM handler in tryWaitBackground enabled, the script shows that wait 
# can be terminated by SIGTERM and the wait exit code reflects interruption of the wait 
# command and *not* termination of waited-upon (tryWaitBackground) process.  When the TERM 
# handler in tryWaitBackground is disabled, the script shows that wait is again still 
# terminated by SIGTERM with the wait exit code reflecting the wait interruption.  The code 
# in tryWait differs from the code in start-tc-server.sh in that the tryWait version repeats
# the wait (in a loop) until a "kill -0" indicates the background (server) process no longer
# exists.  This causes the wait command to be re-driven and the *second* wait provides the 
# background process exit code.
#

echo "$SECONDS $0 starting ./tryWait"
./tryWait $SECONDS &
PID=$!

echo "$SECONDS $0 pausing 5 seconds before sending TERM to ./tryWait PID=$PID"
sleep 5
echo "$SECONDS $0 sending TERM to ./tryWait PID=$PID"
kill -TERM $PID
echo "$SECONDS $0 waiting for ./tryWait PID=$PID to complete"
wait $PID
echo "$SECONDS $0 wait complete; rc=$?"

TryWait

#!/bin/bash

if [ $1 ]; then
  SECONDS=$1
fi

echo "$SECONDS $0 starting ... PID=$$"

PID=0
function cleanup_TERM {
  echo "$SECONDS $0 Processing TERM signal; sending TERM to $PID"
  kill -TERM $PID
}
trap cleanup_TERM SIGTERM

start=true
while $start; do
  ./tryWaitBackground $SECONDS &
  PID=$!
  rc=$?
  echo "$SECONDS $0 Started $PID; rc=$rc"

  # Loop on wait until the background process is no longer alive.
  while kill -0 $PID 2>/dev/null ; do
    echo "$SECONDS $0 Entering wait $PID"
    wait $PID
    exitValue=$?
    echo "$SECONDS $0 Exited wait $PID; exitValue=$exitValue"
  done

  start=false

  echo "$SECONDS $0 Terminated server; rc=$exitValue"
done

echo "$SECONDS $0 Exiting; rc=$exitValue"
exit $exitValue

tryWaitBackground

#!/bin/bash
delay=30

if [ $1 ]; then
  SECONDS=$1
fi

echo "$SECONDS $0 started; PID=$$, PPID=$PPID"

if [ "$HANDLE_TERM" == "" -o "$HANDLE_TERM" == "true" ]; then
  function handle_TERM {
    echo "$SECONDS $0 received TERM"
    echo "$SECONDS $0 exiting with rc=7"
    exit 7
  }
  trap handle_TERM SIGTERM
fi

echo "$SECONDS $0 entering sleep $delay"
sleep $delay
echo "$SECONDS $0 exiting sleep $delay; rc=$?"

echo "$SECONDS $0 exiting with rc=2"
exit 2
jd0-sag commented 8 years ago

What would be preferable to all of this messing around with shell-specific (both version and lineage) behavior is to determine a better way for galvan to force kill the entire process tree as that was the original reason for the initial script modification. It is what we do on Windows since we had no other option.

While that has its own sort of oddities (depending on the environment of how the shell script is invoked, that might be insufficient to force all sub-processes to terminate) but it will be consistent with previous versions and the behavior we would see (or at least how we approach it) on Windows.

jd0-sag commented 8 years ago

Given that Windows already needs this (which is why we also need the ability to look up the inferior process PID), in galvan, the corresponding Unix approach would be to invoke kill, directly: kill -- -<PID> (note the extra "-" when compared to kill <PID>).

This would then allow us to revert the shell script to its original shape.

jd0-sag commented 8 years ago

Coming up with a portable way to do this, without some special native, is pretty tricky. The shells on different platforms appear to implement kill differently and other tools, like ps, also differ.

I still prefer reverting the start-tc-server.sh script to a simpler form and moving the complexity into galvan, though, so I am continuing to dig into this. I am looking into some JNA (ipc-eventbus already uses it, under galvan) to interact with killpg while invoking the script in a way which may allow us to treat it as its own group. So far, this isn't working: most attempts to start the sub-process in its own group (script, for example) does the right thing when testing in a shell but not when run under the VM.

My current thinking is that we could find a portable invocation of ps which would allow us to get PID, PPID, and command (most likely we could just invoke ps -o pid,ppid,command since that seems to avoid any extenions - in isolation, it works on both Linux and OS X). Each galvan ServerProcess could then inject an additional, unique parameter to the end of the start-tc-server.sh command (since this appears to not cause any problems and is passed through to TCServerMain). This is so that we could use it as a unique eye-catcher for parsing ps output for the invocation in order to identify a specific PID to signal. If the additional argument is questionable, we could use the server name, but only if we could further ensure that it would be unique across concurrent tests on the same machine.

These are all ugly work-arounds but they remove all of the complexity from the start-tc-server.sh script, isolating testing framework concerns purely within galvan.

jd0-sag commented 8 years ago

I had luck with approach mentioned in the last comment: injecting an eye-catcher into the TCServerMain, scraping it out of ps output, and then calling kill, directly. This uses a version of start-tc-server.sh which reverts all the changes to run in the background and relay the signals with traps so it reverts us back to 4.x expectations, which is much more reliable. The only changes left in the script were related to some white-space handling support, added more recently.

I will do some more in-depth testing and post a PR for core and galvan, tomorrow morning.