FIFO process seems to be killed prematurely before stop command terminates

cfiehe commented 9 months ago

Describe the bug I am not sure, if it is a bug or if we do something wrong. We are using the FIFO sample from here: https://github.com/acassen/keepalived/blob/master/doc/samples/sample_notify_fifo.sh.

We hoped that this approach could be used to ensure that our stopping script gets executed completely and does not get killed prematurely when Keepalived stops. Unfortunately, it does not seem to work or we are missing something vital.

To Reproduce We use:

global_defs {
    enable_script_security
    max_auto_priority 30
    notification_email {
        info@mycompany.com
    }
    notification_email_from noreply-loc@mycompany.com
    router_id instance1
    smtp_connect_timeout 30
    smtp_server 127.0.0.1
    vrrp_version 3

    notify_fifo /tmp/notify_fifo
    notify_fifo_script /etc/keepalived/scripts/sample_notify_fifo.sh
}
include /etc/keepalived/conf.d/*.conf

And added the following to the FIFO loop in order to simulate a longer running command when Keepalived stops:

if [[ $STATE = STOP ]]; then
    sleep 5
    echo "Command is finished" >>$LOG_FILE
fi

while [[ $SHUTDOWN -eq 0 ]]
do
    [[ ! -p $FIFO ]] && echo FIFO $FIFO missing >>$LOG_FILE && exit 1
    exec <$FIFO
    [[ $? -ne 0 || $(stat -c "%i" $FIFO 2>/dev/null) -ne $FIFO_INODE ]] && break

    while read line; do
        PROLOGUE=$(echo "$(date +"%a %b %e %X %Y")": \[$PPID:$$\])
        set $line
        TYPE=$1
        if [[ $TYPE = INSTANCE || $TYPE = GROUP ]]; then
            VRRP_INST=${2//\"/}
            STATE=$3
            PRIORITY=$4

            # Now take whatever action is required
            echo "$PROLOGUE" $TYPE $VRRP_INST $STATE $PRIORITY >>$LOG_FILE

            if [[ $STATE = STOP ]]; then
                sleep 5
                echo "Command is finished" >>$LOG_FILE
            fi
        elif [[ $TYPE = VS ]]; then
            VS=$2
            STATE=$3

            # Now take whatever action is required
            echo "$PROLOGUE" $TYPE $VS $STATE >>$LOG_FILE
        elif [[ $TYPE = RS ]]; then
            RS=$2
            VS=$3
            STATE=$4

            # Now take whatever action is required
            echo "$PROLOGUE" $TYPE $RS $VS $STATE >>$LOG_FILE
        else
            echo "$PROLOGUE" $TYPE - unknown "($*)" >>$LOG_FILE
        fi
    done

    [[ $SHUTDOWN -eq 0 ]] && echo "$PROLOGUE" FIFO CLOSED >>$LOG_FILE
done

Expected behavior We expect that the process is not killed before the command has finished. The FIFO log gives us the following after starting and stopping Keepalived:

Fri Jan 26 12:29:17 PM 2024: [52927:52928] INSTANCE docker BACKUP 100
Fri Jan 26 12:29:21 PM 2024: [52927:52928] INSTANCE docker MASTER 105
Fri Jan 26 12:29:25 PM 2024: [52927:52928] INSTANCE docker STOP 105

The log entry Command is finished is missing in the log because the FIFO process seems to be terminated prematurely. Also the subsequent stop command in the sample is never executed.

Keepalived version

Keepalived v2.2.4 (08/21,2021)

Copyright(C) 2001-2021 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 5.15.27
Running on Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023
Distro: Ubuntu 22.04.3 LTS

configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-json --enable-bfd --enable-regex --with-init=systemd build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -ffile-prefix-map=/build/keepalived-NeItXh/keepalived-2.2.4=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS=-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2

Config options:  NFTABLES LVS REGEX VRRP VRRP_AUTH VRRP_VMAC JSON BFD OLD_CHKSUM_COMPAT SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS INIT=systemd SYSTEMD_NOTIFY

System options:  VSYSLOG MEMFD_CREATE IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK

Distro (please complete the following information):

Name: Ubuntu
Version: 22.04
Architecture: x86_64

Details of any containerisation or hosted service (e.g. AWS) If keepalived is being run in a container or on a hosted service, provide full details

Configuration file:

global_defs {
    enable_script_security
    max_auto_priority 30
    notification_email {
        info@mycompany.com
    }
    notification_email_from noreply-loc@mycompany.com
    router_id instance1
    smtp_connect_timeout 30
    smtp_server 127.0.0.1
    vrrp_version 3

    notify_fifo /tmp/notify_fifo
    notify_fifo_script /etc/keepalived/scripts/sample_notify_fifo.sh
}
include /etc/keepalived/conf.d/*.conf

track_file docker {
    file /etc/keepalived/prio-track-files/docker.prio
    init_file 0
    weight 10
}
vrrp_script check_docker {
    script /usr/libexec/keepalived/check_docker.sh
    interval 5
    timeout 5
    fall 2
    rise 3
    weight 5
}
vrrp_instance docker {
    advert_int 1
    interface ens192
    priority 100
    state MASTER
    track_file {
        docker
    }
    track_script {
        check_docker
    }
    unicast_src_ip A.B.C.67
    unicast_peer {
        A.B.C.68
    }
    virtual_ipaddress {
        A.B.C.66/24
    }
    virtual_router_id 75
#    notify "/usr/libexec/keepalived/backup-notify.sh"
}

What can we do to ensure the the FIFO process is not being killed until the command terminates?

pqarmitage commented 9 months ago

keepalived works hard to ensure that it does not leave behind any process it created, but also allows time for any processes it created, such as the notify fifo script, to terminate.

The detail of what keepalived does is

When keepalived starts shutting down, it sends SIGTERM to all the processes that it created.
For each child process that does not terminate after 1 second, it then sends SIGKILL.

So you have 1 second in which to close down, provided that you catch SIGTERM. If your script does not catch SIGTERM then it should die immediately it receives the SIGTERM.

So I think your best solution is to catch SIGTERM, and in the signal handler execute another script. So you could add something like the following:

trap shutdown SIGTERM

shutdown()
{
    /etc/keepalived/scripts/shutdown.sh &
    exit 0
}

and in /etc/keepalived/scripts/shutdown.sh do whatever you want to do that takes more than 1 second.

I haven't tested this, but I think it shold work.

Any feedback would be appreciated.

cfiehe commented 9 months ago

Hi @pqarmitage, thanks a lot. I have added the code snippet in order to catch the SIGTERM, but the result is the same. The script is only able to execute some steps of the stopping procedure and gets killed prematurely before regular termination. I have to stop a container and unmount some devices. This takes approximately 2 or 3 seconds.

pqarmitage commented 9 months ago

The key thing is to run another script in background from the SIGTERM signal handler. This script being run in background will not be terminated by keepalived, and can take whatever actions you need for however long it takes.

As an example I have modified sample_notify_fifo.sh with the following patch:

diff --git a/doc/samples/sample_notify_fifo.sh b/doc/samples/sample_notify_fifo.sh
index ecda8f3..c061eef 100755
--- a/doc/samples/sample_notify_fifo.sh
+++ b/tmp/sample_notify_fifo.sh
@@ -66,6 +66,8 @@ start_shutdown()
        ( sleep 0.5
          kill -ALRM $$ 2>/dev/null
        ) &
+
+       /tmp/handle_shutdown.sh &
 }

 trap stopping HUP INT QUIT USR1 USR2 PIPE ALRM

/tmp/handle_shutdown.sh:

#!/bin/bash

for i in $(seq 1 10); do
    echo $(date) - run $i >>/tmp/shutdown.log
    sleep 1
done

exit 0

When keepalived shuts down, I have the following keepalived log entries:

Sat Jan 27 16:02:41.350581032 2024: (docker) sent 0 priority
Sat Jan 27 16:02:41.350608480 2024: (docker) removing VIPs.
Sat Jan 27 16:02:42.361933472 2024: Stopped - used (self/children) 0.004787/0.004658 user time, 0.015321/0.022698 system time

The contents of /tmp/shutdown.log are:

Sat 27 Jan 16:02:41 GMT 2024 - run 1
Sat 27 Jan 16:02:42 GMT 2024 - run 2
Sat 27 Jan 16:02:43 GMT 2024 - run 3
Sat 27 Jan 16:02:44 GMT 2024 - run 4
Sat 27 Jan 16:02:45 GMT 2024 - run 5
Sat 27 Jan 16:02:46 GMT 2024 - run 6
Sat 27 Jan 16:02:47 GMT 2024 - run 7
Sat 27 Jan 16:02:48 GMT 2024 - run 8
Sat 27 Jan 16:02:49 GMT 2024 - run 9
Sat 27 Jan 16:02:50 GMT 2024 - run 10

This shows that the handle_shutdown.sh script continues to run (in this case for 8 seconds) after keepalived has terminated.

cfiehe commented 9 months ago

The main problem with a background process is that there is the theoretical risk of a race condition when (re-)start and shutdown overlap because no sequential order is guaranteed out of the box.

I am experimenting with a solution where the notify handler is controlled by a separate systemd unit and runs in it own scope. Coupling between Keepalived and the notify handler is done via the fifo queue. In that case Keepalived does not manage the process lifecycle of the notify handler.

What do you think: Does this solution have any drawbacks?

pqarmitage commented 9 months ago

I can't think of any drawbacks to your approach, and when I designed the FIFO interface I very much had in mind the sort of scenario you describe, as well as the one where keepalived runs the fifo reading script as you were doing.

acassen / keepalived

FIFO process seems to be killed prematurely before stop command terminates #2372