acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0
4k stars 736 forks source link

Last status of misc check sometimes not updated in keepalived_check.data #2412

Closed pwloc closed 5 months ago

pwloc commented 6 months ago

Describe the bug I've got virtual server configured and real_server has the following misc check configured. The check returns 0 or 1 that can be found in the file or 2 if the file is not present.

        MISC_CHECK {
            delay_loop 10
            misc_path "/bin/bash -c 'exit $(cat /tmp/x1 2>/dev/null || echo 2 )'"
            log_all_failures true
        }

Whenever exit code changes between 2 and 1 (or greater value) the change is not present in keepalived_check.data. Of course data file is dumped again. "Last ran" gets updated but "Last status" not

   Last ran = 1715087852.956447 (Tue May  7 13:17:32 2024.956447)
   Last status = 2

When exit code changes from 0 to 1 or 2 (or vice versa) there is no problem.

To Reproduce

  1. configure real_server with misc check mentioned above
  2. echo 2 > /tmp/x1
  3. wait for the check to run
  4. dump data file with kill -SIGUSR1
  5. check "Last status" in data file should be 2
  6. echo 1 > /tmp/x1
  7. wait for the check to run
  8. dump data file with kill -SIGUSR1
  9. check "Last status" in my case it's still 2 but I would expect it to be 1

Expected behavior keepalived_check.data is updated every time check's exit code changes.

Keepalived version

# keepalived -v
Keepalived v2.2.8 (05/31,2023)

Copyright(C) 2001-2023 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 6.3.0
Running on Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
Distro: Alpine Linux v3.18

configure options: --disable-dynamic-linking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --localstatedir=/var --mandir=/usr/share/man --enable-bfd --enable-snmp --enable-snmp-rfc --disable-iptables --enable-json

Config options:  LVS VRRP VRRP_AUTH VRRP_VMAC JSON BFD OLD_CHKSUM_COMPAT SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 INIT=systemd

System options:  VSYSLOG MEMFD_CREATE IPV6_MULTICAST_ALL IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION NETINET_LINUX_IF_ETHER_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_IPVLAN IFLA_LINK_NETNSID INET6_ADDR_GEN_MODE VRF SO_MARK

Distro (please complete the following information):

Details of any containerisation or hosted service (e.g. AWS) Alpine running in docker container running on Ubuntu 22.04.2

pqarmitage commented 6 months ago

If the script returns a non-zero exit code (and the checker does not have misc_dynamic enabled), the last_status is only updated if the previous exit code was 0 (i.e. success). In other words, for a non-dynamic checker the "Last status" is only updated when the checker changes status from success to failure or vice versa. The reason for this is that the exit code changing from 1 to 2, or 2 to 1, or 100 to 1, etc, is not changing the status of the checker, i.e. it remains in the failed state. The detail is in the _misc_check_childthread() function in the keepalived/check/check_misc.c source file.

keepalived is functioning correctly, other than the Last status reported in the diagnostic output may be misleading.

If it is important for you that the Last status field in the diagnostic output is more than just, in essence, either zero or non-zero, could you please explain why, and I will consider changing the code. If you are using SUGUSR1 to write keepalived_check.data in order to check the last exit code returned by your script, it would be much more efficient for the script to write the exit code it will return to a diagnostic file.

pwloc commented 6 months ago

Thanks for explaining the behavior. What I'm trying to do is to monitor the state of keepalived's checks so I could know not only if but also why given real server is down. Misc check is not my only check. I also use BFD check. Therefore I prefer to get all checks status from one place that is from keepalived. Moreover, this way I get real checks' state not the state of diagnostic files. I agree that check status (up/down) is more important but exit code provides some extra troubleshooting info, at least in my case. Last status is present in the state file anyway so having it updated like "Last ran" is would be great.

Btw. have you considered including checks' statuses (up/down at least) in the SNMP MIB ? Dumping and parsing state files doesn't seem to be the best method for such use case and I guess is not intended for that purpose.

pqarmitage commented 6 months ago

@pwloc I hadn't considered adding checker information to the SNMP MIB (I find adding extra SNMP code very tedious!). I wonder if it would be more appropriate to use traps to notify status changes. I would appreciate your thoughts on this.

If you or someone else would like to modify the code to add the extra SNMP functionality and submit a pull request, it would be very helpful.

pwloc commented 6 months ago

I understand. SNMP is not easy protocol to work with. As for SNMP traps they don't fit my architecture. I need a place to query for checks status. I would have to add traps listener to hold the state. Prefer to stick with data file.

As for implementing SNMP features, I'm affraid that you overestimate my C skills. I'll see If I could find someone else but I find chances slim.

pqarmitage commented 5 months ago

Commit a4258a6 now updates the exit code if it changes, even if the checker is already down.