
Keepalived
https://www.keepalived.org
GNU General Public License v2.0

Keepalived STATE change frequently happening - Keepalived fluctuating #2469

Closed Adi-AA closed 1 month ago

Adi-AA commented 2 months ago

Describe the issue
Keepalived keeps changing STATE (MASTER to BACKUP and vice versa) frequently, disrupting nginx-plus load-balanced traffic in our prod env. Because of this we see health check failures for our nginx upstreams even though the servers are healthy.

To Reproduce
It is currently happening; I have the logs.

Expected behavior
I would like keepalived to stop the frequent STATE changes so that our prod env is stable.

Keepalived version
Output of keepalived -v:

[root@esdfwlbp0004 keepalived]# keepalived -v
Keepalived v2.2.8 (04/04,2023), git commit v2.2.7-154-g292b299e+

Copyright(C) 2001-2023 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 1 04:16:12 EDT 2024
Distro: Red Hat Enterprise Linux 8.10 (Ootpa)

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-bfd --disable-lvs --disable-snmp --with-init=none build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options: VRRP VRRP_AUTH VRRP_VMAC BFD OLD_CHKSUM_COMPAT INIT=none

System options: VSYSLOG MEMFD_CREATE IPV4_DEVCONF RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK [root@esdfwlbp0004 keepalived]#

[root@esdfwlbp0003 keepalived]# keepalived -v
Keepalived v2.2.8 (04/04,2023), git commit v2.2.7-154-g292b299e+

Copyright(C) 2001-2023 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 1 04:16:12 EDT 2024
Distro: Red Hat Enterprise Linux 8.10 (Ootpa)

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-bfd --disable-lvs --disable-snmp --with-init=none build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options: VRRP VRRP_AUTH VRRP_VMAC BFD OLD_CHKSUM_COMPAT INIT=none

System options: VSYSLOG MEMFD_CREATE IPV4_DEVCONF RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK [root@esdfwlbp0003 keepalived]#

Distro (please complete the following information): Red Hat Enterprise Linux 8.10 (Ootpa), see the keepalived -v output above.

Details of any containerisation or hosted service (e.g. AWS): N/A

Configuration file:

[root@esdfwlbp0003 keepalived]# cat /etc/keepalived/keepalived.conf
global_defs { vrrp_version 3 }

vrrp_script chk_manual_failover {
    script "/usr/libexec/keepalived/nginx-ha-manual-failover"
    interval 10
    weight 50
}

vrrp_script chk_nginx_service {
    script "/usr/libexec/keepalived/nginx-ha-check"
    interval 3
    weight 50
}

vrrp_instance VI_1 { interface ens192 priority 101 virtual_router_id 51 advert_int 1 accept garp_master_refresh 5 garp_master_refresh_repeat 1 unicast_src_ip 10.162.84.5 unicast_peer { 10.162.84.6 } virtual_ipaddress { 10.162.84.7 10.162.84.8 10.162.84.9 10.162.84.10 10.162.84.11 10.162.84.12 10.162.84.13 10.162.84.14 10.162.84.15 10.162.84.16 10.162.84.17 10.162.84.18 10.162.84.19 10.162.84.20 10.162.84.21 10.162.84.22 10.162.84.23 10.162.84.24 10.162.84.25 10.162.84.26 10.162.84.27 10.162.84.28 10.162.84.29 10.162.84.30 10.162.84.31 10.162.84.32 10.162.84.33 10.162.84.34 10.162.84.35 10.162.84.36 10.162.84.37 10.162.84.38 10.162.84.39 10.162.84.40 10.162.84.41 10.162.84.42 10.162.84.43 10.162.84.44 10.162.84.45 10.162.84.46 10.162.84.47 10.162.84.48 10.162.84.49 10.162.84.50 10.162.84.51 10.162.84.52 10.162.84.53 10.162.84.54 10.162.84.55 10.162.84.56 10.162.84.57 10.162.84.58 10.162.84.59 10.162.84.60 10.162.84.61 10.162.84.62 10.162.84.63 10.162.84.64 10.162.84.65 10.162.84.66 10.162.84.67 10.162.84.68 10.162.84.69 10.162.84.70 10.162.84.71 10.162.84.72 10.162.84.73 10.162.84.74 10.162.84.75 10.162.84.76 10.162.84.77 10.162.84.78 10.162.84.79 10.162.84.80 10.162.84.81 10.162.84.82 10.162.84.83 10.162.84.84 10.162.84.85 10.162.84.86 10.162.84.87 10.162.84.88 10.162.84.89 10.162.84.90 10.162.84.91 10.162.84.92 10.162.84.93 10.162.84.94 10.162.84.95 10.162.84.96 10.162.84.97 10.162.84.98 10.162.84.99 10.162.84.100 10.162.84.101 10.162.84.102 10.162.84.103 10.162.84.104 10.162.84.105 10.162.84.106 10.162.84.107 10.162.84.108 10.162.84.109 10.162.84.110 10.162.84.111 10.162.84.112 10.162.84.113 10.162.84.114 10.162.84.115 10.162.84.116 10.162.84.117 10.162.84.118 10.162.84.119 10.162.84.120 10.162.84.121 10.162.84.122 10.162.84.123 10.162.84.124 10.162.84.125 10.162.84.126 10.162.84.127 10.162.84.128 10.162.84.129 10.162.84.130 10.162.84.131 10.162.84.132 10.162.84.133 10.162.84.134 10.162.84.135 10.162.84.136 10.162.84.137 10.162.84.138 10.162.84.139 10.162.84.140 10.162.84.141 10.162.84.142 10.162.84.143 10.162.84.144 10.162.84.145 10.162.84.146 10.162.84.147 10.162.84.148 10.162.84.149 10.162.84.150 10.162.84.151 10.162.84.152 10.162.84.153 10.162.84.154 10.162.84.155 10.162.84.156 10.162.84.157 10.162.84.158 10.162.84.159 10.162.84.160 10.162.84.161 10.162.84.162 10.162.84.162 10.162.84.164 10.162.84.165 10.162.84.166 10.162.84.167 10.162.84.168 10.162.84.169 10.162.84.170 10.162.84.171 10.162.84.172 10.162.84.173 10.162.84.174 10.162.84.175 10.162.84.176 10.162.84.177 10.162.84.178 10.162.84.179 10.162.84.180 10.162.84.181 10.162.84.182 10.162.84.183 10.162.84.184 10.162.84.185 10.162.84.186 10.162.84.187 10.162.84.188 10.162.84.189 10.162.84.190 10.162.84.191 10.162.84.192 10.162.84.193 10.162.84.194 10.162.84.195 10.162.84.196 10.162.84.197 10.162.84.198 10.162.84.199 10.162.84.200 10.162.84.201 10.162.84.202 10.162.84.203 10.162.84.204 10.162.84.205 10.162.84.206 10.162.84.207 10.162.84.208 10.162.84.209 10.162.84.210 10.162.84.211 10.162.84.212 10.162.84.213 10.162.84.214 10.162.84.215 10.162.84.216 10.162.84.217 10.162.84.218 10.162.84.219 10.162.84.220 10.162.84.221 10.162.84.222 10.162.84.223 10.162.84.224 10.162.84.225 10.162.84.226 10.162.84.227 10.162.84.228 10.162.84.229 10.162.84.230 10.162.84.231 10.162.84.232 10.162.84.233 10.162.84.234 10.162.84.235 10.162.84.236 10.162.84.237 10.162.84.238 10.162.84.239 10.162.84.240 10.162.84.241 10.162.84.242 10.162.84.243 10.162.84.244 10.162.84.245 10.162.84.246 10.162.84.247 10.162.84.248 10.162.84.249 10.162.84.250 10.162.84.251 
10.162.84.252 10.162.84.253 10.162.84.254 10.162.84.255 } track_script { chk_nginx_service chk_manual_failover } notify "/usr/libexec/keepalived/nginx-ha-notify" }

[root@esdfwlbp0003 keepalived]#

[root@esdfwlbp0004 keepalived]# cat keepalived.conf
global_defs { vrrp_version 3 }

vrrp_script chk_manual_failover {
    script "/usr/libexec/keepalived/nginx-ha-manual-failover"
    interval 10
    weight 50
}

vrrp_script chk_nginx_service {
    script "/usr/libexec/keepalived/nginx-ha-check"
    interval 3
    weight 50
}

vrrp_instance VI_1 { interface ens192 priority 100 virtual_router_id 51 advert_int 1 accept garp_master_refresh 5 garp_master_refresh_repeat 1 unicast_src_ip 10.162.84.6 unicast_peer { 10.162.84.5 } virtual_ipaddress { 10.162.84.7 10.162.84.8 10.162.84.9 10.162.84.10 10.162.84.11 10.162.84.12 10.162.84.13 10.162.84.14 10.162.84.15 10.162.84.16 10.162.84.17 10.162.84.18 10.162.84.19 10.162.84.20 10.162.84.21 10.162.84.22 10.162.84.23 10.162.84.24 10.162.84.25 10.162.84.26 10.162.84.27 10.162.84.28 10.162.84.29 10.162.84.30 10.162.84.31 10.162.84.32 10.162.84.33 10.162.84.34 10.162.84.35 10.162.84.36 10.162.84.37 10.162.84.38 10.162.84.39 10.162.84.40 10.162.84.41 10.162.84.42 10.162.84.43 10.162.84.44 10.162.84.45 10.162.84.46 10.162.84.47 10.162.84.48 10.162.84.49 10.162.84.50 10.162.84.51 10.162.84.52 10.162.84.53 10.162.84.54 10.162.84.55 10.162.84.56 10.162.84.57 10.162.84.58 10.162.84.59 10.162.84.60 10.162.84.61 10.162.84.62 10.162.84.63 10.162.84.64 10.162.84.65 10.162.84.66 10.162.84.67 10.162.84.68 10.162.84.69 10.162.84.70 10.162.84.71 10.162.84.72 10.162.84.73 10.162.84.74 10.162.84.75 10.162.84.76 10.162.84.77 10.162.84.78 10.162.84.79 10.162.84.80 10.162.84.81 10.162.84.82 10.162.84.83 10.162.84.84 10.162.84.85 10.162.84.86 10.162.84.87 10.162.84.88 10.162.84.89 10.162.84.90 10.162.84.91 10.162.84.92 10.162.84.93 10.162.84.94 10.162.84.95 10.162.84.96 10.162.84.97 10.162.84.98 10.162.84.99 10.162.84.100 10.162.84.101 10.162.84.102 10.162.84.103 10.162.84.104 10.162.84.105 10.162.84.106 10.162.84.107 10.162.84.108 10.162.84.109 10.162.84.110 10.162.84.111 10.162.84.112 10.162.84.113 10.162.84.114 10.162.84.115 10.162.84.116 10.162.84.117 10.162.84.118 10.162.84.119 10.162.84.120 10.162.84.121 10.162.84.122 10.162.84.123 10.162.84.124 10.162.84.125 10.162.84.126 10.162.84.127 10.162.84.128 10.162.84.129 10.162.84.130 10.162.84.131 10.162.84.132 10.162.84.133 10.162.84.134 10.162.84.135 10.162.84.136 10.162.84.137 10.162.84.138 10.162.84.139 10.162.84.140 10.162.84.141 10.162.84.142 10.162.84.143 10.162.84.144 10.162.84.145 10.162.84.146 10.162.84.147 10.162.84.148 10.162.84.149 10.162.84.150 10.162.84.151 10.162.84.152 10.162.84.153 10.162.84.154 10.162.84.155 10.162.84.156 10.162.84.157 10.162.84.158 10.162.84.159 10.162.84.160 10.162.84.161 10.162.84.162 10.162.84.162 10.162.84.164 10.162.84.165 10.162.84.166 10.162.84.167 10.162.84.168 10.162.84.169 10.162.84.170 10.162.84.171 10.162.84.172 10.162.84.173 10.162.84.174 10.162.84.175 10.162.84.176 10.162.84.177 10.162.84.178 10.162.84.179 10.162.84.180 10.162.84.181 10.162.84.182 10.162.84.183 10.162.84.184 10.162.84.185 10.162.84.186 10.162.84.187 10.162.84.188 10.162.84.189 10.162.84.190 10.162.84.191 10.162.84.192 10.162.84.193 10.162.84.194 10.162.84.195 10.162.84.196 10.162.84.197 10.162.84.198 10.162.84.199 10.162.84.200 10.162.84.201 10.162.84.202 10.162.84.203 10.162.84.204 10.162.84.205 10.162.84.206 10.162.84.207 10.162.84.208 10.162.84.209 10.162.84.210 10.162.84.211 10.162.84.212 10.162.84.213 10.162.84.214 10.162.84.215 10.162.84.216 10.162.84.217 10.162.84.218 10.162.84.219 10.162.84.220 10.162.84.221 10.162.84.222 10.162.84.223 10.162.84.224 10.162.84.225 10.162.84.226 10.162.84.227 10.162.84.228 10.162.84.229 10.162.84.230 10.162.84.231 10.162.84.232 10.162.84.233 10.162.84.234 10.162.84.235 10.162.84.236 10.162.84.237 10.162.84.238 10.162.84.239 10.162.84.240 10.162.84.241 10.162.84.242 10.162.84.243 10.162.84.244 10.162.84.245 10.162.84.246 10.162.84.247 10.162.84.248 10.162.84.249 10.162.84.250 10.162.84.251 
10.162.84.252 10.162.84.253 10.162.84.254 10.162.84.255 } track_script { chk_nginx_service chk_manual_failover } notify "/usr/libexec/keepalived/nginx-ha-notify" }

[root@esdfwlbp0004 keepalived]#

Notify and track scripts:

[root@esdfwlbp0003 keepalived]# cat /usr/libexec/keepalived/nginx-ha-check

#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

STATEFILE=/var/run/nginx-ha-keepalived.state

if [ -s "$STATEFILE" ]; then
    . "$STATEFILE"
    case "$STATE" in
        "BACKUP"|"MASTER"|"FAULT")
            service nginx status || service nginx-debug status
            exit $?
            ;;
        *|"")
            logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
            exit 1
            ;;
    esac
fi

service nginx status
exit $?
[root@esdfwlbp0003 keepalived]#

[root@esdfwlbp0003 keepalived]# cat /usr/libexec/keepalived/nginx-ha-notify

#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

umask 022

TYPE=$1 NAME=$2 STATE=$3

STATEFILE=/var/run/nginx-ha-keepalived.state

logger -t nginx-ha-keepalived "Transition to state '$STATE' on VRRP instance '$NAME'."

case $STATE in
    "MASTER")
        service nginx start ||:
        echo "STATE=$STATE" > $STATEFILE
        exit 0
        ;;
    "BACKUP"|"FAULT")
        echo "STATE=$STATE" > $STATEFILE
        exit 0
        ;;
    *)
        logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
        exit 1
        ;;
esac
[root@esdfwlbp0003 keepalived]#

[root@esdfwlbp0004 keepalived]# cat /usr/libexec/keepalived/nginx-ha-notify

#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

umask 022

TYPE=$1 NAME=$2 STATE=$3

STATEFILE=/var/run/nginx-ha-keepalived.state

logger -t nginx-ha-keepalived "Transition to state '$STATE' on VRRP instance '$NAME'."

case $STATE in
    "MASTER")
        service nginx start ||:
        echo "STATE=$STATE" > $STATEFILE
        exit 0
        ;;
    "BACKUP"|"FAULT")
        echo "STATE=$STATE" > $STATEFILE
        exit 0
        ;;
    *)
        logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
        exit 1
        ;;
esac
[root@esdfwlbp0004 keepalived]#

[root@esdfwlbp0004 keepalived]# cat /usr/libexec/keepalived/nginx-ha-check

#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

STATEFILE=/var/run/nginx-ha-keepalived.state

if [ -s "$STATEFILE" ]; then
    . "$STATEFILE"
    case "$STATE" in
        "BACKUP"|"MASTER"|"FAULT")
            service nginx status || service nginx-debug status
            exit $?
            ;;
        *|"")
            logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
            exit 1
            ;;
    esac
fi

service nginx status
exit $?
[root@esdfwlbp0004 keepalived]#

System Log entries
Full keepalived system log entries from when keepalived started are attached; the files are big.

[root@esdfwlbp0003 nginx]# cat keepalived.log-20240923 | grep STATE
Sep 23 10:13:28 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering BACKUP STATE
Sep 23 10:13:31 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering MASTER STATE
Sep 23 12:02:01 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering BACKUP STATE
Sep 23 12:02:06 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering MASTER STATE
Sep 23 12:03:56 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering BACKUP STATE
Sep 23 12:04:20 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering MASTER STATE
[root@esdfwlbp0003 nginx]#

Did keepalived coredump? If so, please provide a stacktrace from the coredump, using gdb. Not sure.

keepalived_log_0004.txt
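(Aside, not from the original report: a hedged sketch of how a backtrace could be pulled from a keepalived core on RHEL 8, if one does exist. Whether systemd-coredump is in use, and the core file path, are assumptions here.)

# if systemd-coredump is enabled, look for and open any keepalived core:
coredumpctl list keepalived
coredumpctl gdb keepalived
# or, with a plain core file (path is only an example):
gdb /usr/sbin/keepalived /path/to/core
# then, inside gdb:
(gdb) thread apply all bt full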

Additional context Add any other context about the problem here. Whenever there is a state change from MASTER to BACKUP or vice-versa we see health check failures even though servers are healthy. Its happening from the nginx VMs (active and passive nodes). I have re-installed nginx-plus and keepalived after upgrade to rhel8 from the rhel8 repo. Just added nginx log to show the health check failures whenever keepalived changes STATE. 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.29:9080 in upstream "187266_refunds.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.29:9084 in upstream "187266_refunds.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9082 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9084 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9089 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9087 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9085 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9082 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9087 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9086 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "141950_tts.techops.aa.com_DFW_Prod_Http_Mon_8080" of peer 10.162.4.104:8080 in upstream "141950_tts.techops.aa.com_DFW_Prod", port 8080 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9083 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9088 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9083 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy 
while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9089 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9085 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9086 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9084 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:04:42 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9088 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:05:11 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.25:9080 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:05:11 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.25:9081 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:05:11 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.23:9080 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 12:05:11 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.23:9081 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:22 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.115:80 in upstream "6180080_Airvaultplusservice-dr.corpaa.aa.com_DFW_Prod", port 8080 2024/09/23 17:19:22 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.116:80 in upstream "6180080_qualificationsstoplight-dr.corpaa.aa.com_DFW_Prod" 2024/09/23 17:19:22 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.113:80 in upstream "2340761_effectivityservice-dr.techops.aa.com_DFW_PROD" 2024/09/23 17:19:22 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.114:80 in upstream "6180080_Qualificationsstoplightws-dr.corpaa.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9086 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9082 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, 
health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9083 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9085 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9084 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9089 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9084 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9087 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9085 in upstream "187266_finance-impl-gw2.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.25:9088 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9083 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "141950_tts.techops.aa.com_DFW_Prod_Http_Mon_8080" of peer 10.162.4.104:8080 in upstream "141950_tts.techops.aa.com_DFW_Prod", port 8080 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9087 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9086 in upstream "187266_finance-impl-process.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9088 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9089 in upstream "187266_finance-impl-persistor.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:23 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.23:9082 in upstream "187266_finance-impl-gw1.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:27 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.25:9080 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:27 [warn] 1522540#1522540: peer is unhealthy while 
connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.23:9081 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:27 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.23:9080 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:27 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod_HTTP_Mon" of peer 10.162.4.25:9081 in upstream "187266_finance-impl-matrix-analyst.dfwd1.aa.com_DFW_Prod" 2024/09/23 17:19:53 [warn] 1522540#1522540: peer is unhealthy while connecting to upstream, health check "http_1_1_New_Resp_Codes" of peer 10.162.4.29:9083 in upstream "187266_activitylogger.dfwd1.aa.com_DFW_Prod" [root@esdfwlbp0004 nginx]#

Kindly help! Thanks! Adi

pqarmitage commented 1 month ago

You appear to have a network problem on esdfwlbp0003.

The basic VRRP priority for VI_1 on esdfwlbp0003 is 101. This is increased by 50 when track script chk_manual_failover is successful, and by a further 50 when chk_nginx_service is successful. This gives a normal VRRP priority of 201, which is reduced to 151 when only one script is successful, and to 101 when neither script is successful. For esdfwlbp0004 the basic VRRP priority is 100, the track scripts both have weight 50, and so the normal priority is 200, which reduces to 150 and 100 respectively.

Looking at the logs for esdfwlbp0004 we can see that there are a number of occasions when esdfwlbp0003 sends adverts with priority 151; in other words, one of its track scripts is failing. Since these are often bursts of 2 to 3 seconds, it would appear that it is the chk_nginx_service script that is failing and is then successful the next time it is run.

Looking at the logs of esdfwlbp0004 merged with the extracts you have provided from the logs of esdfwlbp0003, and picking one example:

Sep 23 12:00:24 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Entering BACKUP STATE
Sep 23 12:00:24 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) removing VIPs.
Sep 23 12:01:58 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding
Sep 23 12:01:59 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding
Sep 23 12:02:00 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding
Sep 23 12:02:01 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Receive advertisement timeout
Sep 23 12:02:01 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Entering MASTER STATE
Sep 23 12:02:01 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering BACKUP STATE
Sep 23 12:02:01 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) setting VIPs.
Sep 23 12:02:06 esdfwlbp0003 Keepalived_vrrp[1829]: (VI_1) Entering MASTER STATE
Sep 23 12:02:06 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Master received advert from 10.162.84.5 with higher priority 201, ours 200
Sep 23 12:02:06 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Entering BACKUP STATE
Sep 23 12:02:06 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) removing VIPs.

we can see:
At 12:00:24 esdfwlbp0004 enters BACKUP state (this is OK).
At 12:01:58 - 12:02:00 esdfwlbp0004 receives adverts with priority 151, which it discards since the priority is lower than its own of 200.
At 12:02:01 esdfwlbp0004 times out, since it has not received a higher priority advert for 3 and a bit advert intervals, and takes over as master. esdfwlbp0003 receives the higher priority advert (200, while its own priority is 151) from esdfwlbp0004 and so enters backup state.
5 seconds later esdfwlbp0003 enters master state (because the track script is successful and its priority is now 201 again); esdfwlbp0004 receives the higher priority advert and reverts to backup state.
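(For a rough sense of the "3 and a bit" figure: this is my calculation, assuming the VRRPv3 timers from RFC 5798 that apply with vrrp_version 3, with advert_int 1 and a local priority of 200 on the backup.)

master_down_interval = 3 * advert_int + skew_time
skew_time            = ((256 - priority) / 256) * advert_int = (56 / 256) * 1s ≈ 0.22s
master_down_interval ≈ 3.22s

So the backup declares the master down just over three seconds after the last advert it accepted, which matches the 12:01:58 to 12:02:01 gap above.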

keepalived is functioning as it should, and is reacting to the responses from the track scripts. You will need to identify why your track scripts are failing occasionally and resolve that problem, since that is the cause of your issue.

BTW what is the netmask of the 10.162.84.5 subnet? It could be helpful if you can post the output of ip addr show ens192 from both systems.

Adi-AA commented 1 month ago

Hi! Thanks for the details. I am not sure where can we the track script is failing. What is the use of track script and can we see if its failing anywhere in the logs? Is the track script good on the 0004? How can you tell? Here are the outputs you have requested. the netmask is /22. For now I have disabled keepalived on 0004 so that the 0003 can be stable. If it stays stable what does it mean? Since it is not receiving adverts from 0004 thats why it is good? [root@esdfwlbp0003 nginx]# ip addr show ens192 2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 00:50:56:a2:3f:0d brd ff:ff:ff:ff:ff:ff altname enp11s0 inet 10.162.84.5/22 brd 10.162.87.255 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.7/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.8/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.9/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.10/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.11/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.12/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.13/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.14/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.15/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.16/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.17/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.18/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.19/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.20/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.21/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.22/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.23/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.24/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.25/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.26/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.27/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.28/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.29/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.30/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.31/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.32/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.33/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.34/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.35/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.36/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.37/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.38/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.39/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.40/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.41/32 scope 
global ens192 valid_lft forever preferred_lft forever inet 10.162.84.42/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.43/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.44/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.45/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.46/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.47/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.48/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.49/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.50/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.51/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.52/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.53/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.54/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.55/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.56/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.57/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.58/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.59/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.60/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.61/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.62/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.63/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.64/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.65/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.66/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.67/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.68/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.69/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.70/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.71/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.72/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.73/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.74/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.75/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.76/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.77/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.78/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.79/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.80/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.81/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.82/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.83/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.84/32 scope global ens192 valid_lft forever preferred_lft forever inet 
10.162.84.85/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.86/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.87/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.88/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.89/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.90/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.91/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.92/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.93/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.94/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.95/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.96/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.97/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.98/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.99/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.100/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.101/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.102/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.103/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.104/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.105/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.106/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.107/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.108/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.109/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.110/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.111/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.112/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.113/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.114/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.115/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.116/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.117/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.118/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.119/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.120/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.121/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.122/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.123/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.124/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.125/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.126/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.127/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.128/32 scope global ens192 
valid_lft forever preferred_lft forever inet 10.162.84.129/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.130/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.131/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.132/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.133/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.134/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.135/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.136/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.137/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.138/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.139/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.140/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.141/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.142/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.143/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.144/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.145/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.146/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.147/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.148/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.149/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.150/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.151/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.152/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.153/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.154/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.155/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.156/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.157/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.158/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.159/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.160/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.161/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.162/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.164/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.165/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.166/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.167/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.168/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.169/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.170/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.171/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.172/32 scope global ens192 valid_lft forever 
preferred_lft forever inet 10.162.84.173/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.174/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.175/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.176/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.177/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.178/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.179/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.180/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.181/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.182/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.183/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.184/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.185/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.186/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.187/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.188/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.189/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.190/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.191/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.192/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.193/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.194/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.195/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.196/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.197/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.198/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.199/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.200/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.201/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.202/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.203/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.204/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.205/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.206/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.207/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.208/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.209/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.210/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.211/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.212/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.213/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.214/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.215/32 scope global ens192 valid_lft forever preferred_lft forever inet 
10.162.84.216/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.217/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.218/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.219/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.220/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.221/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.222/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.223/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.224/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.225/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.226/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.227/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.228/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.229/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.230/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.231/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.232/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.233/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.234/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.235/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.236/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.237/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.238/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.239/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.240/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.241/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.242/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.243/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.244/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.245/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.246/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.247/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.248/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.249/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.250/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.251/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.252/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.253/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.254/32 scope global ens192 valid_lft forever preferred_lft forever inet 10.162.84.255/32 scope global ens192 valid_lft forever preferred_lft forever [root@esdfwlbp0003 nginx]# AND [root@esdfwlbp0004 nginx]# ip addr show ens192 2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 00:50:56:a2:02:93 brd ff:ff:ff:ff:ff:ff altname enp11s0 inet 10.162.84.6/22 brd 10.162.87.255 
scope global noprefixroute ens192 valid_lft forever preferred_lft forever [root@esdfwlbp0004 nginx]#

Adi-AA commented 1 month ago

It is happening on all VMs that are on RHEL8, and I see this in the message logs. Do you know what this log means? Any suggestion is valuable as we are in a very unstable state. For now I have stopped keepalived on 0004 so that it will not advertise. Do you think there is a communication problem between 0003 and 0004? What causes the nginx-ha-check to be unsuccessful occasionally? The issue clears itself automatically after 20-30 mins, which is strange. And why is it not failing on 0004?

[root@esdfwlbp0003 log]# cat messages | grep nginx-ha-check
Sep 22 21:51:09 esdfwlbp0003 kernel: [3388545] 0 3388545 56999 125 81920 0 0 nginx-ha-check
Sep 23 12:04:28 esdfwlbp0003 kernel: [ 162346] 0 162346 56999 113 94208 19 0 nginx-ha-check
Sep 24 07:57:11 esdfwlbp0003 kernel: [1686640] 0 1686640 56999 194 86016 0 0 nginx-ha-check
[root@esdfwlbp0003 log]#

#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

STATEFILE=/var/run/nginx-ha-keepalived.state

if [ -s "$STATEFILE" ]; then
    . "$STATEFILE"
    case "$STATE" in
        "BACKUP"|"MASTER"|"FAULT")
            service nginx status || service nginx-debug status
            exit $?
            ;;
        *|"")
            logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
            exit 1
            ;;
    esac
fi

service nginx status
exit $?

pqarmitage commented 1 month ago

You should get log messages on 0003 when the track scripts change state. If you are not getting similar messages on 0004 then the scripts are not failing there.
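(A hedged suggestion, not part of the original comment: one way to hunt for those messages is to grep the syslog for the script names and for priority changes; the exact wording of keepalived's messages varies between versions, so the pattern below is deliberately broad.)

grep -iE 'chk_nginx_service|chk_manual_failover|priority' /var/log/messages | less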

If 0003 remains stable while 0004 is down it doesn't mean anything. If the track scripts on 0003 fail the VRRP instance will reduce its priority to 151 or 101, but since there isn't another VRRP instance running, there will not be a higher priority VRRP instance to take over as master.

You could add some debug output to your nginx-ha-check track script:

#!/bin/sh

LOGFILE=/tmp/nginx-ha-check.log

PATH=/bin:/sbin:/usr/bin:/usr/sbin

STATEFILE=/var/run/nginx-ha-keepalived.state
echo -n $(date) STATE=$(cat $STATEFILE) " " >>$LOGFILE

if [ -s "$STATEFILE" ]; then
    . "$STATEFILE"
    case "$STATE" in
        "BACKUP"|"MASTER"|"FAULT")
            # was: service nginx status || service nginx-debug status
            service nginx status
            STATUS=$?
            if [ $STATUS -eq 0 ]; then
                echo nginx status 0 >>$LOGFILE
                exit 0
            else
                service nginx-debug status
                STATUS=$?
                echo nginx-debug status $STATUS >>$LOGFILE
                exit $STATUS
            fi
            ;;
        *|"")
            logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
            echo "Unknown state: $STATE" >>$LOGFILE
            exit 1
            ;;
    esac
fi

service nginx status
STATUS=$?
echo no statefile status $STATUS >>$LOGFILE
exit $STATUS

The log file /tmp/nginx-ha-check.log might then give you some idea about what is happening. However, you should have log entries from the nginx-ha-notify script that tell you what is happening.
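(A small usage note of my own: with that script in place, watching its log alongside keepalived's own messages should show whether the check fails at the moments the state flaps, e.g.)

tail -f /tmp/nginx-ha-check.log
journalctl -u keepalived -f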

Adi-AA commented 1 month ago

Ok. I have enabled the script. Thanks for the script. If there is a communication problem between the peers, wouldn't that show somewhere in the message logs or network logs? What kind of messages does VRRP send to the peer? Is it the "Sending/queueing gratuitous ARPs" messages? If there is a communication problem, won't we see a drop in one of these ARP messages? I am curious how the peers exchange messages, because the track scripts only check the status on each of their own devices, like service nginx status. So what will be determined based on that? If the peers see "Receive advertisement timeout" that means somewhere they are being interrupted. Since they are on the same virtual network I am not sure where to check. Any clues? Thanks a lot!

Adi-AA commented 1 month ago

If I do a tail on the keepalived log I see bursts of "Sending gratuitous ARP on..". If they are not sent every second, how will the peer know if the other peer is not able to communicate? I am trying to understand how keepalived works and how failover happens, basically. And I don't see these "Sending gratuitous ARP on" messages on the standby, is that correct? So in this case the standby is not receiving the adverts from the active node, and that's why we see timeouts in the standby logs and it becomes MASTER, correct? I also see advert timeouts in the active VM logs too. So adverts are sent by both peers? Then only when the BACKUP doesn't receive adverts does it become MASTER, and when the now-BACKUP VM doesn't get adverts in time it will become MASTER again? So, should our focus be on why we are seeing advert timeouts rather than on the nginx-ha-check?
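(An aside of my own, since the question is about whether adverts are arriving: with unicast VRRP you can watch the advert packets directly by capturing IP protocol 112 on both hosts; the interface and addresses below are the ones from this thread.)

tcpdump -i ens192 -nn 'ip proto 112 and host 10.162.84.5 and host 10.162.84.6'

With advert_int 1 the current master should emit roughly one packet per second; a gap of more than about three seconds seen on the backup lines up with a "Receive advertisement timeout".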

Adi-AA commented 1 month ago

And what could be the reason for not sending the messages "Sending gratuitous ARP .." for more than 30 mins from the MASTER? I am seeing this in one of the other pairs in lab. The keepalived service is running but dont see any messages. [root@eshdqlbt0001 nginx]# systemctl status keepalived ● keepalived.service - LVS and VRRP High Availability Monitor Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled) Active: active (running) since Wed 2024-09-25 17:19:08 CDT; 37min ago Process: 829278 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS) Main PID: 829279 (keepalived) Tasks: 2 (limit: 100450) Memory: 14.6M CGroup: /system.slice/keepalived.service ├─829279 /usr/sbin/keepalived -D └─829280 /usr/sbin/keepalived -D

Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.185 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.185 [root@eshdqlbt0001 nginx]# clock 2024-09-25 17:56:49.166197-05:00 [root@eshdqlbt0001 nginx]# cd /etc/keepalived/ [root@eshdqlbt0001 keepalived]# ll total 56 -rw-r--r-- 1 root root 26778 Sep 11 17:39 keepalived.conf -rw------- 1 root root 26787 Jun 13 2022 keepalived.conf.bak [root@eshdqlbt0001 keepalived]# cat keepalived.conf | grep router virtual_router_id 51 virtual_router_id 52 virtual_router_id 53 virtual_router_id 54 [root@eshdqlbt0001 keepalived]# ps -ef | grep keepalived root 829279 1 0 17:19 ? 00:00:00 /usr/sbin/keepalived -D root 829280 829279 0 17:19 ? 00:00:00 /usr/sbin/keepalived -D root 872701 821591 0 18:02 pts/1 00:00:00 grep --color=auto keepalived [root@eshdqlbt0001 keepalived]# systemctl status keepalived ● keepalived.service - LVS and VRRP High Availability Monitor Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled) Active: active (running) since Wed 2024-09-25 17:19:08 CDT; 43min ago Process: 829278 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS) Main PID: 829279 (keepalived) Tasks: 2 (limit: 100450) Memory: 14.2M CGroup: /system.slice/keepalived.service ├─829279 /usr/sbin/keepalived -D └─829280 /usr/sbin/keepalived -D

Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.185 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.185 [root@eshdqlbt0001 keepalived]# systemctl status keepalived ● keepalived.service - LVS and VRRP High Availability Monitor Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled) Active: active (running) since Wed 2024-09-25 17:19:08 CDT; 45min ago Main PID: 829279 (keepalived) Tasks: 3 (limit: 100450) Memory: 14.2M CGroup: /system.slice/keepalived.service ├─829279 /usr/sbin/keepalived -D └─829280 /usr/sbin/keepalived -D

Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.174 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.175 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.180 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.183 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: (VI_2) Sending/queueing gratuitous ARPs on ens161 for 172.18.83.185 Sep 25 17:19:23 eshdqlbt0001 Keepalived_vrrp[829280]: Sending gratuitous ARP on ens161 for 172.18.83.185 [root@eshdqlbt0001 keepalived]#

pqarmitage commented 1 month ago

Ignore the gratuitous ARP messages, they are not related to the cause of the issue, but are a consequence of the issue.

The problem is caused by one of the track scripts on 0003 (I have guessed the script is chk_nginx_service) returning a non-zero exit code. We are unable to help any further than that; you need to determine why the script on 0003 is returning that exit code.

Adi-AA commented 1 month ago

Ok. Can you explain to me how chk_nginx_service works? I mean, how does the status of the scripts determine the communication between the peers? The advert timeouts are something that needs to be considered, right? And if the adverts time out, that means the peers are not communicating well enough every 3 seconds, is that right? Help me understand how keepalived works here, please. Have you seen this type of problem before? Any other diagnostics, captures or tools you can recommend to learn more about why these advert timeouts happen would be greatly helpful. If we see "Receive advertisement timeout" frequently, where do you think we should focus?
Sep 25 08:59:31 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_1) Receive advertisement timeout
Sep 25 08:59:31 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_2) Receive advertisement timeout
Sep 25 08:59:31 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_3) Receive advertisement timeout
Sep 25 08:59:31 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_4) Receive advertisement timeout
Sep 25 21:57:38 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_1) Receive advertisement timeout
Sep 25 21:57:38 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_3) Receive advertisement timeout
Sep 25 21:57:38 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_4) Receive advertisement timeout
Sep 25 21:57:38 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_2) Receive advertisement timeout
Sep 25 22:00:38 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_1) Receive advertisement timeout
Sep 26 08:36:12 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_2) Receive advertisement timeout
Sep 26 08:36:12 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_4) Receive advertisement timeout
Sep 26 08:36:12 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_3) Receive advertisement timeout
Sep 26 08:36:12 eshdqlbt0002 Keepalived_vrrp[1509]: (VI_1) Receive advertisement timeout

pqarmitage commented 1 month ago

The following block defines the chk_nginx_service:

vrrp_script chk_nginx_service {
script "/usr/libexec/keepalived/nginx-ha-check"
interval 3
weight 50
}

This defines the script to run, that it is run every 3 seconds, and if the script returns success (i.e. exit code 0) 50 is added to the VRRP priority of any VRRP instance that has the script configured.

If both the chk_nginx_service and the chk_manual_failover scripts return success, the VRRP priority of VI_1 will be 201 (i.e. 101 + 50 + 50). If chk_nginx_service returns failure (i.e. a non-zero exit code) the VRRP priority will reduce to 151.

The priority is included in the sent VRRP adverts.

When chk_nginx_service is returning failure on 0003, 0003 starts sending adverts with priority 151, which is lower than the priority on 0004 (200). 0004 logs that it has received a lower priority advert and that it is discarding it. After a little over 3 seconds, 0004 has not received any advert that it has not discarded, so the timer expires and it becomes master. In this circumstance the Receive advertisement timeout is quite normal and to be expected.
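To make the arithmetic concrete, a minimal sketch of how this fits together (the script names match the ones discussed in this issue, but the manual-failover script path, the interface name and the elided parts of the instance are placeholders):

vrrp_script chk_nginx_service {
    script "/usr/libexec/keepalived/nginx-ha-check"
    interval 3
    weight 50
}

vrrp_script chk_manual_failover {
    script "/usr/libexec/keepalived/nginx-ha-manual-failover"   # placeholder path
    interval 10
    weight 50
}

vrrp_instance VI_1 {
    interface ens192
    priority 101                 # base priority
    advert_int 1
    track_script {
        chk_nginx_service        # +50 while the script succeeds
        chk_manual_failover      # +50 while the script succeeds
    }
    ...
}

With both scripts succeeding the advertised priority is 101 + 50 + 50 = 201; if chk_nginx_service fails it drops to 151, and if both fail it drops back to 101.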

What is strange for example is the log entries

Sep 23 11:50:58 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding
Sep 23 11:50:59 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding

after which 0003 must increase its priority to 201 again, since 0004 doesn't take over as master.

Also:

Sep 23 10:13:27 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) received lower priority (151) advert from 10.162.84.5 - discarding
Sep 23 10:13:28 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Receive advertisement timeout
Sep 23 10:13:28 esdfwlbp0004 Keepalived_vrrp[496246]: (VI_1) Entering MASTER STATE

where only one lower priority advert is received before 0004 times out. This suggests that there are some missed adverts from 0003.

Adi-AA commented 1 month ago

So, the Receive advertisement timeout should be 3 in a row, is that correct? And each virtual interface will send an advert, right? If we have only one interface then we will see 3 adverts. If I lower the interval to one second will it help? So that the adverts wont be missed? What does it mean exactly - When chk_nginx_service is returning failure on 0003. Is the script unable to check the status at that time? OR there is no status returned? Can you provide any script to log why the nginx-ha-check fails? Thu Sep 26 01:45:24 CDT 2024 STATE=STATE=MASTER Thu Sep 26 01:45:27 CDT 2024 STATE=STATE=MASTER Thu Sep 26 01:45:30 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:45:33 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:46:54 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:46:57 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:47:34 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:47:40 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 01:47:43 CDT 2024 STATE=STATE=MASTER nginx status 0 Thu Sep 26 01:47:46 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:51:01 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 01:51:04 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 05:22:55 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:09:04 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:09:08 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:09:11 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:09:14 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:09:17 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:09:21 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:09:39 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:09:58 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:10:01 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:10:04 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:10:07 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:10:10 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:10:14 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:10:37 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:10:41 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:10:45 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:10:48 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:10:51 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:10:57 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:13:22 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:13:25 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:13:28 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:13:31 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:13:34 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:21:22 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:21:26 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:21:29 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:21:32 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:21:36 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:21:52 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:22:01 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:22:04 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:22:07 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:22:10 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:22:13 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:22:16 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:22:48 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:22:51 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:22:54 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:22:57 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:00 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:03 CDT 2024 
STATE=STATE=BACKUP Thu Sep 26 06:23:06 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:09 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:12 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:15 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:18 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:21 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:23:24 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:23:27 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:23:30 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:24:11 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:24:14 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:24:17 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:24:21 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:24:28 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:24:31 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:24:34 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:24:37 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:25:34 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:25:37 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:25:40 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:25:43 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:25:46 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:26:34 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:26:37 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:26:40 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:26:43 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:26:46 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:26:49 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:26:52 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:26:55 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:27:55 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:27:58 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:28:01 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:28:04 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:28:08 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 06:28:11 CDT 2024 STATE=STATE=MASTER nginx status 0 Thu Sep 26 06:29:14 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:29:17 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:29:20 CDT 2024 STATE=STATE=MASTER Thu Sep 26 06:29:23 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 06:29:26 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:22:31 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:31:45 CDT 2024 STATE=STATE=MASTER Thu Sep 26 10:31:48 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 10:31:51 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:31:54 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:43:07 CDT 2024 STATE=STATE=MASTER Thu Sep 26 10:43:10 CDT 2024 STATE=STATE=MASTER Thu Sep 26 10:43:13 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:43:16 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:43:19 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:43:46 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:43:50 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 10:43:53 CDT 2024 STATE=STATE=MASTER nginx status 0 Thu Sep 26 10:44:46 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:45:06 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:45:09 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:45:30 CDT 2024 STATE=STATE=MASTER Thu Sep 26 10:45:33 CDT 2024 STATE=STATE=BACKUP nginx status 0 Thu Sep 26 10:45:36 CDT 2024 STATE=STATE=BACKUP nginx status 0 [root@estullbs0001 nginx]#

This is the script: [root@estullbs0001 keepalived]# cat nginx-ha-check

#!/bin/sh

LOGFILE=/opt/log/nginx/nginx-ha-check.log

PATH=/bin:/sbin:/usr/bin:/usr/sbin

STATEFILE=/var/run/nginx-ha-keepalived.state
echo -n $(date) STATE=$(cat $STATEFILE) " " >>$LOGFILE

if [ -s "$STATEFILE" ]; then
    . "$STATEFILE"
    case "$STATE" in
        "BACKUP"|"MASTER"|"FAULT")
            service nginx status || service nginx-debug status
            service nginx status
            STATUS=$?
            if [[ $STATUS ]]; then
                echo nginx status 0 >>$LOGFILE
                exit 0
            else
                nginx-debug status
                STATUS=$?
                echo nginx-debug status $STATUS >>$LOGFILE
                exit $STATUS
            fi
            exit $?
            ;;
        *|"")
            logger -t nginx-ha-keepalived "Unknown state: '$STATE'"
            echo "Unknown state: $STATE" >>$LOGFILE
            exit 1
            ;;
    esac
fi

service nginx status
STATUS=$?
echo no statefile status $STATUS >>$LOGFILE
exit $STATUS

Adi-AA commented 1 month ago

And do you think we should explore something similar in this thread - https://groups.io/g/keepalived-users/topic/master_instance_briefly_stops/75373214 . But we need data to prove that it is the write operations that are the cause of the missed adverts.

Adi-AA commented 1 month ago

Is there any diagnostic message that we can add to vrrp track script when it times out? Can we increase the timeout, currently I dont see a timeout for the script. On other pair of VMs we do see the below messages, and sometimes the forcing election message too. Received advert from 10.130.76.6 with lower priority 150, ours 151, forcing new election

Sep 25 21:57:30 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_nginx_service is already running, expect idle - skipping run Sep 25 22:18:26 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 25 22:18:26 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 25 22:18:26 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 26 08:36:04 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_nginx_service is already running, expect idle - skipping run Sep 26 08:41:00 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 26 08:42:23 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 26 16:17:40 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_nginx_service is already running, expect idle - skipping run Sep 26 16:19:11 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 26 16:19:11 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run Sep 26 16:19:44 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_nginx_service is already running, expect idle - skipping run Sep 26 16:19:44 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_manual_failover is already running, expect idle - skipping run

pqarmitage commented 1 month ago

So, the Receive advertisement timeout should be 3 in a row, is that correct? And each virtual interface will send an advert, right? The receive advert timeout is (3 + (256 - priority)/256) * advert interval.
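As a worked example with a 1 second advert interval: a backup with priority 200 times out after (3 + (256 - 200)/256) * 1 ≈ 3.22 seconds without a usable advert, and one with priority 151 after (3 + (256 - 151)/256) * 1 ≈ 3.41 seconds, i.e. roughly three missed adverts.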

If we have only one interface then we will see 3 adverts. I am not clear what you mean by this.

If I lower the interval to one second will it help? So that the adverts won't be missed? The advert interval is already configured to be 1 second. Reducing the advert interval will cause the receive timeout to occur sooner (see the first answer above).

What does it mean exactly - When chk_nginx_service is returning failure on 0003. The script /usr/libexec/keepalived/nginx-ha-check exits with a non-zero exit code on esdfwlbp0003

Is the script unable to check the status at that time? OR there is no status returned? I don't know what your scripts can or cannot do. It is not possible for a script to exit without an exit code/status.

Can you provide any script to log why the nginx-ha-check fails? No, we can't do this. You need to work out what your script is doing.

It would appear from the log entries you provide that /var/run/nginx-ha-keepalived.state contains STATE=BACKUP or STATE=MASTER, whereas the script expects just BACKUP or MASTER.

The line service nginx status || service nginx-debug status in the modified script should be commented out.

Something strange is happening causing lines like Thu Sep 26 10:31:45 CDT 2024 STATE=STATE=MASTER Thu Sep 26 10:31:48 CDT 2024 STATE=STATE=BACKUP Thu Sep 26 10:31:51 CDT 2024 STATE=STATE=BACKUP nginx status 0. So far as I can see the script should always write a newline to the logfile before exiting. This is perhaps caused by the same problem as the Sep 25 21:57:30 eshdqlbt0001 Keepalived_vrrp[829280]: Track script chk_nginx_service is already running, expect idle - skipping run log entries. It would appear that sometimes the script is taking more than 3 seconds before it exits.

I don't think https://groups.io/g/keepalived-users/topic/master_instance_briefly_stops/75373214 is relevant.

Is there any diagnostic message that we can add to vrrp track script when it times out? You could write the date/time to the log file when the track script exits.
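If it helps, a minimal sketch of that idea (it assumes the same /tmp/nginx-ha-check.log path as the earlier debug script, and for brevity it only times the nginx status check rather than reproducing the full STATE handling):

#!/bin/sh

LOGFILE=/tmp/nginx-ha-check.log
PATH=/bin:/sbin:/usr/bin:/usr/sbin
STATEFILE=/var/run/nginx-ha-keepalived.state

START=$(date +%s)
echo "$(date) start, statefile: $(cat $STATEFILE 2>/dev/null)" >>$LOGFILE

service nginx status
STATUS=$?

echo "$(date) end, nginx status $STATUS, took $(( $(date +%s) - START ))s" >>$LOGFILE
exit $STATUS

A run that takes longer than the 3 second interval will then show up directly in the log.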

Can we increase the timeout, currently I dont see a timeout for the script. There is a configuration option timeout for the vrrp_script block.
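For example (a sketch; the 2 second value is only illustrative, not a recommendation):

vrrp_script chk_nginx_service {
    script "/usr/libexec/keepalived/nginx-ha-check"
    interval 3
    timeout 2       # consider the run failed if the script has not exited within 2 seconds
    weight 50
}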

Adi-AA commented 1 month ago

Great! Thanks for your inputs/answers. What is the default timeout for the peers to wait before they change the STATE?

The nginx-ha-check script checks only the status of nginx, yet even though the status is good the script is still failing. Does that mean the script has some problem while executing, or a problem returning its exit code, or, as you said, it may be taking more than 3 seconds to return an exit code, which may be why the STATE changes. Is that correct?

I have 3 interfaces in the keepalived config; does that mean adverts (VRRP messages) are sent on all 3 interfaces, and sometimes one of them can't respond within 3 seconds (I don't know how long each peer will wait before it assumes the peer is dead), and that is the reason we see "receive advert timeouts"? I am asking based on these messages in the logs.

Sep 26 10:44:46 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 26 10:44:48 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Master received advert from 10.76.195.101 with higher priority 201, ours 200 Sep 26 10:44:48 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 Sep 26 10:44:49 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:14 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Entering MASTER STATE Sep 27 09:12:14 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:14 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Entering MASTER STATE Sep 27 09:12:17 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 Sep 27 09:12:18 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:12:21 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Entering MASTER STATE Sep 27 09:12:21 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:23 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Master received advert from 10.76.195.101 with higher priority 201, ours 200 Sep 27 09:12:23 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 Sep 27 09:12:25 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:12:26 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Entering MASTER STATE Sep 27 09:12:27 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Entering MASTER STATE Sep 27 09:12:28 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:28 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:12:31 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Master received advert from 10.76.195.101 with higher priority 201, ours 200 Sep 27 09:12:31 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 Sep 27 09:12:31 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:33 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:12:34 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Entering MASTER STATE Sep 27 09:12:35 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Master received advert from 10.76.195.101 with higher priority 201, ours 200 Sep 27 09:12:40 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Entering MASTER STATE Sep 27 09:12:40 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Entering MASTER STATE Sep 27 09:12:40 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Entering MASTER STATE Sep 27 09:12:49 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 Sep 27 09:12:51 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:12:52 estullbs0002 Keepalived_vrrp[1736]: (VI_2) Entering MASTER STATE Sep 27 09:15:39 estullbs0002 Keepalived_vrrp[1736]: (VI_1) Master received advert from 10.232.148.5 with higher priority 201, ours 200 Sep 27 09:15:39 estullbs0002 Keepalived_vrrp[1736]: (VI_3) Master received advert from 10.76.195.101 with higher priority 201, ours 200 Sep 27 09:15:39 estullbs0002 
Keepalived_vrrp[1736]: (VI_2) Master received advert from 10.84.10.8 with higher priority 201, ours 200 [root@estullbs0002 nginx]#

Adi-AA commented 1 month ago

Hi! I see some logs are missing from the file (after Oct 1 17:04:07 it jumps to Oct 1 17:15:09), and sometimes we don't see the VI_1 name in the logs. Can I know what the reason could be?

Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan1211 for 10.84.8.32 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: (VI_2) Sending/queueing gratuitous ARPs on vlan1211 for 10.84.8.33 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan1211 for 10.84.8.33 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: (VI_2) Sending/queueing gratuitous ARPs on vlan1211 for 10.84.8.34 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan1211 for 10.84.8.34 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: (VI_2) Sending/queueing gratuitous ARPs on vlan1211 for 10.84.8.47 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan1211 for 10.84.8.47 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: (VI_2) Sending/queueing gratuitous ARPs on vlan1211 for 10.84.8.51 Oct 1 17:04:07 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan1211 for 10.84.8.51 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on Prod for 10.232.148.13 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_1) Sending/queueing gratuitous ARPs on Prod for 10.232.148.14 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on Prod for 10.232.148.14 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_1) Sending/queueing gratuitous ARPs on Prod for 10.232.148.15 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on Prod for 10.232.148.15 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_1) Sending/queueing gratuitous ARPs on Prod for 10.232.148.16 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on Prod for 10.232.148.16 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_1) Sending/queueing gratuitous ARPs on Prod for 10.232.148.17 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on Prod for 10.232.148.17

And I see the below log with "Thread timer expired some 231 sec ago", what does it mean?

Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: A thread timer expired 231.646664 seconds ago Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Received advert from 10.76.195.102 with lower priority 200, ours 201 , forcing new election Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60

And in some log lines I don't see the instance name; why is that? For example: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.48 vs. Sending gratuitous ARP on vlan100 for 10.76.195.48

Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: A thread timer expired 231.646664 seconds ago Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Received advert from 10.76.195.102 with lower priority 200, ours 201 , forcing new election Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 
17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Received advert from 10.76.195.102 with lower priority 200, ours 201 , forcing new election Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.60 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.62 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.69 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.70 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.75 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: (VI_3) Sending/queueing gratuitous ARPs on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.87 Oct 1 17:15:09 estullbs0001 Keepalived_vrrp[2015]: Sending gratuitous ARP on vlan100 for 10.76.195.48

Adi-AA commented 1 month ago

@pqarmitage - Hi! Instead of having the nginx-ha-check script ask systemd, why not run a command like “pgrep -lf nginx” instead of checking the service status? With that, we would be checking the running processes rather than systemd.

Is there any downside to checking this way for failover?

And do you suspect the advertisement timeouts (Receive advertisement timeout) were a result of this script failing to return a zero exit code, or could it be anything else? We have identified some journald services frequently starting and stopping during the HA state changes, and we didn't see a zero exit code returned in the logs during that time; we are investigating.

pqarmitage commented 1 month ago

If you want to check if a process is running, then the best option is to use track_process, which is designed for exactly that purpose.
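A sketch of what that could look like (the block and tracker names are illustrative; see the vrrp_track_process section of the keepalived.conf(5) man page for the full set of options):

vrrp_track_process track_nginx {
    process nginx       # match processes named nginx
    weight 50
}

vrrp_instance VI_1 {
    ...
    track_process {
        track_nginx     # adds 50 to the priority while nginx is running
    }
}

This avoids forking a shell script every few seconds and reacts to the process terminating rather than polling its status.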

Adi-AA commented 1 month ago

Hi @pqarmitage - We have resolved the issue ourselves. Thanks very much for your inputs in understanding how keepalived works.

We want to know if we can use a virtual MAC address when the device fails over, like routers do with VRRP, so that the standby device doesn't need to announce its own MAC address in an ARP broadcast. The reason is that every time we fail over the system doesn't process traffic for 20-30 minutes (intermittently) and then starts processing; we think that when device B becomes master it has to update its MAC address on the connected switches. Anytime there is a state change, VRRP does its job of broadcasting ARPs so that its neighbours know it owns the list of IPs.

pqarmitage commented 1 month ago

@Adi-AA It would be helpful if you could update this issue with the cause of the problem and what you did to resolve it, so that others who have similar problems might be able to learn from your experience.

You should be able to use a vmac in order to have a virtual-mac-address.

Adi-AA commented 1 month ago

Thanks @pqarmitage - We have seen high memory usage of 90% on the VMs that were having problems. The high memory usage is due to an API security app/sensor that takes a copy of the incoming and outgoing traffic and analyzes it for anomalies. Because of the memory consumed by that one app, the VMs reached 99% memory usage and stalled for a few minutes, resulting in the failover and failback during that time until memory usage returned to normal. We have disabled that app for now and are waiting for an update from the app team. Can you post an example of a keepalived config with use_vmac? Is there any advantage or disadvantage to using use_vmac? Thanks! Adi

pqarmitage commented 1 month ago

To use a VMAC, change the configuration from:

vrrp_instance VI_1 {
interface ens192
priority 101
virtual_router_id 51
advert_int 1
accept
garp_master_refresh 5
garp_master_refresh_repeat 1
unicast_src_ip 10.162.84.5
unicast_peer {
10.162.84.6
}
virtual_ipaddress {
10.162.84.7
.
.
.

to

vrrp_instance VI_1 {
interface ens192
use_vmac
priority 101
virtual_router_id 51
advert_int 1
accept
garp_master_refresh 5
garp_master_refresh_repeat 1
unicast_src_ip 10.162.84.5
unicast_peer {
10.162.84.6
}
virtual_ipaddress {
10.162.84.7
.
.
.

(i.e. add the line use_vmac)

The advantages of using a VMAC are:

  1. It conforms to the RFCs for VRRP
  2. The MAC address associated with the virtual_ipaddresses does not change on switchover from backup to master.

Since the MAC address for the VIPs doesn't change, ARP caches do not need to be updated, but any switches (referred to as "learning bridges" in the RFCs) need to know the new path to the VMAC MAC address. The VRRP RFCs state that an ARP broadcast must be sent for each VIP.

Adi-AA commented 1 month ago

Awesome! So, should the VM generate a virtual_mac_address to start with, or will it be generated automatically? "The VRRP RFCs state that an ARP broadcast must be sent for each VIP". Does this happen every time a failover occurs?

pqarmitage commented 1 month ago

keepalived has a complete implementation for using VMACs (unlike some other Linux VRRP implementations). It will create the VMAC interface, configure the correct MAC address, set the VMAC interface to the right mode, bring the interface up, etc. So all you need to do to change from not using a VMAC to using a VMAC is add the use_vmac keyword; keepalived will do the rest for you.

Whenever a VRRP instance transitions from backup to master state, it will send ARP broadcasts for each VIP (including any virtual_ipaddress_excluded). By default keepalived will send 5 GARP messages for each VIP, and then 5 seconds later send a second set of 5 for each VIP; this is probably overkill nowadays with modern switches. The parameters that control the sending of garp messages are the global:

vrrp_garp_master_repeat
vrrp_garp_master_delay
vrrp_garp_lower_prio_repeat
vrrp_garp_lower_prio_delay
vrrp_garp_master_refresh
vrrp_garp_master_refresh_repeat
vrrp_garp_interval
vrrp_gna_interval

and the corresponding keywords per VRRP instance without "vrrp_" at the beginning.

There is also the global keyword vrrp_min_garp that makes keepalived send just one GARP for each VIP after the transition to master state, with no further GARPs 5 seconds later. vrrp_garp_extra_if can also be read about in the keepalived.conf(5) man page. There are also garp_groups.
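As an illustration only (the values are placeholders, not recommendations), the minimal-GARP behaviour is enabled globally like this:

global_defs {
    vrrp_min_garp true
}

or individual parameters can be tuned, e.g.:

global_defs {
    vrrp_garp_master_repeat 1    # GARPs sent per VIP on transition to master
    vrrp_garp_master_refresh 0   # no periodic GARP refresh while master
}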

heheii commented 3 weeks ago

Hi @pqarmitage - We have resolved the issue ourselves. Thanks very much for your inputs in understanding how keepalived works.

We want to know if we can use a virtual MAC address when the device fails over, like routers do with VRRP, so that the standby device doesn't need to announce its own MAC address in an ARP broadcast. The reason is that every time we fail over the system doesn't process traffic for 20-30 minutes (intermittently) and then starts processing; we think that when device B becomes master it has to update its MAC address on the connected switches. Anytime there is a state change, VRRP does its job of broadcasting ARPs so that its neighbours know it owns the list of IPs.

Why is the service occasionally unavailable for 20-30 minutes after a failover? Is the update of the ARP table slow due to a shortage of server resources? I have encountered a similar problem and am looking forward to your reply.

pqarmitage commented 3 weeks ago

In normal operation the service should be restored within a small number of seconds, although it depends on your configuration. When a VRRP instance takes over as master, it sends gratuitous ARP messages for the VIPs, and so ARP entries should be updated; I think inactive ARP entries time out in 5 minutes, so that shouldn't be the cause.

Without being able to see what is happening on your systems, what your configurations are, and what network traffic there is, it is clearly impossible for us to diagnose what is happening on your network. I suggest you inspect the keepalived logs, capture network traffic following a failover and look at where the relevant packets are going and where, if anywhere, they are dropped, look at ARP caches, and identify what is and is not happening. All that keepalived does in respect of VRRP is add and remove IP addresses and send ARP updates.
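For example, one way to see whether the VRRP adverts and gratuitous ARPs actually make it onto the wire around a failover (the interface name is a placeholder):

# watch VRRP and ARP traffic live, including MAC addresses
tcpdump -eni ens192 'vrrp or arp'

# or save a capture taken around the failover for later analysis
tcpdump -ni ens192 'vrrp or arp' -w /tmp/failover.pcap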