Closed vadapalliravikumar closed 5 years ago
It happened again. This is what I caught:
root@server:~# ss -tulpenmw | egrep '^Netid|keepalived'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
??? UNCONN 0 0 0.0.0.0:112 0.0.0.0:* users:(("keepalived",pid=18396,fd=12)) ino:35360987 sk:2c <->
??? UNCONN 214656 0 0.0.0.0%bond0:112 0.0.0.0:* users:(("keepalived",pid=18396,fd=11)) ino:35360986 sk:2d <->
Strace (nothing happening):
root@server:~# strace -p 18395,18396
strace: Process 18395 attached
strace: Process 18396 attached
[pid 18395] timerfd_settime(5, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=9223372036854775807, tv_nsec=0}}, NULL) = 0
[pid 18395] epoll_wait(4,
I do not have logs, as nothing was logged. I'll provide them if anything gets logged.
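For anyone who wants to catch this state automatically rather than by eye, here is a minimal sketch that scans `ss` output for keepalived sockets with a non-empty receive queue. It only parses text in the layout shown above; the function names (`parse_ss_line`, `backlogged_sockets`) and the `threshold` parameter are illustrative, not part of keepalived or iproute2. The local port 112 on these raw sockets is the IP protocol number assigned to VRRP.

```python
import re

# Sample copied from the ss output quoted in this thread.
SAMPLE = """\
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
??? UNCONN 0 0 0.0.0.0:112 0.0.0.0:* users:(("keepalived",pid=18396,fd=12)) ino:35360987 sk:2c <->
??? UNCONN 214656 0 0.0.0.0%bond0:112 0.0.0.0:* users:(("keepalived",pid=18396,fd=11)) ino:35360986 sk:2d <->
"""

def parse_ss_line(line):
    """Parse one socket line of `ss` output into a dict, or return None."""
    fields = line.split()
    # Assumed layout: Netid State Recv-Q Send-Q Local Peer [details...]
    if len(fields) < 6 or not (fields[2].isdigit() and fields[3].isdigit()):
        return None  # header line or unexpected format
    proc = re.search(r'\(\("([^"]+)",pid=(\d+),fd=(\d+)\)', line)
    return {
        "recv_q": int(fields[2]),
        "send_q": int(fields[3]),
        "local": fields[4],
        "process": proc.group(1) if proc else None,
        "pid": int(proc.group(2)) if proc else None,
        "fd": int(proc.group(3)) if proc else None,
    }

def backlogged_sockets(ss_output, process="keepalived", threshold=0):
    """Return sockets owned by `process` whose Recv-Q exceeds `threshold` bytes."""
    hits = []
    for line in ss_output.splitlines():
        sock = parse_ss_line(line)
        if sock and sock["process"] == process and sock["recv_q"] > threshold:
            hits.append(sock)
    return hits

if __name__ == "__main__":
    for sock in backlogged_sockets(SAMPLE):
        print(f"pid {sock['pid']} fd {sock['fd']} on {sock['local']}: "
              f"{sock['recv_q']} bytes queued")
```

In practice the text would come from running `ss -tulpenmw` periodically (for example via `subprocess.run`) instead of the embedded sample; a Recv-Q that stays high across several samples is the symptom reported here.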
@plantroon Could you please provide the output of keepalived -v; I need to know what version of the code to be looking at.
@plantroon Could you please also provide copies of your keepalived configuration files.
The strace output you have included above is for the parent process. It seems as though the VRRP process isn't doing anything, and that is the process that we are interested in.
@pqarmitage
root@server:~# keepalived -v
Keepalived v2.0.10 (11/12,2018)
Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>
Built with kernel headers for Linux 4.18.20
Running on Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07)
configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-kernel-dir=debian/ --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-dbus-create-instance --enable-json --enable-bfd build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -fdebug-prefix-map=/build/keepalived-8agDac/keepalived-2.0.10=. -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS=-Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2
Config options: LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH JSON BFD OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS DBUS_CREATE_INSTANCE
System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC LIBIPSET_PRE_V7 LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK
keepalived.conf:
vrrp_script smtpauthtest {
script "/usr/bin/swaks --server localhost --auth-user smtptest@somedomain.ext --from=smtptest@somedomain.ext --to=smtptest@somedomain.ext --auth CRAM-MD5 -ap hoF2fi4U --port 587 --tls"
interval 30
fall 3
rise 3
timeout 10
user system
init_fail
}
vrrp_instance vip_wan_19 {
state MASTER
interface bond0
advert_int 1
virtual_router_id 19
priority 100
authentication {
auth_type PASS
auth_pass somepass
}
virtual_ipaddress {
8.1.2.19/24 label bond0:19
8.1.2.20/24 label bond0:20
8.1.2.23/24 label bond0:23
8.1.2.24/24 label bond0:24
}
track_script {
smtpauthtest
}
notify /usr/local/bin/keepalived-notify.sh
}
> The strace output you have included above is for the parent process. It seems as though the VRRP process isn't doing anything, and that is the process that we are interested in.
Yeah, it is not doing anything :( I ran strace on both the parent and the child process (anything that had keepalive in the name).
@plantroon It appears that you are running keepalived v2.0.10, which is the version against which the original problem was reported. You need to upgrade to at least v2.0.12 to include the patches that were produced to resolve the problem. Upgrading to v2.2.2 would probably be more sensible.
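To check a fleet of hosts against the v2.0.12 threshold mentioned above, a small sketch like the following could compare the first line of `keepalived -v` output with the minimum fixed version. The function names and the `FIXED_IN` constant are illustrative; the only assumption about keepalived itself is the `Keepalived vX.Y.Z` banner format shown in this thread.

```python
import re

# Minimum version said in this thread to contain the fixes.
FIXED_IN = (2, 0, 12)

def parse_keepalived_version(banner):
    """Extract (major, minor, patch) from a 'Keepalived vX.Y.Z ...' banner line."""
    match = re.search(r"Keepalived v(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no keepalived version found in: {banner!r}")
    return tuple(int(part) for part in match.groups())

def has_fix(banner, minimum=FIXED_IN):
    """True if the banner's version is at least `minimum` (tuple comparison)."""
    return parse_keepalived_version(banner) >= minimum

if __name__ == "__main__":
    for banner in ("Keepalived v2.0.10 (11/12,2018)", "Keepalived v2.2.7"):
        status = "ok" if has_fix(banner) else "upgrade needed"
        print(f"{banner}: {status}")
```

Comparing integer tuples avoids the string-comparison trap where "2.0.10" sorts after "2.0.12" character by character only by luck and "2.0.9" would sort after both.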
Well, it's not a big enough problem for me to bother upgrading in any way other than through Debian packages. If Debian does not upgrade it in the repo for Buster, I'll be happy if it's resolved by Bullseye (which has 2.1). That shouldn't be more than a few months away now.
Sorry to bother you. I encountered this issue in a production environment running keepalived v2.0.10. To resolve the problem ASAP, I upgraded to v2.2.7. Because this is a production environment with many high-performance servers, I am worried about the issue happening again. I would therefore like to reproduce the issue on v2.0.10, understand it thoroughly, and confirm that the latest version (v2.2.7) has resolved it. Any ideas?
keepalived -v
Keepalived v2.0.10 (11/12,2018)
Copyright(C) 2001-2018 Alexandre Cassen, acassen@gmail.com
Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020
configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-libiptc --disable-ipset --enable-snmp --enable-snmp-rfc --enable-sha1 --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3
System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK
> keepalived.conf(master)
global_defs {
router_id master_unique
}
vrrp_script chk_nginx {
script "/usr/local/nginx_check.sh"
interval 2
weight -20
}
vrrp_instance VI_1 {
state MASTER
interface ens18
virtual_router_id 55
priority 202
advert_int 1
unicast_src_ip 192.168.2.105
unicast_peer{
192.168.2.106
}
authentication {
auth_type PASS
auth_pass xxxx
}
track_script {
chk_nginx
}
virtual_ipaddress{
192.168.2.107/15
}
}
> system info
[root@gz-mss-soar-nginx02 log]# cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)
[root@gz-mss-soar-nginx02 log]#
@ING-XIAOJIAN As explained above, we never understood what caused the problem or how to reproduce it; what we did discover is that the series of patches that were applied resolved the problem for @vadapalliravikumar. We could also see that the code prior to the changes was not right, and we resolved those problems.
Unfortunately the above means that we don't know what to suggest to try and prove that the problem is resolved.
We have had no further reports of this issue occurring with keepalived v2.0.12 or later in the nearly 4 years since v2.0.12 was released, so that should give you some comfort that a known problem in v2.0.10 has been resolved. The only other option I can suggest is to try v2.2.7 and monitor it to see if your problem is resolved.
@pqarmitage Thank you very much for your explanation. The issue has not been seen since October 24, 2022, when I upgraded keepalived to v2.2.7. We will keep monitoring for it.
Sometimes, the recv-q of a keepalived raw socket becomes full. When this happens, if that keepalived is a master, it stops sending VRRP hellos (as seen from tcpdump). This leads to keepalived on another node becoming master, resulting in a state where there are 2 masters in the same cluster. The issue is seen in both 2.0.8 & 2.0.10, but not consistently.
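To make the takeover timing concrete: under VRRPv2 (RFC 3768), a backup declares the master down after Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256 seconds. The sketch below just evaluates that formula. It assumes VRRPv2 timers; the advert interval of 1 second and the backup priority of 100 are hypothetical example values, since this report does not include the config.

```python
def skew_time(priority):
    """RFC 3768 Skew_Time in seconds for a backup with the given priority."""
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in the range 1-254")
    return (256 - priority) / 256

def master_down_interval(advert_int, priority):
    """Seconds a backup waits without hearing adverts before taking over."""
    return 3 * advert_int + skew_time(priority)

if __name__ == "__main__":
    # Hypothetical values: 1 second adverts, backup priority 100.
    seconds = master_down_interval(advert_int=1, priority=100)
    print(f"backup takes over after {seconds} s of silence")
```

So once the stuck master stops sending hellos, a second master can appear within a few seconds, while the original master still holds the virtual addresses.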
keepalived version: 2.0.8 & 2.0.10
Environment: Keepalived inside docker containers managed by kubernetes
Config from one node in the 3 node cluster
Keepalived command with arguments