acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

keepalived recv-q full and multiple masters in the cluster #1080

Closed vadapalliravikumar closed 5 years ago

vadapalliravikumar commented 5 years ago

Sometimes the Recv-Q of a keepalived raw socket becomes full. When this happens on a master, that keepalived stops sending VRRP hellos (as seen with tcpdump), so keepalived on another node also becomes master, leaving two masters in the same cluster. The issue has been seen in both 2.0.8 and 2.0.10, but not consistently.
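The symptom described above can also be checked without netstat. The following is a minimal sketch, assuming the usual Linux `/proc/net/raw` layout, in which the "port" part of `local_address` holds the IP protocol number in hex (112 for VRRP) and the fifth column holds `tx_queue:rx_queue` in hex. The sample text, the inode numbers, and the function names are hypothetical and for illustration only.

```python
VRRP_PROTO = 112  # IP protocol number assigned to VRRP

# Hypothetical sample in the /proc/net/raw layout; values are illustrative.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 112: 00000000:0070 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 1001 2 0000000000000000 0
 112: 00000000:0070 00000000:0000 07 00000000:00034140 00:00000000 00000000     0        0 1002 2 0000000000000000 0
"""


def parse_raw_sockets(text, proto=VRRP_PROTO):
    """Return queue sizes for every raw socket bound to the given protocol."""
    sockets = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        # local_address looks like ADDR:PROTO, with PROTO in hex
        if int(fields[1].rsplit(":", 1)[1], 16) != proto:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        sockets.append({
            "inode": int(fields[9]),
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
        })
    return sockets


def backlogged(sockets, threshold=1):
    """Sockets whose receive queue holds at least `threshold` bytes."""
    return [s for s in sockets if s["rx_queue"] >= threshold]


def read_live(path="/proc/net/raw"):
    """Read the live table (IPv4 raw sockets only)."""
    with open(path) as fh:
        return parse_raw_sockets(fh.read())


if __name__ == "__main__":
    for sock in backlogged(parse_raw_sockets(SAMPLE)):
        print(sock)
```

The `threshold` default of 1 flags any non-empty queue; a single reading can be non-zero on a healthy system, so a real check would probably want a higher threshold or repeated samples.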

keepalived version: 2.0.8 & 2.0.10

Environment: Keepalived inside docker containers managed by kubernetes

$ netstat -nap | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           616/keepalived  
raw   213312      0 0.0.0.0:112             0.0.0.0:*               7           616/keepalived  
$ /usr/sbin/keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.4.162
Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018

configure options: --prefix /usr --disable-dynamic-linking

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS RTA_VIA FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL LWTUNNK

Config from one node in the 3 node cluster

# cat /etc/keepalived/keepalived.conf
global_defs {
  vrrp_version 3
  vrrp_iptables KEEPALIVED-VIP
  enable_script_security
  script_user keepalived_script
}

vrrp_script node_health_check {
  script       "/node_health_check.py"
  interval 60  # check every 60 seconds
  timeout 40   # Script Timeout of 40 seconds
  fall 3       # require 3 failures for FAULT Transition
}

vrrp_instance vip_10.64.89.185 {
  state BACKUP
  interface ens192
  virtual_router_id 151
  nopreempt
  advert_int 1

  track_interface {
    ens160
  }

  virtual_ipaddress {
    10.64.89.185 dev ens160
  }

  unicast_src_ip 1.1.1.182
  unicast_peer {
    1.1.1.183
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

vrrp_instance vip_1.1.1.185 {
  state BACKUP
  interface ens192
  virtual_router_id 150
  nopreempt
  advert_int 1

  track_interface {
    ens192
  }

  virtual_ipaddress {
    1.1.1.185 dev ens192
  }

  unicast_src_ip 1.1.1.182
  unicast_peer {
    1.1.1.183
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

Keepalived command with arguments

/usr/sbin/keepalived --vrrp --dont-fork --log-console --log-detail --release-vips --pid /etc/keepalived/keepalived.pid

plantroon commented 3 years ago

It happened again. This is what I caught:

root@server:~# ss -tulpenmw | egrep '^Netid|keepalived'
Netid State  Recv-Q Send-Q                    Local Address:Port   Peer Address:Port
???   UNCONN 0      0                               0.0.0.0:112         0.0.0.0:*                                                                                users:(("keepalived",pid=18396,fd=12)) ino:35360987 sk:2c <->
???   UNCONN 214656 0                         0.0.0.0%bond0:112         0.0.0.0:*                                                                                users:(("keepalived",pid=18396,fd=11)) ino:35360986 sk:2d <->

Strace (nothing happening):

root@server:~# strace -p 18395,18396
strace: Process 18395 attached
strace: Process 18396 attached
[pid 18395] timerfd_settime(5, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=9223372036854775807, tv_nsec=0}}, NULL) = 0
[pid 18395] epoll_wait(4,

I do not have any logs, as nothing was logged. I'll provide them if anything does get logged.

pqarmitage commented 3 years ago

@plantroon Could you please provide the output of keepalived -v; I need to know what version of the code to be looking at.

pqarmitage commented 3 years ago

@plantroon Could you please also provide copies of your keepalived configuration files.

The strace output you have included above is for the parent process. It seems as though the VRRP process isn't doing anything, and that is the process that we are interested in.
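In the `ss` output earlier in the thread the sockets belong to pid 18396, while the only activity strace shows comes from pid 18395. One way to tell the keepalived parent apart from its children is to compare the parent PIDs recorded in `/proc/<pid>/stat`. The sketch below is only an illustration: the stat lines are shortened, hypothetical samples, and the function names are invented here. It returns every keepalived process whose parent is also keepalived; which child is the VRRP process depends on what else (checker, BFD) is enabled.

```python
def parse_stat(line):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, ppid)."""
    pid_part, rest = line.split(" (", 1)
    # comm may itself contain ')' so split on the last occurrence
    comm, tail = rest.rsplit(") ", 1)
    fields = tail.split()
    return int(pid_part), comm, int(fields[1])  # fields[0] is the state


def child_keepalived_pids(stat_lines, name="keepalived"):
    """PIDs named `name` whose parent process is also named `name`."""
    procs = [parse_stat(line) for line in stat_lines]
    named = {pid for pid, comm, _ in procs if comm == name}
    return sorted(
        pid for pid, comm, ppid in procs if comm == name and ppid in named
    )


# Hypothetical, shortened stat lines reusing the PIDs from the thread.
SAMPLE_STATS = [
    "18395 (keepalived) S 1 18395 18395 0 -1",
    "18396 (keepalived) S 18395 18395 18395 0 -1",
    "200 (bash) S 1 200 200 0 -1",
]

if __name__ == "__main__":
    print(child_keepalived_pids(SAMPLE_STATS))
```

On a live system the lines would come from reading `/proc/<pid>/stat` for each PID, and the resulting child PID is the one to attach strace to.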

plantroon commented 3 years ago

@pqarmitage

root@server:~# keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.18.20
Running on Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07)

configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-kernel-dir=debian/ --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-dbus-create-instance --enable-json --enable-bfd build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -fdebug-prefix-map=/build/keepalived-8agDac/keepalived-2.0.10=. -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS=-Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2

Config options:  LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH JSON BFD OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS DBUS_CREATE_INSTANCE

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC LIBIPSET_PRE_V7 LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK

keepalived.conf:

vrrp_script smtpauthtest {
    script "/usr/bin/swaks --server localhost --auth-user smtptest@somedomain.ext --from=smtptest@somedomain.ext --to=smtptest@somedomain.ext --auth CRAM-MD5 -ap hoF2fi4U --port 587 --tls"
    interval 30
    fall 3
    rise 3
    timeout 10
    user system
    init_fail
}

vrrp_instance vip_wan_19 {
    state MASTER
    interface bond0
    advert_int 1
    virtual_router_id 19
    priority 100

    authentication {
        auth_type PASS
        auth_pass somepass
    }

    virtual_ipaddress {
        8.1.2.19/24 label bond0:19
        8.1.2.20/24 label bond0:20
        8.1.2.23/24 label bond0:23
        8.1.2.24/24 label bond0:24
    }

    track_script {
        smtpauthtest
    }

    notify /usr/local/bin/keepalived-notify.sh
}

> The strace output you have included above is for the parent process. It seems as though the VRRP process isn't doing anything, and that is the process that we are interested in.

Yeah, it is not doing anything :( The strace I took covered both the parent and the child process (anything that had keepalive in the name).

pqarmitage commented 3 years ago

@plantroon It appears that you are running keepalived v2.0.10, which is the version against which the original problem was reported. You need to upgrade to at least v2.0.12 to include the patches that were produced to resolve the problem. Upgrading to v2.2.2 would probably be more sensible.
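The version requirement stated above can be checked mechanically across a fleet. This is a small sketch, assuming the first line of `keepalived -v` has the form shown in this thread (`Keepalived v2.0.10 (11/12,2018)`); the function names and the `MIN_FIXED` constant are made up for this example.

```python
import re

MIN_FIXED = (2, 0, 12)  # first release said in this thread to carry the patches


def parse_version(banner):
    """Extract (major, minor, patch) from `keepalived -v` output."""
    match = re.search(r"Keepalived v(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no keepalived version found in banner")
    return tuple(int(part) for part in match.groups())


def has_fix(banner, minimum=MIN_FIXED):
    """True when the reported version is at least `minimum`."""
    return parse_version(banner) >= minimum


if __name__ == "__main__":
    for banner in ("Keepalived v2.0.10 (11/12,2018)", "Keepalived v2.2.7"):
        print(banner, "->", "ok" if has_fix(banner) else "upgrade needed")
```

Comparing integer tuples avoids the string-comparison trap where "2.0.9" would sort after "2.0.12".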

plantroon commented 3 years ago

> @plantroon It appears that you are running keepalived v2.0.10, which is the version against which the original problem was reported. You need to upgrade to at least v2.0.12 to include the patches that were produced to resolve the problem. Upgrading to v2.2.2 would probably be more sensible.

Well, it's not a big enough problem for me to bother upgrading by any means other than Debian packages. If Debian does not upgrade it in the Buster repo, I'll be happy if it's resolved in Bullseye (which has 2.1). That shouldn't be more than a few months away now.

ING-XIAOJIAN commented 1 year ago

I'm sorry to bother you. I encountered this issue in a production environment running keepalived v2.0.10. To resolve the problem as quickly as possible I upgraded to v2.2.7, since this is a production environment with many high-performance servers. I am worried about the issue happening again, so I would like to reproduce it on v2.0.10, understand it properly, and confirm that the latest version (v2.2.7) has resolved it. Any ideas?

keepalived -v

Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-libiptc --disable-ipset --enable-snmp --enable-snmp-rfc --enable-sha1 --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK

> keepalived.conf(master)

global_defs {
    router_id master_unique
}

vrrp_script chk_nginx {
    script "/usr/local/nginx_check.sh"
    interval 2
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface ens18
    virtual_router_id 55
    priority 202
    advert_int 1
    unicast_src_ip 192.168.2.105
    unicast_peer {
        192.168.2.106
    }
    authentication {
        auth_type PASS
        auth_pass xxxx
    }

    track_script {
        chk_nginx
    }
    virtual_ipaddress {
        192.168.2.107/15
    }
}

> system info

[root@gz-mss-soar-nginx02 log]# cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)
[root@gz-mss-soar-nginx02 log]#

pqarmitage commented 1 year ago

@ING-XIAOJIAN As explained above, we never understood what caused the problem or how to reproduce it; what we did discover is that the series of patches that were applied resolved the problem for @vadapalliravikumar. We could also see that the code prior to the changes was not right, and we resolved those problems.

Unfortunately, this means that we don't know what to suggest in order to prove that the problem is resolved.

We have had no further reports of this issue occurring with keepalived v2.0.12 or later in the nearly 4 years since v2.0.12 was released, so that should give you some comfort that a known problem in v2.0.10 has been resolved. The only other option I can suggest is to try v2.2.7 and monitor it to see if your problem is resolved.
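For the monitoring suggested here, one possible signal is the Recv-Q value of the VRRP raw sockets sampled over time: a socket that is being read should drain back towards zero, whereas the stuck sockets reported in this thread stayed at a large value. The sketch below is only a heuristic under that assumption. The function name, the `min_samples` default, and the sample readings are hypothetical; the readings would come from the Recv-Q column of `ss` or `netstat`.

```python
def queue_is_stuck(samples, min_samples=3):
    """Heuristic: the last `min_samples` Recv-Q readings are all non-zero
    and never decrease, suggesting nothing is reading the socket."""
    if len(samples) < min_samples:
        return False  # not enough history to judge
    recent = samples[-min_samples:]
    all_nonzero = all(value > 0 for value in recent)
    never_drains = all(b >= a for a, b in zip(recent, recent[1:]))
    return all_nonzero and never_drains


if __name__ == "__main__":
    healthy = [0, 1024, 0, 512]           # queue drains between samples
    stuck = [0, 70000, 140000, 213312]    # queue only grows
    print("healthy:", queue_is_stuck(healthy))
    print("stuck:", queue_is_stuck(stuck))
```

The sampling interval matters: with `advert_int 1`, readings taken a few seconds apart should be enough for a working process to have emptied the queue at least once.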

ING-XIAOJIAN commented 1 year ago

@pqarmitage Thank you very much for the explanation. The issue has not been seen since October 24, 2022, when I upgraded keepalived to v2.2.7. We will keep monitoring for it.