Keepalived
https://www.keepalived.org
GNU General Public License v2.0

100% CPU keepalived CentOS 8 #1492

Closed obla4ko closed 3 years ago

obla4ko commented 4 years ago

Describe the bug: After a restart of the process, within 1-5 days the keepalived process begins to use 100% of the CPU in userspace.

To Reproduce: systemctl restart keepalived, then wait 2-5 days.

Keepalived version: keepalived-2.0.10-4.el8_0.2.x86_64

Output of keepalived -v:

Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 4.18.0 Running on Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-snmp --enable-snmp-rfc --enable-sha1 --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options: LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC LIBIPSET_PRE_V7 LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK

Distro (please complete the following information):

NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

Details of any containerisation or hosted service (e.g. AWS): if keepalived is being run in a container or on a hosted service, provide full details.

Configuration file:

! Configuration File for keepalived

! configure for sflow
virtual_server 1.1.1.1 6343 {
    delay_loop 6
    lb_algo sh
    lb_kind NAT
    protocol UDP

    real_server 2.2.2.2 6343 {
        weight 1
        MISC_CHECK {
            misc_path "/opt/lb.sh 2.2.2.2"
            retry 2
            misc_timeout 15
        }
    }
    real_server 3.3.3.3 6343 {
        weight 1
        MISC_CHECK {
            misc_path "/opt/lb.sh 3.3.3.3"
            retry 2
            misc_timeout 15
        }
    }
    real_server 4.4.4.4 6343 {
        weight 1
        MISC_CHECK {
            misc_path "/opt/lb.sh 4.4.4.4"
            retry 2
            misc_timeout 15
        }
    }
    real_server 5.5.5.5 6343 {
        weight 1
        MISC_CHECK {
            misc_path "/opt/lb.sh 5.5.5.5"
            retry 2
            misc_timeout 15
        }
    }
}

Notify and track scripts: cat /opt/lb.sh

#!/bin/bash

IP=$1

# grep needs -E so that | acts as alternation when counting the Running services
report=$(/usr/bin/curl -s --connect-timeout 3 --insecure --user user:password https://"${IP}"/api/gateway/1.5/system/status | grep -E 'sensor|diskmon|memmonitor' | grep Running | wc -l)

echo "$report"

if [ "$report" -eq 3 ]; then
    exit 0
else
    exit 1
fi
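One thing worth checking in the script above: with plain grep, the `|` in 'sensor|diskmon|memmonitor' is matched literally, so the count always comes out 0 and the check fails. A short sketch illustrating the difference (the sample status lines are hypothetical, standing in for the API response):

```shell
# Hypothetical status output from the appliance API:
sample='sensor Running
diskmon Running
memmonitor Running'

# Plain grep treats | as a literal character, so nothing matches:
echo "$sample" | grep 'sensor|diskmon|memmonitor' | grep -c Running    # prints 0

# grep -E (or egrep) treats | as alternation and matches all three lines:
echo "$sample" | grep -E 'sensor|diskmon|memmonitor' | grep -c Running # prints 3
```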

System Log entries: full keepalived system log entries from when keepalived started.

Did keepalived coredump? Debugging with the strace utility shows nothing on a process that is already at 100% load.

Additional context

[root@host1 ~]# ps aux | grep keepalived
root 1883 0.0 0.0 9184 1084 pts/0 R+ 12:20 0:00 grep --color=auto keepalived
root 29456 0.0 0.0 111520 1012 ? Ss 2019 0:00 /usr/sbin/keepalived -D
root 29457 91.8 0.1 124208 6180 ? R 2019 66275:07 /usr/sbin/keepalived -D

[root@host1~]# curl --version
curl 7.61.1 (x86_64-redhat-linux-gnu) libcurl/7.61.1 OpenSSL/1.1.1 zlib/1.2.11 brotli/1.0.6 libidn2/2.0.5 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.8.5/openssl/zlib nghttp2/1.33.0
Release-Date: 2018-09-05
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz brotli TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL Metalink

sysctl.conf

net.core.rmem_max=26214400
net.core.rmem_default=26214400
net.ipv4.ip_forward=1
net.ipv4.vs.expire_nodest_conn=1
net.ipv4.vs.expire_quiescent_template=1

[root@ipvs-host1t ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-12-12 10:09:41 MSK; 1 months 19 days ago
  Process: 29455 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 29456 (keepalived)
    Tasks: 2 (limit: 24896)
   Memory: 10.9M
   CGroup: /system.slice/keepalived.service
           ├─29456 /usr/sbin/keepalived -D
           └─29457 /usr/sbin/keepalived -D

Dec 16 08:22:06 host1 Keepalived_healthcheckers[29457]: Child (PID 79470) failed to terminate after kill
Dec 16 08:22:10 host1 Keepalived_healthcheckers[29457]: Child (PID 83476) failed to terminate after kill
Dec 16 08:22:13 host1 Keepalived_healthcheckers[29457]: Child (PID 83488) failed to terminate after kill
Dec 16 08:22:14 host1 Keepalived_healthcheckers[29457]: Child (PID 83452) failed to terminate after kill
Dec 16 08:22:14 host1 Keepalived_healthcheckers[29457]: Child (PID 83494) failed to terminate after kill
Dec 16 08:22:16 host1 Keepalived_healthcheckers[29457]: Child (PID 79470) failed to terminate after kill
Dec 16 08:22:20 host1 Keepalived_healthcheckers[29457]: Child (PID 83476) failed to terminate after kill
Dec 16 08:22:23 host1 Keepalived_healthcheckers[29457]: Child (PID 83488) failed to terminate after kill
Dec 16 08:22:24 host1 Keepalived_healthcheckers[29457]: Child (PID 83452) failed to terminate after kill
Dec 16 08:22:24 host1 Keepalived_healthcheckers[29457]: Child (PID 83494) failed to terminate after kill

pqarmitage commented 4 years ago

There have been some issues with keepalived's handling of script termination that were resolved in v2.0.20, so please build and run that version and see if you still get the problem.

With the configuration you have, once keepalived has started running it is doing nothing other than running the scripts. Since the scripts have a timeout of 15 seconds and a retry time of 2 seconds, keepalived really should not be busy. This suggests that there is some problem with running the scripts.

Are the "failed to terminate after kill" messages being produced while keepalived is using 100% CPU time?

I think what you will need to do when/if the problem happens again after upgrading to v2.0.20 is:

  1. Attach to the process that is at 100% with gdb and generate a stack backtrace. Please do this several times so that we can try to see what functions keepalived is looping in.
  2. Try running strace again to see if it produces any output.
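Before attaching gdb, a quick sanity check that the spin really is in userspace is to sample the process's cumulative CPU ticks from /proc. This is a sketch; substitute the PID of whichever keepalived process is at 100% (29457 in the report above) — `$$` is only a runnable stand-in:

```shell
pid=$$   # stand-in: replace with the spinning keepalived PID, e.g. 29457
# Fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks.
t1=$(awk '{print $14 + $15}' "/proc/$pid/stat")
sleep 2
t2=$(awk '{print $14 + $15}' "/proc/$pid/stat")
# At the usual 100 Hz tick rate, ~200 ticks over 2 seconds means 100% of one core;
# a high utime share with no strace output points at a userspace loop.
echo "CPU ticks consumed in 2s: $((t2 - t1))"
```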

If the problem still persists then it will be necessary to do a debug build of keepalived so that we can get a thread dump while keepalived is looping.

pqarmitage commented 4 years ago

See also https://groups.io/g/keepalived-users/topic/69939621#80 which appears to be a description of the same problem.

pqarmitage commented 4 years ago

@obla4ko Are you running keepalived/CentOS 8 in a virtualised or containerised environment? Could you also indicate what network driver is being used? Jens, in the email thread linked to above, suggests it might relate to using the vmxnet3 driver.

szpajder commented 4 years ago

Same thing here, but on Debian Buster under VMWare, with vmxnet3.

Buster comes with keepalived 2.0.10 and it also hit me with https://github.com/acassen/keepalived/issues/1083 , so I upgraded to 2.0.19 yesterday and then found this issue, which I also observed on 2.0.10 a few times. Let's see if it makes any difference. If it doesn't, I'll try to dig into it a bit more (didn't have time so far).

pqarmitage commented 4 years ago

I am currently working on a patch to limit the size of the stack and heap, due to what was reported at https://groups.io/g/keepalived-users/topic/69939621#80 . I will also work on a patch to limit CPU utilisation, but in the meantime, if you set vrrp_rt_priority and/or checker_rt_priority and/or bfd_rt_priority to 1 (depending on which keepalived process is using 100% CPU), then the CPU utilisation between system calls will be limited and should result in a SIGKILL.
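A minimal sketch of that workaround in keepalived.conf (the option names are keepalived global_defs options; the value 1 follows the suggestion above — which of the three you need depends on which process is spinning):

```
global_defs {
    # Cap each keepalived process at real-time priority 1 so a looping
    # process is constrained between system calls, per the comment above.
    vrrp_rt_priority 1
    checker_rt_priority 1
    bfd_rt_priority 1
}
```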

fliespl commented 4 years ago

> Same thing here, but on Debian Buster under VMWare, with vmxnet3.
>
> Buster comes with keepalived 2.0.10 and it also hit me with #1083, so I upgraded to 2.0.19 yesterday and then found this issue, which I also observed on 2.0.10 a few times. Let's see if it makes any difference. If it doesn't, I'll try to dig into it a bit more (didn't have time so far).

@szpajder Has the new version been stable for you? We are struggling with the same issue on Hetzner Cloud VMs.

szpajder commented 4 years ago

Yes, it's stable. Not a single occurrence of the problem since the upgrade, which took place on February 11.

fliespl commented 4 years ago

@szpajder thanks much for this info :)

dfoxg commented 4 years ago

Are there plans to update the version in Debian Buster? The current version https://packages.debian.org/source/buster/keepalived (2.0.10-1) is affected by this bug.

pqarmitage commented 4 years ago

@DanielFuchs98 You would need to take this up with Debian. We are upstream and it is the distros who manage what they package and what versions.

However, it should be quite straightforward for you to build your own .deb package from the latest version of keepalived and install that.

dfoxg commented 4 years ago

@pqarmitage thanks for the info. I've compiled it and replaced the Debian version. I will give you an update in 2-3 days.

dfoxg commented 4 years ago

Looks good!

midorinet commented 3 years ago

This issue still persists on 2.1.5.

If I start keepalived on server A, then server B's CPU starts climbing until it reaches 100%.

It seems that removing the VIP from a server is what drives the CPU high.

pqarmitage commented 3 years ago

@midorinet Can you please provide precise details of what you are doing.

I suspect that your problem is different from the problem that was being experienced in the reports up until May this year, since they are all reported as resolved.

In particular we will need the following information:

  1. keepalived configuration files from each of your servers.
  2. A detailed description of the sequence of events leading up to server B CPU using 100% CPU time
  3. Copies of any modified configuration files (I am not clear if "removing VIP from the servers" means that you are reloading the configuration with one or more VIPs removed, or if you are executing a command like ip addr del ...)

midorinet commented 3 years ago

please find below for the config

global_defs {
    lvs_id server01
}

vrrp_instance server01 {
    state BACKUP
    interface bond0.32
    virtual_router_id 51
    priority 114
    advert_int 2
    unicast_src_ip 192.168.1.68

    unicast_peer {
        192.168.1.69
        192.168.1.70
        192.168.1.71
    }
    virtual_ipaddress {
        192.168.100.14/32 dev bond0.165
    }
    authentication {
        auth_type PASS
        auth_pass XxXxX
    }
}

vrrp_instance server02 {
    state BACKUP
    interface bond0.32
    virtual_router_id 52
    priority 121
    advert_int 2
    unicast_src_ip 192.168.1.68

    unicast_peer {
        192.168.1.69
        192.168.1.70
        192.168.1.71
    }
    virtual_ipaddress {
        192.168.100.15/32 dev bond0.165
    }
    authentication {
        auth_type PASS
        auth_pass XxXxX
    }
}

vrrp_instance server03 {
    state BACKUP
    interface bond0.32
    virtual_router_id 53
    priority 132
    advert_int 2
    unicast_src_ip 192.168.1.68

    unicast_peer {
        192.168.1.69
        192.168.1.70
        192.168.1.71
    }
    virtual_ipaddress {
        192.168.100.16/32 dev bond0.165
    }
    authentication {
        auth_type PASS
        auth_pass XxXxX
    }
}

vrrp_instance server04 {
    state BACKUP
    interface bond0.32
    virtual_router_id 54
    priority 143
    advert_int 2
    unicast_src_ip 192.168.1.68

    unicast_peer {
        192.168.1.69
        192.168.1.70
        192.168.1.71
    }
    virtual_ipaddress {
        192.168.100.17/32 dev bond0.165
    }
    authentication {
        auth_type PASS
        auth_pass XxXxX
    }
}

The config is the same on all 4 servers; the only differences are in the state and priority parts.

So I stopped keepalived on server01, and the VIP moved to server02. However, server01's CPU load is spiking, specifically in the "si" column. Even though I start the keepalived service on server01 again, "si" keeps going up.

I also tried rebooting server01 instead of stopping the keepalived service, but once the server is back up, "si" increases again.

%Cpu(s):  1.0 us,  0.9 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.0 hi,  **98,7 si**,  0.0 st

It's on Debian Buster (10.6) and keepalived 2.1.5.

Keepalived v2.1.5 (unknown)

Copyright(C) 2001-2020 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.19.146
Running on Linux 4.19.0-10-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24)

configure options: 

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

Please advise.

pqarmitage commented 3 years ago

You state that the configuration is the same for all the servers except for state and priority. I presume that the unicast_src_ip and unicast_peer parts of the config are different as well.

The si value is time spent handling software interrupts (softirqs), presumably executing in the context of one of the ksoftirqd processes. Where you show **98,7 si**, should it be 98.7 (i.e. a . rather than a ,), or is it actually displayed with a comma?

When you get the high si value, is a keepalived process still running, or have they all terminated?

My understanding is that only the kernel can raise software interrupts, so it is unlikely to be keepalived that is the problem.

You state that even if you reboot server01 you still get the very high software interrupt value. If you disable the keepalived service (so that it doesn't run after a reboot) and then reboot the system, do you still get the very high si value?

Also, do you get the same behaviour on server02, 03 or 04?

midorinet commented 3 years ago

> You state that the configuration is the same for all the servers except for state and priority. I presume that the unicast_src_ip and unicast_peer parts of the config are different as well.

Yes, unicast_src_ip and unicast_peer are also different on each server.

> The si value is time spent handling software interrupts (softirqs), and I presume executing in the context of one of the ksoftirqd processes. Where you show **98,7 si** should it be 98.7 (i.e. . rather than ,), or is it displayed with a comma?

Sorry, it was displayed using . instead of a comma: %Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 25.1 id, 0.0 wa, 0.0 hi, 74.6 si, 0.0 st

> When you get the high si value, is a keepalived process still running, or have they all terminated?

Yes

> My understanding is that only the kernel can raise software interrupts, so it is unlikely to be keepalived that is the problem.
>
> You state that even if you reboot server01 you still get the very high software interrupt value. If you disable the keepalived service (so that it doesn't run after a reboot) and then reboot the system, do you still get the very high si value?

If keepalived is not started, the server is fine with the other services running.

> Also, do you get the same behaviour on server02, 03 or 04?

Yes

So I just tried a new scenario:

  1. Stopped keepalived on server01; si got high.
  2. Disabled keepalived at startup on server01, then rebooted server01; server01 is fine.
  3. Started keepalived on server01 manually; server01 is still fine.
  4. Got the logs below on server02:

Oct 17 09:37:09 server02 Keepalived_vrrp[15504]: (server01) Master received advert from 192.168.1.68 with higher priority 114, ours 113
Oct 17 09:37:09 server02 Keepalived_vrrp[15504]: (server01) Entering BACKUP STATE
Oct 17 09:37:09 server02 Keepalived_vrrp[15504]: (server01) removing VIPs.

Then server02's si got high: %Cpu(s): 2.5 us, 0.6 sy, 0.0 ni, 8.9 id, 0.0 wa, 0.0 hi, 88.0 si, 0.0 st

I stopped keepalived on server02 and verified by checking the status:

* keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/lib/systemd/system/keepalived.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Oct 17 09:37:09 server02 Keepalived_vrrp[15504]: (server01) removing VIPs.
Oct 17 09:38:47 server02 Keepalived[15503]: Stopping
Oct 17 09:38:47 server02 Keepalived_vrrp[15504]: (server02) sent 0 priority
Oct 17 09:38:47 server02 Keepalived_vrrp[15504]: (server02) removing VIPs.
Oct 17 09:38:47 server02 systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Oct 17 09:38:48 server02 Keepalived_vrrp[15504]: Stopped - used 1.986353 user time, 5.916783 system time
Oct 17 09:38:48 server02 Keepalived[15503]: CPU usage (self/children) user: 0.002683/1.986353 system: 0.000000/5.920514
Oct 17 09:38:48 server02 Keepalived[15503]: Stopped Keepalived v2.1.5 (07/13,2020)
Oct 17 09:38:48 server02 systemd[1]: keepalived.service: Succeeded.
Oct 17 09:38:48 server02 systemd[1]: Stopped LVS and VRRP High Availability Monitor.

Then I disabled keepalived on startup for server02 and rebooted it; server02 is fine. When I started keepalived manually on server02, server03's si started going up.

Now without rebooting: since server03's si was up, I stopped keepalived on server03. After waiting about 10-15 minutes, si on server03 returned to normal. When I started keepalived on server03 manually, server04's si started going up.

Logs on server04 when si is up:

Oct 17 09:56:08 cd71 Keepalived_vrrp[1930]: (server03) Master received advert from 192.168.1.70 with higher priority 134, ours 133
Oct 17 09:56:08 server04 Keepalived_vrrp[1930]: (server03) Entering BACKUP STATE
Oct 17 09:56:08 server04 Keepalived_vrrp[1930]: (server03) removing VIPs.
Oct 17 09:56:24 server04 Keepalived[1929]: Stopping
Oct 17 09:56:24 server04 Keepalived_vrrp[1930]: (server04) sent 0 priority
Oct 17 09:56:24 server04 systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Oct 17 09:56:24 server04 Keepalived_vrrp[1930]: (server04) removing VIPs.
Oct 17 09:56:25 server04 Keepalived_vrrp[1930]: Stopped - used 1.943324 user time, 5.827779 system time
Oct 17 09:56:25 server04 Keepalived[1929]: CPU usage (self/children) user: 0.000730/1.945711 system: 0.000000/5.827870
Oct 17 09:56:25 server04 Keepalived[1929]: Stopped Keepalived v2.1.5 (07/13,2020)
Oct 17 09:56:25 server04 systemd[1]: keepalived.service: Succeeded.
Oct 17 09:56:25 server04 systemd[1]: Stopped LVS and VRRP High Availability Monitor.

So far what I've done is keep stopping/starting keepalived manually, hoping si drops to no more than 30. After that, sometimes everything goes well and si stays under 0.5.

pqarmitage commented 3 years ago

> when disable keepalived on startup for server02, reboot it; server02 is fine; start keepalived manually on server02, then server03 si is going up

I assume at this point the server02 vrrp instance on the server03 server has transitioned to backup.

> now without rebooting, since server03 si is up, i stopped keepalived at server03; waiting for about 10-15mins, si on server03 is becoming normal; start keepalived on server03 manually; server04 si is going up

I assume that when keepalived is started on server03, the server03 vrrp instance on server04 transitions to backup.

From the logs above when the si is high on server04, it looks to me as though keepalived has already terminated; this means that keepalived cannot be causing the high si. When you have the high si value, can you see any process that is using CPU time? If so that might give an indication about what is happening.

I have run your configuration on exactly the same version of Debian as you, with the same kernel, and I cannot reproduce what you are seeing.

I am wondering if keepalived transitioning a vrrp instance to backup is causing something else to start misbehaving. It could be a process that monitors networking changes, such as addresses being deleted, then starts doing something strange and keeps interacting with the kernel and somehow that causes software interrupts; a process such as NetworkManager certainly monitors for network changes.

midorinet commented 3 years ago

Hi

Thanks for your explanations. However, if I understand correctly, the services running on the server don't do anything network-related.

I only have ntp running, node exporter for Prometheus, and nginx inside a container. Do you think it could be one of them?

If I run tcpdump I can see other servers trying to do VRRP; I think it's a misconfiguration on their side. But judging from the auth, it shouldn't be a problem.

Based on that info, what do you think I should check further?

Thanks again for your helpful input.

pqarmitage commented 3 years ago

@midorinet I just need you to clarify a few points. Each of the points that I need an answer to is in bold. Without an answer to each of the points I won't be able to suggest how to investigate the problem further.

You wrote:

> when disable keepalived on startup for server02, reboot it
> server02 is fine
> start keepalived manually on server02, then server03 si is going up
>
> now without rebooting, since server03 si is up, i stopped keepalived at server03
> waiting for about 10-15mins, si on server03 is becoming normal

**Can you please confirm that after keepalived is stopped on server03, the si on server03 is still high.**

**Can you also confirm that over a period of about 10-15 minutes the si on server03 gradually reduces to becoming normal.**

You have also stated above:

> If i do tcpdump i can see other servers is trying to do vrrp, i think it's a miss configuration from their side. But if i see from the auth, it shouldn't be a problem

**What are the IP addresses of the other servers? What VRIDs are each of them using? Can you please provide the keepalived configuration from the other servers.**

I have noticed you have a log message: (server03) Master received advert from 103.6.117.70 with higher priority 134, ours 133. 103.6.117.70 is not one of server01, server02, server03 or server04 from the configuration you have provided above. **What is the 103.6.117.70 system? Can you please provide the keepalived configuration from 103.6.117.70.**

midorinet commented 3 years ago

> @midorinet I just need you to clarify a few points. Each of the points that I need an answer to is in bold. Without an answer to each of the points I won't be able to suggest how to investigate the problem further.
>
> You wrote:
>
> > when disable keepalived on startup for server02, reboot it
> > server02 is fine
> > start keepalived manually on server02, then server03 si is going up
> >
> > now without rebooting, since server03 si is up, i stopped keepalived at server03
> > waiting for about 10-15mins, si on server03 is becoming normal

> Can you please confirm that after keepalived is stopped on server03, the si on server03 is still high.

Yes, I can confirm.

> Can you also confirm that over a period of about 10-15 minutes the si on server03 gradually reduces to becoming normal.

Yes, I can confirm.

> You have also stated above: If i do tcpdump i can see other servers is trying to do vrrp, i think it's a miss configuration from their side. But if i see from the auth, it shouldn't be a problem
>
> What are the IP addresses of the other servers?

It's from 192.168.1.254, which is not in the keepalived set.

> What VRIDs are each of them using?

VRID 165. It's actually from a Cisco switch.

Here's the tcpdump output that I get:

01:23:11.326357 IP (tos 0xc0, ttl 255, id 0, offset 0, flags [none], proto VRRP (112), length 40)
    192.168.100.1 > 224.0.0.18: vrrp 192.168.100.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 165, prio 130, authtype simple, intvl 1s, length 20, addrs: 103.50.216.254 auth "tingtengtong^@"
01:23:12.190374 IP (tos 0xc0, ttl 255, id 0, offset 0, flags [none], proto VRRP (112), length 40)
    192.168.100.1 > 224.0.0.18: vrrp 192.168.100.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 165, prio 130, authtype simple, intvl 1s, length 20, addrs: 192.168.100.254 auth "tingtengtong^@"

> Can you please provide the keepalived configuration from the other servers. I have noticed you have a log message: (server03) Master received advert from 103.6.117.70 with higher priority 134, ours 133. 103.6.117.70 is not one of server01, server02, server03 or server04 from the configuration you have provided above. What is the 103.6.117.70 system?

Sorry, that's the real IP address. In the logs I provided, I changed it to 192.168.1.70.

> Can you please provide the keepalived configuration from 103.6.117.70

Another curiosity: could bonded ethernet using the 802.3ad method possibly cause this problem?

pqarmitage commented 3 years ago

@midorinet Apologies for the delay in this latest response.

There should not be any problem using 802.3ad bonding; I have used it myself in the past.

From your answers above, it appears that keepalived is not the direct cause of the high si CPU time, since you say that the high si CPU time continues after keepalived exits. What seems to be the trigger for the high si CPU time starting is a VIP being removed. I don't know whether you also need to have the VIP then being added on another system to trigger the problem, or whether that is not relevant.

I suggest you try the following:

  1. Stop keepalived on all 4 systems and wait for the si CPU time to settle on all systems. Start keepalived on system01 only, wait 1 minute and then stop keepalived. Does the si CPU time become high on system01?
  2. Stop keepalived on all 4 systems and wait for the si CPU time to settle on all systems. Start keepalived on system01. 1 minute later start keepalived on system02. Does the si CPU time become high on system01 (some VIPs will have been deleted on system01)? If the si has increased, let it settle again. Stop keepalived on system01; does the si CPU time become high on system01?
  3. Without running keepalived on any of your 4 systems, manually add the VIP - ip addr add 192.168.100.17/32 dev bond0.165 on system01. Wait one minute and delete the VIP - does the si CPU time become high on system01?

The above will confirm or otherwise whether removing VIPs triggers the problem.

What you could then do is add track files on each vrrp instance on each of your 4 systems. That way you can change the priorities of the vrrp instances without stopping keepalived; this means that you can cause another system to take over as master for one of the vrrp instances, which will trigger the old master removing its VIPs. You could then work through various scenarios to see what triggers the high si CPU time.
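A hypothetical sketch of that track-file approach (the file name, path, and weight here are illustrative; in keepalived 2.1.x the top-level keyword is vrrp_track_file):

```
vrrp_track_file prio_adjust {
    file /etc/keepalived/prio_adjust
    weight 1
}

vrrp_instance server01 {
    # ... existing instance configuration as above ...
    track_file {
        prio_adjust
    }
}
```

Writing a number to the file (e.g. echo 20 > /etc/keepalived/prio_adjust) then adjusts the instance's effective priority by value × weight, so mastership can be moved between systems without stopping keepalived.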

midorinet commented 3 years ago

Hi

Thanks for your advice. However, since this is a production environment, I will first need to reproduce the issue with traffic as well.

BTW, I just tried the following:

  1. stopping keepalived on server01
  2. starting keepalived on server01
  3. si% start increase on server02
  4. stop keepalived on server02
  5. wait a few minutes until si is around 30
  6. adding 192.168.100.15/32 manually by ifconfig and arping on server02
  7. traffic goes back to server02
  8. si on server03 is stable
  9. start keepalived on server02
  10. si on server03 is increasing

I also ran another CPU monitoring tool, mpstat.

While si was about 70, below is the output of mpstat -P ALL:

04:50:46     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:50:46     all    0.68    0.00    0.53    0.00    0.00    0.59    0.00    0.00    0.00   98.21
04:50:46       0    0.69    0.00    0.55    0.00    0.00    2.82    0.00    0.00    0.00   95.94
04:50:46       1    0.76    0.00    0.55    0.00    0.00    1.04    0.00    0.00    0.00   97.64
04:50:46       2    0.76    0.00    0.55    0.00    0.00    0.67    0.00    0.00    0.00   98.02
04:50:46       3    0.76    0.00    0.55    0.00    0.00    0.58    0.00    0.00    0.00   98.10
04:50:46       4    0.76    0.00    0.55    0.00    0.00    0.55    0.00    0.00    0.00   98.14
04:50:46       5    0.76    0.00    0.55    0.00    0.00    0.54    0.00    0.00    0.00   98.15
04:50:46       6    0.74    0.00    0.55    0.00    0.00    0.52    0.00    0.00    0.00   98.19
04:50:46       7    0.48    0.00    0.47    0.00    0.00    0.08    0.00    0.00    0.00   98.96
04:50:46       8    0.40    0.00    0.44    0.00    0.00    0.10    0.00    0.00    0.00   99.05
04:50:46       9    0.39    0.00    0.44    0.00    0.00    0.08    0.00    0.00    0.00   99.09
04:50:46      10    0.70    0.00    0.52    0.00    0.00    0.51    0.00    0.00    0.00   98.26
04:50:46      11    0.73    0.00    0.53    0.00    0.00    0.51    0.00    0.00    0.00   98.23
04:50:46      12    0.74    0.00    0.54    0.00    0.00    0.53    0.00    0.00    0.00   98.19
04:50:46      13    0.73    0.00    0.54    0.00    0.00    0.52    0.00    0.00    0.00   98.21
04:50:46      14    0.73    0.00    0.54    0.00    0.00    0.52    0.00    0.00    0.00   98.20
04:50:46      15    0.73    0.00    0.54    0.00    0.00    0.52    0.00    0.00    0.00   98.20
04:50:46      16    0.73    0.00    0.55    0.00    0.00    0.53    0.00    0.00    0.00   98.19
04:50:46      17    0.72    0.00    0.55    0.00    0.00    0.54    0.00    0.00    0.00   98.18
04:50:46      18    0.45    0.00    0.47    0.00    0.00    0.08    0.00    0.00    0.00   99.00
04:50:46      19    0.78    0.00    0.53    0.00    0.00    0.50    0.00    0.00    0.00   98.20
midorinet commented 3 years ago

I found that other server clusters with the same configuration but lower traffic (20Mbps) are fine: software interrupts on those servers go up to around 8 and then return to normal. On the servers in my previous post, however, each server is handling about 150Mbps.

Might that be causing the issue?

pqarmitage commented 3 years ago

With a high si percentage shown by top or equivalent, I would have expected to see a high %soft value in the mpstat output, so I am unclear about what is happening.

When you have the high %si, it might be worth running cat /proc/interrupts periodically and seeing if any specific device is getting a high number of interrupts.
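A small sketch of that periodic check, assuming a Linux /proc/interrupts (the temp files and the one-second interval are just illustrative choices): take two snapshots, sum the per-CPU counters on each line, and print the per-line delta so any device generating an interrupt storm stands out:

```shell
# Snapshot /proc/interrupts twice, one second apart.
snap1=$(mktemp); snap2=$(mktemp)
cat /proc/interrupts > "$snap1"
sleep 1
cat /proc/interrupts > "$snap2"

# Sum the numeric per-CPU counters on each line and print the delta,
# largest first, so the busiest interrupt source shows at the top.
awk 'NR==FNR { s=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s+=$i; a[$1]=s; next }
     $1 in a { s=0; for (i=2; i<=NF; i++) if ($i ~ /^[0-9]+$/) s+=$i;
               printf "%-10s %d\n", $1, s-a[$1] }' "$snap1" "$snap2" |
  sort -k2,2 -rn | head
rm -f "$snap1" "$snap2"
```

Running this while the %si is high, and again when it is normal, should show whether a specific device's interrupt rate tracks the problem.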

Based on the information that higher traffic rates cause the higher %si after a VIP has been removed, I wonder whether, after keepalived stops on server02 and 192.168.100.15 is deleted from it, server02 is still receiving traffic addressed to 192.168.100.15 even though the address is now configured on server03 by keepalived (this could happen if ARP caches on sending devices are not updated for some reason). It could be that, once 192.168.100.15 is deleted from server02, traffic that arrives destined for that address somehow loops within server02, thereby causing high CPU utilisation. It is certainly the case that when you stop keepalived on a server, with your configuration it will take approximately 1 second before another server takes over as master.

In your post https://github.com/acassen/keepalived/issues/1492#issuecomment-713259725 point 6, you say adding 192.168.100.15/32 manually by ifconfig and arping on server02.

  • What happens to the %si on server02 when you add the address back and run arping?
  • What happens to the %si if you add the address but don't run arping?
  • What happens if you run iptables -I INPUT -d 192.168.100.15 -j DROP and then add the address and don't run arping?

If the %si reduces to normal with the above tests it might support my idea in the previous paragraph.

I mentioned above that, with your configuration, it takes about 1 second for a backup to take over as master after the master stops. Can you try the following to see if they have any effect on %si (they will all reduce the takeover time except point 4):

  1. Use priorities 251, 252, 253 and 254 rather than 111-114, 121-124, 131-134 and 141-144.
  2. Reduce the advert interval to 1
  3. Use VRRP version 3 instead of version 2 and reduce the advert interval to 0.1
  4. Use multicasting rather than unicasting, and use VMACs.
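Taken together, suggestions 1–3 might look something like the fragment below in keepalived.conf (the instance name, interface and addresses are illustrative, not from this issue; `vrrp_version 3` in `global_defs` is what allows the sub-second advert interval):

```
global_defs {
    vrrp_version 3          # suggestion 3: VRRPv3 allows sub-second adverts
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0          # illustrative interface name
    virtual_router_id 51
    priority 254            # suggestion 1: use priorities 251-254
    advert_int 0.1          # suggestion 3: 100ms advert interval
    virtual_ipaddress {
        192.168.100.15/32
    }
}
```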

Once we have the information from the above tests, we might have some more idea about what is happening.

midorinet commented 3 years ago

Based on the information that higher traffic rates cause the higher %si after a VIP has been removed, I wonder whether, after keepalived stops on server02 and 192.168.100.15 is deleted from it, server02 is still receiving traffic addressed to 192.168.100.15 even though the address is now configured on server03 by keepalived (this could happen if ARP caches on sending devices are not updated for some reason). It could be that, once 192.168.100.15 is deleted from server02, traffic that arrives destined for that address somehow loops within server02, thereby causing high CPU utilisation. It is certainly the case that when you stop keepalived on a server, with your configuration it will take approximately 1 second before another server takes over as master.

The only thing that's running is nginx in Docker on this server. I don't see any traffic in the access logs while the VIPs are removed

In your post #1492 (comment) point 6, you say adding 192.168.100.15/32 manually by ifconfig and arping on server02.

  • What happens to the %si on server02 when you add the address back and run arping?

%si is totally normal when adding the address back and running arping manually

  • What happens to the %si if you add the address but don't run arping

Nothing happens, since the ARP entry still points to server03

  • What happens if you run iptables -I INPUT -d 192.168.100.15 -j DROP and then add the address and don't run arping?

If the %si reduces to normal with the above tests it might support my idea in the previous paragraph.

I mentioned above that, with your configuration, it takes about 1 second for a backup to take over as master after the master stops. Can you try the following to see if they have any effect on %si (they will all reduce the takeover time except point 4):

  1. Use priorities 251, 252, 253 and 254 rather than 111-114, 121-124, 131-134 and 141-144.
  2. Reduce the advert interval to 1
  3. Use VRRP version 3 instead of version 2 and reduce the advert interval to 0.1

I've tested these. No change: the %si still goes up when the VIP is removed from the server. Find some logs below

Oct 27 00:33:07 cd70 Keepalived_vrrp[5291]: (cd69) Backup received priority 0 advertisement
Oct 27 00:33:07 cd70 Keepalived_vrrp[5291]: (cd69) Receive advertisement timeout
Oct 27 00:33:07 cd70 Keepalived_vrrp[5291]: (cd69) Entering MASTER STATE
Oct 27 00:33:07 cd70 Keepalived_vrrp[5291]: (cd69) using locally configured advertisement interval (100 milli-sec)

some tcpdump

01:03:32.836700 IP (tos 0xc0, ttl 255, id 18537, offset 0, flags [none], proto VRRP (112), length 32)
    192.168.1.68 > 192.168.1.70: vrrp 192.168.1.68 > 192.168.1.70: VRRPv3, Advertisement, vrid 51, prio 254, intvl 10cs, length 12, addrs: 192.168.100.14
01:03:32.836759 IP (tos 0xc0, ttl 255, id 18348, offset 0, flags [none], proto VRRP (112), length 32)
    192.168.1.71 > 192.168.1.70: vrrp 192.168.1.71 > 192.168.1.70: VRRPv3, Advertisement, vrid 54, prio 254, intvl 10cs, length 12, addrs: 192.168.100.17
01:03:33.337155 IP (tos 0xc0, ttl 255, id 18353, offset 0, flags [none], proto VRRP (112), length 32)
    192.168.1.71 > 192.168.1.70: vrrp 192.168.1.71 > 192.168.1.70: VRRPv3, Advertisement, vrid 54, prio 254, intvl 10cs, length 12, addrs: 192.168.100.17
  4. Use multicasting rather than unicasting, and use VMACs.

Once we have the information from the above tests, we might have some more idea about what is happening.

I got something from my grafana monitoring, please find below

Screen Shot 2020-10-27 at 10 31 47

After digging deeper, I found something weird here:

server01 VIP - 192.168.100.14
server02 VIP - 192.168.100.15
server03 VIP - 192.168.100.16
server04 VIP - 192.168.100.17

I tried to stop keepalived on server03 and made sure the VIP moved to server04. Then I ran tcpdump on server03: I can see traffic for all the VIPs on port 443, but nothing in the nginx access.log

    111.68.127.186.58144 > 192.168.100.14.443: Flags [.], cksum 0xf3da (correct), seq 201, ack 96284, win 2207, options [nop,nop,TS val 20626282 ecr 3503177718], length 0
    111.68.127.186.58144 > 192.168.100.14.443: Flags [.], cksum 0xe9b6 (correct), seq 201, ack 98880, win 2207, options [nop,nop,TS val 20626282 ecr 3503177718], length 0
    125.161.130.107.24966 > 192.168.100.15.443: Flags [.], cksum 0xc1c0 (correct), seq 0, ack 61862, win 1025, length 0
    114.5.214.147.20841 > 192.168.100.14.443: Flags [.], cksum 0xc9b5 (correct), seq 253, ack 24985, win 614, options [nop,nop,TS val 4196307 ecr 683417689], length 0
    36.72.215.237.5245 > 192.168.100.14.443: Flags [.], cksum 0x610f (correct), seq 0, ack 19457, win 40948, length 0
    36.81.68.241.45066 > 192.168.100.17.443: Flags [.], cksum 0xbae8 (correct), seq 161, ack 222601, win 1996, options [nop,nop,TS val 290242855 ecr 1658820379], length 0
    36.72.215.237.5245 > 192.168.100.14.443: Flags [.], cksum 0x4fd1 (correct), seq 0, ack 24577, win 40242, length 0
    36.69.13.47.26555 > 192.168.100.14.443: Flags [.], cksum 0x7d96 (correct), seq 0, ack 3883, win 1024, options [nop,nop,TS val 3710721887 ecr 3552651236], length 0
    103.10.169.29.51388 > 192.168.100.14.443: Flags [.], cksum 0x737e (correct), seq 158, ack 221921, win 450, length 0

Is this kind of thing normal?

pqarmitage commented 3 years ago

Can you please rerun the last test above running tcpdump with the --no-promiscuous-mode option and also the -e option. We need to make sure that tcpdump is not putting the interface into promiscuous mode. It would be helpful if you could also provide the MAC address of the interface in use on server03.

midorinet commented 3 years ago

Can you please rerun the last test above running tcpdump with the --no-promiscuous-mode option and also the -e option. We need to make sure that tcpdump is not putting the interface into promiscuous mode. It would be helpful if you could also provide the MAC address of the interface in use on server03.

When I ran with --no-promiscuous-mode -e, there was no output, just like before

the MAC address is 70:10:6f:c3:cf:3e

pqarmitage commented 3 years ago

Could you please now try running the tcpdump WITHOUT --no-promiscuous-mode but WITH -e immediately after keepalived is stopped on server03.

midorinet commented 3 years ago

Could you please now try running the tcpdump WITHOUT --no-promiscuous-mode but WITH -e immediately after keepalived is stopped on server03.

I ran that with keepalived stopped on server03:

08:38:36.825750 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x0, ttl 55, id 58946, offset 0, flags [DF], proto TCP (6), length 52)
    36.68.14.139.36892 > 192.168.100.14.443: Flags [.], cksum 0xc10a (correct), seq 0, ack 77023, win 1195, options [nop,nop,TS val 11148997 ecr 330741360], length 0
08:38:36.826135 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x88, ttl 55, id 20040, offset 0, flags [DF], proto TCP (6), length 52)
    202.67.40.199.15381 > 192.168.100.14.443: Flags [.], cksum 0x3a4c (correct), seq 227, ack 110641, win 8187, options [nop,nop,TS val 76474609 ecr 3436684347], length 0
08:38:36.826139 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x88, ttl 55, id 20038, offset 0, flags [DF], proto TCP (6), length 52)
    202.67.40.199.15381 > 192.168.100.14.443: Flags [.], cksum 0x3a4c (correct), seq 227, ack 110641, win 8187, options [nop,nop,TS val 76474609 ecr 3436684347], length 0
08:38:36.826140 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x88, ttl 55, id 20039, offset 0, flags [DF], proto TCP (6), length 52)
    202.67.40.199.15381 > 192.168.100.14.443: Flags [.], cksum 0x3a4c (correct), seq 227, ack 110641, win 8187, options [nop,nop,TS val 76474609 ecr 3436684347], length 0
08:38:36.826141 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x88, ttl 55, id 20041, offset 0, flags [DF], proto TCP (6), length 52)
    202.67.40.199.15381 > 192.168.100.14.443: Flags [.], cksum 0x3a4c (correct), seq 227, ack 110641, win 8187, options [nop,nop,TS val 76474609 ecr 3436684347], length 0
08:38:36.826183 88:5a:92:0b:72:bf > 70:10:6f:c3:cf:3a, ethertype 802.1Q (0x8100), length 70: vlan 165, p 0, ethertype IPv4, (tos 0x88, ttl 55, id 20042, offset 0, flags [DF], proto TCP (6), length 52)

I can see traffic to 192.168.100.14, but none for the other VIPs

pqarmitage commented 3 years ago

Was server03 master for VIP 192.168.100.14 before you stopped keepalived?

I think this is where your problem lies. Since keepalived is no longer running on server03, VIP 192.168.100.14 is no longer configured on server03. The question is: Why is the device (? router) with MAC address 88:5a:92:0b:72:bf still sending packets addressed to 192.168.100.14 to server03? It suggests that the ARP cache on 88:5a:92:0b:72:bf is not being updated when another server takes over as master and sends the gratuitous ARP messages for 192.168.100.14.

I think what you need to do is run tcpdump with the --no-promiscuous-mode -e options when you have the high %si value, to see if the server, once keepalived has stopped, continues to receive packets addressed to 192.168.100.1[4567]. As the %si gradually decreases, does the rate of receiving those packets also decrease, to the point that when the %si is back to normal the server has stopped receiving packets addressed to 192.168.100.1[4567]?

I think what may help would be to specify use_vmac for each vrrp instance. That means that the destination address for each VIP will not change when another vrrp instance takes over as master, and it also means that the VIP will be removed from the server when keepalived stops, and so that server will no longer receive the packets. Being able to do this with keepalived, both when using unicast and also when the VIPs are configured on a different interface from the interface of the VRRP instance itself is a very new feature. You will need to use the source code up to at least commit b51c9ad (commit a68e8a8 would be better) and build your own version of keepalived.

That still leaves the issue of why device 88:5a:92:0b:72:bf keeps sending the packets to server03 after another server has taken over the VIP and presumably has sent gratuitous ARP messages for the VIP, but that really is an issue to do with your network devices.

By the way, there is no global_defs keyword lvs_id. keepalived will be logging this error every time it starts; it would be worth looking at the system logs regularly to see if error messages are being logged, especially for keepalived.
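A minimal per-instance sketch of the use_vmac suggestion (the instance name and interface are illustrative assumptions; as noted above, full support for this with unicast needs a sufficiently recent build):

```
vrrp_instance VI_web {
    interface eth0           # illustrative interface name
    virtual_router_id 52
    priority 254
    use_vmac vrrp.52         # macvlan using the VRRP virtual MAC 00:00:5e:00:01:34 (VRID 52)
    virtual_ipaddress {
        192.168.100.15/32
    }
}
```

Because the virtual MAC is derived from the VRID, it stays the same whichever server is master, so sending devices' ARP caches never need to change on failover.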

midorinet commented 3 years ago

Hi

However, 88:5a:92:0b:72:bf is the MAC address of server01.

So I assume it's between keepalived instances

midorinet commented 3 years ago

Was server03 master for VIP 192.168.100.14 before you stopped keepalived?

No, server03 was master for 192.168.100.16

I think this is where your problem lies. Since keepalived is no longer running on server03, VIP 192.168.100.14 is no longer configured on server03. The question is: Why is the device (? router) with MAC address 88:5a:92:0b:72:bf still sending packets addressed to 192.168.100.14 to server03? It suggests that the ARP cache on 88:5a:92:0b:72:bf is not being updated when another server takes over as master and sends the gratuitous ARP messages for 192.168.100.14.

That mac address belongs to server01

I think what you need to do is run tcpdump with the --no-promiscuous-mode -e options when you have the high %si value to see if the server, once keepalived has stopped continues to receive packets addressed to 192.168.100.1[4567]. As the %si gradually decreases does the rate of receiving those packets also decrease to the point that when the %si is back to normal the server has stopped receiving packets addressed to 192.168.100.1[4567]?

Will check and update you on this

I think what may help would be to specify use_vmac for each vrrp instance. That means that the destination address for each VIP will not change when another vrrp instance takes over as master, and it also means that the VIP will be removed from the server when keepalived stops, and so that server will no longer receive the packets. Being able to do this with keepalived, both when using unicast and also when the VIPs are configured on a different interface from the interface of the VRRP instance itself is a very new feature. You will need to use the source code up to at least commit b51c9ad (commit a68e8a8 would be better) and build your own version of keepalived.

That still leaves the issue of why device 88:5a:92:0b:72:bf keeps sending the packets to server03 after another server has taken over the VIP and presumably has sent gratuitous ARP messages for the VIP, but that really is an issue to do with your network devices.

From the tcpdump, as previously mentioned, server01 is sending the packets to server03, not a router

By the way, there is no global_defs keyword lvs_id. keepalived will be logging this error every time it starts; it would be worth looking at the system logs regularly to see if error messages are being logged, especially for keepalived.

I have had lvs_id removed from the config for the last 8 days. Well noted about this.

Please kindly advise what may have caused this

pqarmitage commented 3 years ago

@midorinet I think what you need to do now is work out what devices are sending what packets to what destinations (both MAC address and IP address) both while keepalived is running on all 4 servers, and again once you have stopped keepalived on a server. Any packets addressed to a VIP IP address arriving on a server once keepalived has stopped are clearly being sent to the wrong destination. You will then need to look at the sender of those packets to see why they are being sent to the wrong destination (ip neighbour show might help).
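As a sketch of that check on a sending host (the VIP, the expected MAC and the file path are illustrative assumptions), you could compare a VIP's cached neighbour entry against the MAC of the server that should currently be master:

```shell
# MAC that should currently own the VIP (illustrative value).
expected=88:5a:92:0b:72:bf

# Capture the neighbour table; ignore failure so the script also runs
# where `ip` is unavailable (the file is then simply empty).
ip neighbour show > /tmp/neigh.txt 2>/dev/null || true

# Flag the VIP entry as OK or STALE depending on its cached lladdr.
# Expected line format: "192.168.100.14 dev eth0 lladdr 88:5a:... REACHABLE"
awk -v vip=192.168.100.14 -v want="$expected" '
    $1 == vip { mac = "";
                for (i = 1; i < NF; i++) if ($i == "lladdr") mac = $(i+1);
                print (mac == want ? "OK" : "STALE"), $0 }' /tmp/neigh.txt
```

Running this on each device that sends traffic to the VIPs, before and after a failover, would show which ARP caches are not being updated by the gratuitous ARPs.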

You might also want to check that when a server becomes master for a VRRP instance it is sending the appropriate gratuitous ARP messages, and that those messages are being received by all the appropriate destinations and their ARP caches are being updated.

This problem seems to be way beyond a keepalived issue now, and seems to relate to how your network is performing.

pqarmitage commented 3 years ago

@midorinet I am closing this now since your issue does not appear to be a keepalived issue. If you find more keepalived problems relating to this please update this issue and we can reopen it.