
Keepalived
https://www.keepalived.org
GNU General Public License v2.0

keepalived recv-q full and multiple masters in the cluster #1080

Closed vadapalliravikumar closed 5 years ago

vadapalliravikumar commented 5 years ago

Sometimes, the recv-q of a keepalived raw socket becomes full. When this happens, if that keepalived is the master, it stops sending VRRP hellos (as seen in tcpdump). This causes keepalived on another node to become master, leaving the cluster with two masters at the same time. The issue is seen with both 2.0.8 and 2.0.10, but not consistently.
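For reference, the two-master window follows from the VRRPv3 timing rules: a backup only promotes itself after the master-down interval passes with no adverts. A quick sketch of that computation per RFC 5798, assuming the default priority of 100 (the configs below do not set one explicitly):

```python
# Master-down interval per RFC 5798 (VRRPv3): how long a backup waits
# without seeing adverts before promoting itself to master.
def master_down_interval_cs(adver_interval_cs: int, priority: int) -> float:
    # Skew_Time = ((256 - priority) * Master_Adver_Interval) / 256
    skew_cs = (256 - priority) * adver_interval_cs / 256
    return 3 * adver_interval_cs + skew_cs

# advert_int 1 second = 100 centiseconds, assumed priority 100:
print(master_down_interval_cs(100, 100) / 100)  # prints 3.609375 (seconds)
```

So once a master's process stops draining its socket and stops sending hellos, a backup declares itself master roughly 3.6 seconds later.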

keepalived version: 2.0.8 & 2.0.10

Environment: Keepalived inside docker containers managed by kubernetes

$ netstat -nap | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           616/keepalived  
raw   213312      0 0.0.0.0:112             0.0.0.0:*               7           616/keepalived  
$ /usr/sbin/keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.4.162
Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018

configure options: --prefix /usr --disable-dynamic-linking

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS RTA_VIA FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL LWTUNNK

Config from one node in the 3 node cluster

# cat /etc/keepalived/keepalived.conf
global_defs {
  vrrp_version 3
  vrrp_iptables KEEPALIVED-VIP
  enable_script_security
  script_user keepalived_script
}

vrrp_script node_health_check {
  script       "/node_health_check.py"
  interval 60  # check every 60 seconds
  timeout 40   # Script Timeout of 40 seconds
  fall 3       # require 3 failures for FAULT Transition
}

vrrp_instance vip_10.64.89.185 {
  state BACKUP
  interface ens192
  virtual_router_id 151
  nopreempt
  advert_int 1

  track_interface {
    ens160
  }

  virtual_ipaddress {
    10.64.89.185 dev ens160
  }

  unicast_src_ip 1.1.1.182
  unicast_peer {
    1.1.1.183
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

vrrp_instance vip_1.1.1.185 {
  state BACKUP
  interface ens192
  virtual_router_id 150
  nopreempt
  advert_int 1

  track_interface {
    ens192
  }

  virtual_ipaddress {
    1.1.1.185 dev ens192
  }

  unicast_src_ip 1.1.1.182
  unicast_peer {
    1.1.1.183
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

Keepalived command with arguments

/usr/sbin/keepalived --vrrp --dont-fork --log-console --log-detail --release-vips --pid /etc/keepalived/keepalived.pid
vadapalliravikumar commented 5 years ago

A workaround would unblock us for the moment. Please let me know if one is available.

pqarmitage commented 5 years ago

I think the cause and effect are the other way around from what you suggest. It looks to me as though this is the socket used for receiving (we have previously had problems with the receive queue filling up on the socket used for sending, but the receive buffer size on the send socket is now set much lower than the default 213312).

What I think is happening is that the keepalived process has, for some reason, stopped processing received packets, and this causes the receive queue to fill up.
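As an aside (an illustrative sketch, not keepalived code): any datagram socket whose owner stops calling recv() behaves this way, with the kernel queuing packets until the receive buffer fills. For example, on Linux:

```python
import fcntl
import select
import socket
import struct
import termios

# A UDP socket that is never read, mimicking a process that has stopped
# servicing its socket: the kernel queues the arriving datagrams.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(5):
    tx.sendto(b"advert", rx.getsockname())

# Wait until the kernel has delivered the datagrams to rx.
select.select([rx], [], [], 1.0)

# FIONREAD (SIOCINQ) reports data pending in the receive queue; it only
# drains when the application calls recv(), just like keepalived's recv-q.
pending = struct.unpack("i", fcntl.ioctl(rx, termios.FIONREAD, b"\0" * 4))[0]
print(pending > 0)  # prints True: data accumulates while unread
```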

Could you please post the logs of keepalived (preferably run with the -d option) so that we can see if it is reporting any problem at the time the receive queue starts filling up?

It would also be useful to have a look on the system where the receive queue has filled up with tcpdump or wireshark to see if it is sending any packets.

vadapalliravikumar commented 5 years ago

keepalived config of the node from which the logs and tcpdump were captured (note that this node is different from the node whose config is shown in the first comment in this issue):

global_defs {
  vrrp_version 3
  vrrp_iptables KEEPALIVED-VIP
  enable_script_security
  script_user keepalived_script
}

vrrp_script node_health_check {
  script       "/node_health_check.py"
  interval 60  # check every 60 seconds
  timeout 40   # Script Timeout of 40 seconds
  fall 3       # require 3 failures for FAULT Transition
}

vrrp_instance vip_10.64.89.185 {
  state BACKUP
  interface ens192
  virtual_router_id 151
  nopreempt
  advert_int 1

  track_interface {
    ens160
  }

  virtual_ipaddress {
    10.64.89.185 dev ens160
  }

  unicast_src_ip 1.1.1.183
  unicast_peer {
    1.1.1.182
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

vrrp_instance vip_1.1.1.185 {
  state BACKUP
  interface ens192
  virtual_router_id 150
  nopreempt
  advert_int 1

  track_interface {
    ens192
  }

  virtual_ipaddress {
    1.1.1.185 dev ens192
  }

  unicast_src_ip 1.1.1.183
  unicast_peer {
    1.1.1.182
    1.1.1.184
    }

  track_script {
    node_health_check
  }
}

keepalived logs (without the -d option). I will try to capture the logs with -d as well.

Thu Dec  6 03:22:47 2018: Starting Keepalived v2.0.10 (11/12,2018)
Thu Dec  6 03:22:47 2018: Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018 (built for Linux 4.4.162)
Thu Dec  6 03:22:47 2018: Command line: '/usr/sbin/keepalived' '--vrrp' '--dont-fork' '--log-console' '--log-detail'
Thu Dec  6 03:22:47 2018:               '--release-vips' '--pid' '/etc/keepalived/keepalived.pid' '&'
Thu Dec  6 03:22:47 2018: Opening file '/etc/keepalived/keepalived.conf'.
Thu Dec  6 03:22:47 2018: Starting VRRP child process, pid=3923
Thu Dec  6 03:22:47 2018: Registering Kernel netlink reflector
Thu Dec  6 03:22:47 2018: Registering Kernel netlink command channel
Thu Dec  6 03:22:47 2018: Opening file '/etc/keepalived/keepalived.conf'.
Thu Dec  6 03:22:47 2018: Assigned address 1.1.1.183 for interface ens192
Thu Dec  6 03:22:47 2018: Assigned address fe80::250:56ff:fe8e:1b67 for interface ens192
Thu Dec  6 03:22:47 2018: (vip_1.1.1.185) Ignoring track_interface ens192 since own interface
Thu Dec  6 03:22:47 2018: Registering gratuitous ARP shared channel
Thu Dec  6 03:22:47 2018: (vip_10.64.89.185) removing VIPs.
Thu Dec  6 03:22:47 2018: (vip_1.1.1.185) removing VIPs.
Thu Dec  6 03:22:47 2018: (vip_10.64.89.185) removing VIPs.
Thu Dec  6 03:22:47 2018: (vip_1.1.1.185) removing VIPs.
Thu Dec  6 03:22:47 2018: VRRP sockpool: [ifindex(3), family(IPv4), proto(112), unicast(1), fd(9,10)]
Thu Dec  6 03:22:51 2018: VRRP_Script(node_health_check) succeeded
Thu Dec  6 03:22:51 2018: (vip_10.64.89.185) Entering BACKUP STATE
Thu Dec  6 03:22:51 2018: (vip_1.1.1.185) Entering BACKUP STATE

keepalived version output

root@maglev-master-1:/# /usr/sbin/keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.4.162
Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018

configure options: --prefix /usr --disable-dynamic-linking

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS RTA_VIA FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL LWTUNNK
root@maglev-master-1:/# 

tcpdump shows 0 packets

$ sudo netstat -anp | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           21700/keepalived
raw   213312      0 0.0.0.0:112             0.0.0.0:*               7           21700/keepalived

[Fri Dec 07 19:05:41 UTC] maglev@10.64.89.183 (maglev-master-1) ~
$ 

[Fri Dec 07 19:05:42 UTC] maglev@10.64.89.183 (maglev-master-1) ~
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

[Fri Dec 07 19:06:03 UTC] maglev@10.64.89.183 (maglev-master-1) ~
$ 
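One way to watch the queue without netstat is to read /proc/net/raw directly; the rx_queue field (the hex value after the colon in the tx_queue:rx_queue column) is the number netstat reports. A small sketch, assuming the standard Linux /proc/net/raw layout:

```python
# Parse /proc/net/raw text and report the receive-queue depth of each raw
# socket bound to protocol 112 (VRRP). Data-line layout on Linux:
#   sl local_address rem_address st tx_queue:rx_queue ...
def vrrp_recv_queues(text: str) -> list[int]:
    queues = []
    for line in text.splitlines()[1:]:           # skip the header row
        fields = line.split()
        if not fields:
            continue
        local = fields[1]                        # "ADDR:PORT" in hex; for raw
        proto = int(local.split(":")[1], 16)     # sockets the "port" slot
        if proto != 112:                         # holds the IP protocol
            continue
        rx_queue = int(fields[4].split(":")[1], 16)
        queues.append(rx_queue)
    return queues

sample = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode\n"
    "   7: 00000000:0070 00000000:0000 07 00000000:00034140 00:00000000 00000000     0        0 12345\n"
)
print(vrrp_recv_queues(sample))  # [213312] -- 0x34140 bytes queued
```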
pqarmitage commented 5 years ago

The configuration works for me. It appears as though, for some reason, keepalived has stopped processing anything, since after entering the backup state it neither times out to become master nor receives any VRRP packets.

If you could try a couple of things it would be helpful:

  1. Can you attach to the keepalived vrrp process with gdb, generate a backtrace, and post it here?
  2. Are you able to try building keepalived on a system with 4.13 kernel headers, so that keepalived matches the kernel it is running on?
vadapalliravikumar commented 5 years ago

gdb, bt details:

root@maglev-master-1:/# ps -aefww
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Dec06 ?        00:00:00 /pause
root      3361     0  0 Dec06 ?        00:23:43 /usr/bin/python /keepalived.py
root      6321  3361  0 Dec08 ?        00:00:00 /usr/sbin/keepalived --vrrp --dont-fork --log-console --log-detail --release-vips --pid /etc/keepalived/keepalived.pid &
root      6322  6321 19 Dec08 ?        15:33:54 /usr/sbin/keepalived --vrrp --dont-fork --log-console --log-detail --release-vips --pid /etc/keepalived/keepalived.pid &
keepali+  9347  6322  0 Dec08 ?        00:00:00 [node_health_che] <defunct>
root      9374     0  0 18:42 ?        00:00:00 bash
root      9920  9374  0 18:57 ?        00:00:00 ps -aefww

gdb attach to process 6321

root@maglev-master-1:/# gdb --pid=6321
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6321
Reading symbols from /usr/sbin/keepalived...done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnl-genl-3.so.200...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnl-3.so.200...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug/.build-id/ce/17e023542265fc11d9bc8f534bb4f070493d30.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libm-2.23.so...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_compat.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_compat-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnsl.so.1...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnsl-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_nis.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_nis-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_files-2.23.so...done.
done.
0x00007f4c6be509b3 in epoll_ctl () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) 
(gdb) bt
#0  0x00007f4c6be509b3 in epoll_ctl () at ../sysdeps/unix/syscall-template.S:84
#1  0x00005611351f6ab9 in thread_fetch_next_queue (m=0x561135e6c750)
    at scheduler.c:1574
#2  process_threads (m=0x561135e6c750) at scheduler.c:1691
#3  0x00005611351f6f31 in launch_thread_scheduler (m=<optimized out>)
    at scheduler.c:1815
#4  0x00005611351b928b in keepalived_main (argc=9, argv=0x7ffc2d6ff748)
    at main.c:1897
#5  0x00007f4c6bd69830 in __libc_start_main (main=0x5611351b73d0 <main>, 
    argc=9, argv=0x7ffc2d6ff748, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffc2d6ff738)
    at ../csu/libc-start.c:291
#6  0x00005611351b7409 in _start ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 6321] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/keepalived, process 6321

gdb attach to process 6322

root@maglev-master-1:/# gdb --pid=6322
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6322
Reading symbols from /usr/sbin/keepalived...done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnl-genl-3.so.200...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnl-3.so.200...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug/.build-id/ce/17e023542265fc11d9bc8f534bb4f070493d30.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libm-2.23.so...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_compat.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_compat-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnsl.so.1...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnsl-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_nis.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_nis-2.23.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_files-2.23.so...done.
done.
0x00005611351f64de in thread_add_child (m=m@entry=0x561135e6c750, 
    func=func@entry=0x5611351def40 <vrrp_script_child_thread>, 
    arg=arg@entry=0x561135e6d760, pid=9347, timer=timer@entry=40000000)
    at scheduler.c:1250
1250    scheduler.c: No such file or directory.
(gdb) 
(gdb) bt
#0  0x00005611351f64de in thread_add_child (m=m@entry=0x561135e6c750, 
    func=func@entry=0x5611351def40 <vrrp_script_child_thread>, 
    arg=arg@entry=0x561135e6d760, pid=9347, timer=timer@entry=40000000)
    at scheduler.c:1250
#1  0x00005611351f3045 in system_call_script (m=0x561135e6c750, 
    func=func@entry=0x5611351def40 <vrrp_script_child_thread>, 
    arg=arg@entry=0x561135e6d760, timer=40000000, script=0x561135e6d768)
    at notify.c:400
#2  0x00005611351def20 in vrrp_script_thread (thread=0x561135e6cf00)
    at vrrp_scheduler.c:880
#3  0x00005611351f68c1 in thread_call (thread=0x561135e6cf00)
    at scheduler.c:1677
#4  process_threads (m=0x561135e6c750) at scheduler.c:1720
#5  0x00005611351f6f31 in launch_thread_scheduler (m=<optimized out>)
    at scheduler.c:1815
#6  0x00005611351d1bab in start_vrrp_child () at vrrp_daemon.c:1017
#7  0x00005611351b93f5 in start_keepalived () at main.c:434
#8  keepalived_main (argc=9, argv=0x7ffc2d6ff748) at main.c:1887
#9  0x00007f4c6bd69830 in __libc_start_main (main=0x5611351b73d0 <main>, 
    argc=9, argv=0x7ffc2d6ff748, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffc2d6ff738)
    at ../csu/libc-start.c:291
#10 0x00005611351b7409 in _start ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 6322] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/keepalived, process 6322
vadapalliravikumar commented 5 years ago

@pqarmitage, can you share the steps to build keepalived against a given kernel version? I tried a few things, but was not successful in building with the right kernel headers.

pqarmitage commented 5 years ago

@vadapalliravikumar If you build keepalived on, for example, maglev-master-1, then it should use the kernel headers for the version of Linux that it is running, namely Linux 4.13. The INSTALL file in the root of the source tree identifies the packages that need to be installed in order to build keepalived.

If you can't build keepalived on the target system, then the following should work (although some of the commands may need a bit of tweaking):

  1. In the source tree, the snap/snapcraft.yaml file shows you where you can download the various kernel header .deb packages from. The versions in the snap/snapcraft.yaml file may be out of date, in which case have a look in http://security.ubuntu.com/ubuntu/pool/main/l/linux/ for linux-libc files.
  2. After downloading the package, execute ar -x PACKAGE_DEB_FILE_NAME
  3. Run tar -xf data.tar.xz, which will extract the files into ./usr/include
  4. Change to the keepalived source tree
  5. Run configure --with-kernel-dir=/PATH_TO_EXTRACTED_FILES/usr (the path name must be a fully specified path, without any ~ or other special characters).
  6. make

I've just tried this, and it has worked for me.

vadapalliravikumar commented 5 years ago

Thank you @pqarmitage for sharing the steps to build with custom kernel headers. I built with the closest version available in http://security.ubuntu.com/ubuntu/pool/main/l/linux/. The issue is seen with this keepalived build also:


root@maglev-master-1:/# /usr/sbin/keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.13.16
Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018

configure options: --with-kernel-dir=/tmp/usr --prefix /usr --disable-dynamic-linking

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTA_VIA FRA_OIFNAME RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RT SCHED_RESET_ON_FORK
root@maglev-master-1:/#
pqarmitage commented 5 years ago

@vadapalliravikumar The kernel version that uname gives on Ubuntu is confusing: 4.13.0 really means 4.13.X, so it is almost certain that the kernel you are actually running is 4.13.16 or later; otherwise keepalived would produce a warning saying that it was built for a later kernel than it is running on.

I had been trying to avoid the following approach, but I can see no way around it. You will need to build keepalived with debugging enabled, and then will need to turn it on when running keepalived.

Can you build keepalived with the following additional configure options: --enable-epoll-debug --enable-epoll-thread-dump --enable-vrrp-fd-debug --enable-log-file --enable-asserts

Then run keepalived (on the system where the recv-q fills up) with the following additional command line options: -g/var/tmp/keepalived.log -G -D --debug=DvEv

-g/var/tmp/keepalived.log will make keepalived write log entries to /var/tmp/keepalived*.log.
-G will stop keepalived writing log entries to syslog.
-D will log extra detail.
--debug=DvEv turns on epoll debug and epoll thread dump debug for the VRRP child process.

After 10 seconds or so (check that the recv-q is building up), stop keepalived. The log file we are interested in is /var/tmp/keepalived_vrrp.log. This will show all the threads that are queued, and which threads are being processed. The file may well be quite large. You will need to attach that log file to this issue so that we can see what is (or is not) happening.

If you want to interpret the log yourself, look at the code in lib/scheduler.c around the call to epoll_wait().
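For readers who want the shape of that loop without reading the C, here is a minimal sketch of the epoll dispatch pattern (keepalived's real scheduler in lib/scheduler.c is far more involved; none of the names below come from keepalived):

```python
import os
import select

# Minimal epoll dispatch loop of the kind keepalived's scheduler uses:
# wait for readable fds, then run the handler for each event. If the loop
# blocks elsewhere, registered fds are never drained and their receive
# queues grow -- which is what the recv-q symptom suggests.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

os.write(w, b"advert")                 # simulate an arriving packet

events = ep.poll(timeout=1.0)          # analogous to epoll_wait()
for fd, mask in events:
    if fd == r and mask & select.EPOLLIN:
        data = os.read(r, 4096)        # drain the queue, as a read
        print(data)                    # handler would; prints b'advert'
ep.close()
```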

vadapalliravikumar commented 5 years ago

@pqarmitage, I built keepalived with the additional configure options and ran it with the additional command line options. It looks like --enable-asserts does not exist in 2.0.10; the build ignored this flag.

keepalived -v

root@71a415b79b60:/# keepalived -v
Keepalived v2.0.10 (11/12,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.13.16
Running on Linux 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5 16:43:10 UTC 2018

configure options: --with-kernel-dir=/tmp/usr --prefix /usr --disable-dynamic-linking --enable-epoll-debug --enable-epoll-thread-dump --enable-vrrp-fd-debug --enable-log-file --enable-asserts

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING EPOLL_DEBUG EPOLL_THREAD_DUMP VRRP_FD_DEBUG FILE_LOGGING LOG_FILE_APPEND

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTA_VIA FRA_OIFNA
ME RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE V
RF SO_MARK SCHED_RT SCHED_RESET_ON_FORK

/var/tmp/keepalived.log

Mon Dec 17 21:30:44 2018: Command line: '/usr/sbin/keepalived' '--vrrp' '--dont-fork' '--log-console' '--log-detail'
Mon Dec 17 21:30:44 2018:               '--release-vips' '--pid' '/etc/keepalived/keepalived.pid' '-g/var/tmp/keepalived.log'
Mon Dec 17 21:30:44 2018:               '-G' '-D' '--debug=DvEv' '&'
Mon Dec 17 21:30:44 2018: Opening file '/etc/keepalived/keepalived.conf'.

netstat output: note that the packets started accumulating in the recv-q sometime between 21:35:46 and 21:36:18

[Mon Dec 17 21:35:46 UTC] maglev@10.64.89.183 (maglev-master-1) ~
$ sudo netstat -anp | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           14426/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           14426/keepalived

[Mon Dec 17 21:36:18 UTC] maglev@10.64.89.183 (maglev-master-1) ~
$ sudo netstat -anp | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           14426/keepalived
raw    18304      0 0.0.0.0:112             0.0.0.0:*               7           14426/keepalived

Attached is the /var/tmp/keepalived_vrrp.log file: keepalived_vrrp.log

pqarmitage commented 5 years ago

@vadapalliravikumar That's really helpful (I must have added --enable-asserts since 2.0.10).

The keepalived process on maglev-master-1 is successfully receiving and processing VRRP packets for both VRRP instances, so it must be some other packets that are being queued to the socket.

Would it be possible to run tcpdump on ens192 for 10 or 20 seconds after keepalived starts, saving the data to a file, and posting the saved file here. Seeing what other traffic there is on the interface may give me a clue as to what packets are causing the problem. Are there any other vrrp instances running on the network that ens192 is connected to?

vadapalliravikumar commented 5 years ago

@pqarmitage, ens192 carries a lot of traffic. Is it sufficient to capture just the VRRP packets (IP protocol number 112)? If not, can you please suggest a better filter?

vadapalliravikumar commented 5 years ago

@pqarmitage, I had a startling observation just now. The same issue is seen in my setup with keepalived v1.2.24 also. We have been using v1.2.24 for a long time and never hit this issue earlier, which makes me wonder whether the issue is independent of the keepalived version.

Also, the queue is filling up at a much faster rate than the rate at which VRRP packets are coming in.

This whole comment is with v1.2.24

root@maglev-master-1:/# keepalived -v
Keepalived v1.2.24 (08/06,2018)

Copyright(C) 2001-2018 Alexandre Cassen, <acassen@gmail.com>

Build options:  PIPE2 LIBNL3 RTA_ENCAP RTA_NEWDST RTA_PREF RTA_VIA FRA_OIFNAME FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK IPV4_DEVCONF LIBIPTC LIBIPSET LVS LIBIPVS_NETLINK IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP VRRP_AUTH VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC FIB_ROUTING INET6_ADDR_GEN_MODE SNMP_V3_FOR_V2 SNMP SNMP_KEEPALIVED SNMP_CHECKER SNMP_RFC SNMP_RFCV2 SNMP_RFCV3 SO_MARK
root@maglev-master-1:/# exit
exit

[Tue Dec 18 00:40:42 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo netstat -nalp | grep keepalived
raw    33792      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived

[Tue Dec 18 00:40:44 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ 

[Tue Dec 18 00:40:46 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:40:50.158046 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:50.159134 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:51.159088 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:51.159271 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:52.160206 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:52.161330 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:53.161347 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:53.161459 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:54.162381 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:54.163463 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:55.163459 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:55.163603 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:56.164479 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:56.165579 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:57.165548 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:57.165686 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:58.166418 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:58.167487 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:40:59.167454 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:40:59.167591 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:00.168549 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:00.169681 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:01.169588 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:01.169785 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
^C
24 packets captured
35 packets received by filter
2 packets dropped by kernel

[Tue Dec 18 00:41:02 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo netstat -nalp | grep keepalived
raw    60544      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived

[Tue Dec 18 00:41:03 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:41:07.175900 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:07.176134 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:08.177017 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:08.179050 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:09.178067 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:09.178285 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:10.179540 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:10.180409 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:11.180383 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:11.180501 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:12.181443 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:12.182550 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:13.182435 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:13.182661 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
^C
14 packets captured
16 packets received by filter
0 packets dropped by kernel

[Tue Dec 18 00:41:14 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo netstat -nalp | grep keepalived
raw    77440      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived

[Tue Dec 18 00:41:15 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:41:18.187498 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:18.188571 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:19.188434 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:19.188703 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:20.189425 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:20.190537 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:21.190357 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:21.190633 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:22.191308 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:22.192418 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:23.192176 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:23.192538 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:24.193196 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:24.194394 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
^C
14 packets captured
14 packets received by filter
0 packets dropped by kernel

[Tue Dec 18 00:41:25 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo netstat -nalp | grep keepalived
raw    92928      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived

[Tue Dec 18 00:41:26 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:41:28.197130 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:28.198251 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:29.198149 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:29.198341 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:30.199151 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:30.200281 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:31.200235 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:31.200408 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:32.201262 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:32.202337 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:33.201919 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:33.202465 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:34.202939 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:34.204148 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:35.203961 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:35.204259 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:36.204936 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:36.206008 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:37.205877 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:37.206081 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:38.206901 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:38.207967 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
^C
22 packets captured
28 packets received by filter
0 packets dropped by kernel

[Tue Dec 18 00:41:38 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo netstat -nalp | grep keepalived
raw   111232      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           12420/keepalived

[Tue Dec 18 00:41:39 UTC] maglev@10.64.89.183 (maglev-master-1) kubernetes
$ sudo tcpdump -i any -ln 'proto 112'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
00:41:43.211956 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:43.212192 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:44.213368 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:44.214283 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:45.214048 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:45.214392 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:46.215055 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:46.216173 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:47.216166 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:47.216278 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
00:41:48.217158 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 151, prio 100, intvl 100cs, length 12
00:41:48.218255 IP 1.1.1.184 > 1.1.1.183: VRRPv3, Advertisement, vrid 150, prio 100, intvl 100cs, length 12
^C
12 packets captured
12 packets received by filter
0 packets dropped by kernel
vadapalliravikumar commented 5 years ago

@pqarmitage , Attached is the full packet capture from ens192 when the queue was getting filled up. The capture is for around 7 seconds. tcpdump.log

1.1.1.183 is this node's IP. 1.1.1.182 & 1.1.1.184 are the other 2 nodes in the same cluster.

pqarmitage commented 5 years ago

I think the issue with v1.2.24 is different. With v1.2.24 it is the first of the two sockets that has its recv-q fill up (the socket used for sending), whereas with v2.0.x it is the second socket (the socket used for receiving). A few months ago I added a commit that used BPF to stop packets being queued on the sending socket.

What we need to discover now is what is being queued to the receive socket but is not being passed to the keepalived process when it does a receive. There is nothing obvious in the tcpdump; all the non VRRP traffic appears to be in an IP-IP tunnel, or TCP traffic. Do you have any ideas how to see what is getting held in the recv-q?

vadapalliravikumar commented 5 years ago

@pqarmitage , sorry, I could not find a way to see what is being held in the recv-q.

One more observation: keepalived_vrrp.log has not shown any packets for the last 30 mins, presumably since the recv-q filled up.


$ docker exec -it a154e1f03a4f bash
root@maglev-master-1:/# 
root@maglev-master-1:/# tail -f /var/tmp/keepalived_vrrp.log 
Tue Dec 18 20:42:47 2018: ----[ Begin rb_dump io_events ]----
Tue Dec 18 20:42:47 2018: #01 event 0x564523689760 fd 5, flags: 0xd, read 0x56452368b450, write (nil)
Tue Dec 18 20:42:47 2018: #02 event 0x564523689c90 fd 6, flags: 0xd, read 0x56452368a5c0, write (nil)
Tue Dec 18 20:42:47 2018: #03 event 0x564523689d50 fd 7, flags: 0xd, read 0x56452368a380, write (nil)
Tue Dec 18 20:42:47 2018: #04 event 0x56452368b4d0 fd 10, flags: 0xd, read 0x56452368a9a0, write (nil)
Tue Dec 18 20:42:47 2018: ----[ End rb_dump ]----
Tue Dec 18 20:42:47 2018: calling epoll_wait
Tue Dec 18 20:42:47 2018: epoll_wait returned 1 fds
Tue Dec 18 20:42:47 2018: Calling thread function thread_timerfd_handler(), type READY_FD, val/fd/pid 5, status 15 id 5341
Tue Dec 18 20:42:47 2018: Calling thread function vrrp_script_thread(), type READY, val/fd/pid 10, status 0 id 5342

^C
root@maglev-master-1:/# exit
exit

[Tue Dec 18 21:15:26 UTC] maglev@10.64.89.182 (maglev-master-1) ~
$ sudo netstat -nalp | grep keepalived
raw        0      0 0.0.0.0:112             0.0.0.0:*               7           15873/keepalived
raw   213312      0 0.0.0.0:112             0.0.0.0:*               7           15873/keepalived
vadapalliravikumar commented 5 years ago

@pqarmitage , based on the findings so far, what could be happening? Is the kernel incorrectly holding packets in the queue, or is keepalived not doing a receive on the socket descriptor?

vadapalliravikumar commented 5 years ago

Also, I see that the track_script is not being invoked by keepalived (after the queue is filled up).

pqarmitage commented 5 years ago

@vadapalliravikumar - apologies for the delay in responding.

I think what is happening is that two different types of packet are being queued to the receive socket. One type, which we know are vrrp adverts, are being passed to keepalived and correctly processed. The other type, and at the moment we don't know what these are, are being queued to the receive socket, but for some reason are not being passed to keepalived when it does a recvmsg(), and so they build up in the socket receive queue. When the queue is full, no more messages can be queued to it, keepalived no longer receives adverts, and so it transitions to master state.

You mention that track_script is not being invoked by keepalived after the queue is filled up. Could you please attach a copy of the logs (with --debug=DvEv) after the receive queue has filled up so that I can try and see what is happening.

Looking at the man page of recvmsg, it appears that there are flags that can be set on the call to recvmsg() to receive queued errors, and to receive out-of-band data. I think I will need to try and produce a patch for you to try, to see if it is either of those that are causing the receive queue to build up.

pqarmitage commented 5 years ago

@vadapalliravikumar Attached is a patch to apply to the current head of the master branch (commit 44c4dda). It changes the vrrp process from using recvfrom() to using recvmsg(), which provides more information about what is happening. It also means that keepalived can receive messages from the socket error queue, and it writes information to the log about what it receives.

I have only been able to test it in a limited sense, since I don't see socket receive queues filling up. If, however, I set IP_RECVERR/IPV6_RECVERR socket options, I can make the receive queue fill up using the current version of the code. Applying the attached patch stops the receive queue filling up, and it also prints diagnostics about what is being received.

If you could apply the patch, run the modified keepalived, and see whether it stops the socket receive queue filling up and also post the log output that is generated, that would be very helpful. I hope it will show what is going on, so that a proper patch can be developed to fix the problem.

recvmsg.patch.txt

vadapalliravikumar commented 5 years ago

@pqarmitage , sorry for the delay, was away during the new year eve.

I shall apply your patch, run it, and then share the observations and the logs.

pqarmitage commented 5 years ago

@vadapalliravikumar What I found was that if keepalived is running on one system, but is not running on one or more of the other systems, then the error queue starts filling up with ICMP error messages.

vadapalliravikumar commented 5 years ago

@pqarmitage , in my setup, all 3 keepalived instances are running while the queue is filling up. One point to share though: all my teammates have their 3-instance keepalived clusters set up in the same subnet as mine, and we are using unicast with unique vrrp instance ids. Not sure whether having many keepalived clusters in the same subnet can cause anything with respect to ICMP errors.

vadapalliravikumar commented 5 years ago

keepalived_vrrp.log

Thu Jan  3 00:39:35 2019: ----[ End list_dump ]----
Thu Jan  3 00:39:35 2019: ----[ Begin rb_dump io_events ]----
Thu Jan  3 00:39:35 2019: #01 event 0x55db59d32760 fd 5, flags: 0xd, read 0x55db59d33350, write (nil)
Thu Jan  3 00:39:35 2019: #02 event 0x55db59d32c90 fd 6, flags: 0xd, read 0x55db59d32cd0, write (nil)
Thu Jan  3 00:39:35 2019: #03 event 0x55db59d32d50 fd 7, flags: 0xd, read 0x55db59d32d90, write (nil)
Thu Jan  3 00:39:35 2019: #04 event 0x55db59d344a0 fd 10, flags: 0xd, read 0x55db59d34420, write (nil)
Thu Jan  3 00:39:35 2019: ----[ End rb_dump ]----
Thu Jan  3 00:39:35 2019: calling epoll_wait
Thu Jan  3 00:39:35 2019: epoll_wait returned 1 fds
Thu Jan  3 00:39:35 2019: Calling thread function thread_timerfd_handler(), type READY_FD, val/fd/pid 5, status 15 id 4099
Thu Jan  3 00:39:35 2019: Calling thread function vrrp_script_thread(), type READY, val/fd/pid 10, status 0 id 4100
root@maglev-master-1:/#
root@maglev-master-1:/# date
Thu Jan  3 22:47:01 UTC 2019

@pqarmitage , attached are the last 10K lines of keepalived_vrrp.log after the queue filled up. No fresh messages are being written: I checked at Jan 3 22:47 UTC and the last log entry is from Jan 3 00:39 UTC. Please note that this is without applying your patch.

I will apply your patch and collect the same logs.

pqarmitage commented 5 years ago

@vadapalliravikumar I'm hoping that the diagnostic information logged when the patch is applied will tell us what is happening. It could well be that having other vrrp unicast instances, some of which are not running so that ICMP errors are generated, causes the problem. I hope that there is sufficient information logged by the patched code that we will be able to work it out.

vadapalliravikumar commented 5 years ago

@pqarmitage , applied your patch and brought up a keepalived cluster of 3 instances. Left it running for 2 days. I did not see the queue build up (most probably, the issue is resolved with your patch). Below archive contains the logs from the 3 keepalived instances. As they were running for 2 days, the logs are huge.

keepalived logs.zip

pqarmitage commented 5 years ago

@vadapalliravikumar In the logs you have provided, there are no log entries from the patch I provided, so it appears the additional code in the patch hasn't done anything. I'm therefore not convinced that the modified code has fixed the problem, especially since you indicated at the beginning that this only happens sometimes.

Would it be possible for you to run the patched code again, but this time without the --debug=DvEv option, so that the log files are much smaller? Would it also be possible to start and stop various of your teammates' keepalived instances, to see if there is any connection with that?

vadapalliravikumar commented 5 years ago

@pqarmitage , I tried two more times.

Out of the 2 attempts to recreate the issue with the patch, I did not hit it even once, even after leaving the setup running for 2+ days. I am inclined to think that something in the patch is handling the issue.

Do you think the difference between recvfrom and recvmsg could have made a difference?

pqarmitage commented 5 years ago

@vadapalliravikumar I would like to produce one or two more patches to try and identify what has caused the problem to be resolved. Would you be happy to test them?

I don't think the change from recvfrom to recvmsg on its own has resolved the problem, but I also added some code that does some checking of the return value from recvfrom/recvmsg, and it may be that this has resolved your problem.

vadapalliravikumar commented 5 years ago

@pqarmitage , yes, I will be glad to try more patches. Please share them.

pqarmitage commented 5 years ago

@vadapalliravikumar I have merged a series of commits today that incorporate most of the patch above (the patches do not include handling the error queue of sockets since keepalived does not enable messages being queued to the error queue, and the logs you provided do not show anything being queued to the error queue).

I would be grateful if you could try the latest code and see if your problem is still resolved with this code.

vadapalliravikumar commented 5 years ago

@pqarmitage , built keepalived from the master branch (including your latest commits) and deployed it. Shall wait and see if the issue recurs.

Can you please explain the possible root cause and its fix?

pqarmitage commented 5 years ago

@vadapalliravikumar Unfortunately I cannot explain the root cause, mainly because I can't reproduce the problem, and so I can't be certain what was causing the data to be queued on the receive queue of the socket. What I can say is that the unpatched code was not right, since it didn't check for any errors returned by recvfrom(); I can also say that I don't think the change from recvfrom() to recvmsg() in itself resolves the problem.

What the original patch above did is:

  1. Use recvmsg() rather than recvfrom()

  2. loop on recvmsg() while it returns errno EINTR

  3. Check for error returns from recvmsg()

    Since there are no log entries of this occurring, it appears this is not happening

  4. Set the message DONTWAIT option on recvmsg() and loop on recvmsg() until it returns EAGAIN (i.e. no more data to receive) and return if it returns EAGAIN

    The original patch above didn't loop until EAGAIN; it was commit 3996844 that added this loop.

  5. Use recvmsg() to receive ancillary messages

    Since there are no log entries of this occurring, it appears this is not happening

  6. Set the MSG_TRUNC flag to recvmsg() so we know if messages have been truncated when read

    Since there are no log entries of this occurring, it appears this is not happening

  7. Read messages from the error queue if recvmsg() reading from the receive queue returns an error not specifically handled.

    Since there are no log entries of this occurring, it appears this is not happening

  8. Check if zero-length data is returned and don't attempt to process the packet if that occurs

    Since there are no log entries of this occurring, it appears this is not happening

On the basis of the changes above which haven't generated any log messages, the only changes that might have resolved the problem are:

  1. The change from recvfrom() to recvmsg()
  2. Looping on recvmsg() while EINTR is returned
  3. Looping on recvmsg() and processing received data while more data is available. Using MSG_DONTWAIT and returning if recvmsg() returns errno EAGAIN.

As I indicated above, I don't think 1 above is the resolution, which leaves 2 and or 3. It could be either one of 2 and 3, or require both 2 and 3.

I could very easily produce patches that have 2 only, have 3 only and have both 2 and 3, if you would be willing to test them. I can't think of any other way of identifying the root cause and its fix.

vadapalliravikumar commented 5 years ago

thank you @pqarmitage for the detailed explanation. Let me complete the validation of the current code which fixes both 2 & 3. If we don't hit the issue with the current code, then I can verify individual patches for 2 & 3.

In general, I am open to try any number of patches.

The reason I asked for the root cause is to see if I can come up with definite steps to recreate the issue. Right now, I am just letting the cluster run for a few hours to see if the issue surfaces.

vadapalliravikumar commented 5 years ago

@pqarmitage , built keepalived from master and ran the cluster for 2 days. In my testing, 2 of the 3 keepalived instances were up while the 3rd one was down. I did not see the issue. Looks like it is fixed. We can close the issue now; I can reopen it later if the issue surfaces again.

What would be the best way to consume this fix ?

pqarmitage commented 5 years ago

There probably haven't been enough changes since 2.0.11 to warrant a new version yet.

If you edit configure.ac and change the line beginning AC_INIT( to change 2.0.11 to 2.0.11.1, and then run make dist, it will produce a tarball for version 2.0.11.1.

acassen commented 5 years ago

You are the first one to my knowledge to report this issue, so even though there might be a few people facing the same issue, or who just didn't notice it, I think we must push a new release. I will proceed over the weekend (most likely on Sunday).

Thanks for this deep investigation!

acassen commented 5 years ago

I just spent some time on the proposed patch. I really think the while (recvmsg() < 0) loop solves the issue by forcing an active flush of the queue (maybe we could find a way to determine which signal produced EINTR).

There is still a potential side effect with the proposed patch: introducing an active loop in our I/O MUX could potentially lead to a DoS on all handled instances. Attached is a proposed patch that replaces the active loop by resubmitting a new read operation on the socket fd into the I/O MUX. There is still some uncertainty here as to whether the I/O MUX will really trigger vrrp_dispatcher_read in the same way the active loop did. If so, the proposed patch will also lead to an active flushing loop, but it will resubmit through the whole I/O MUX and will reduce the penalty on scheduler timing/latency.

@vadapalliravikumar , would it be possible for you to try the following patch and report whether it still solves your issue? vrrp_scheduler.patch.gz

pqarmitage commented 5 years ago

I have been doing some further testing, and I can generate a situation where a packet gets left on a socket receive queue. If I have 255 IPv4 instances and 255 IPv6 instances, not using VMACs, using VRRPv3 with an advert interval of 10ms, then if I start the lower priority keepalived instance first and afterwards start the higher priority instance, I often see a recv-q length of 768 on the new master, and it seems that that packet is never received (768 bytes appears to be the space consumed by one advert packet on the receive queue). If higher priority VRRP instances are subsequently started, the keepalived instance with the 768 bytes stuck on its receive queue can still receive other packets.

It could be that if the circumstance that is causing the packet to get "stuck" on the receive queue happens repeatedly over time, then the receive queue might fill up, and the socket will then stop passing any data to the application.

I will investigate this further.

In case anyone wants to do some further testing themselves, the configuration I am using is:

@low net_namespace low
@medium net_namespace medium
@high net_namespace high

@low $IF=eth0
@medium $IF=br0
@high $IF=eth0

global_defs {
    vrrp_strict
    vrrp_iptables TESTIN TESTOUT
        vrrp_garp_master_delay 0
        vrrp_garp_master_repeat 1
        vrrp_garp_master_refresh 0
        vrrp_garp_master_refresh_repeat 2
}

$BLOCK=\
vrrp_instance VI_1_${ID} { \
    interface $IF \
    state BACKUP \
    virtual_router_id ${ID} \
    version 3 \
@high    priority 240 \
@medium priority 220 \
@low    priority 200 \
    advert_int 0.01 \
    virtual_ipaddress { \
      10.2.${ID}.100/32 \
      10.2.${ID}.101/32 \
    } \
    track_interface { \
@low    eth1 \
@medium br1 \
@high   eth1 \
    } \
} \
 \
vrrp_instance VI_6_${ID} { \
    interface $IF \
    state BACKUP \
    virtual_router_id ${ID} \
    version 3 \
@high    priority 240 \
@medium priority 220 \
@low    priority 200 \
    advert_int 0.01 \
    virtual_ipaddress { \
      fe80::${ID}:4000/64 \
      2001:470:69dd:2:${ID}::3000 \
      2001:470:69dd:2:${ID}::4000 \
    } \
}

~SEQ(ID,1,255) $BLOCK

This runs in three network namespaces (low, medium and high) on the same system. If running on different systems, delete the first three lines, and on the keepalived command line specify -i low or -i high for the different keepalived instances.

vadapalliravikumar commented 5 years ago

@acassen , verified your patch in the same setup. 2 of the 3 instances were left running for a day; the 3rd one was intentionally left down. The issue was not seen after a day. Looks like your patch also fixes the issue.

acassen commented 5 years ago

OK, thanks for your feedback. I really think the main issue was the initial failure to handle errors returned by recvfrom. I will merge this patch, keeping the work that moved from recvfrom to recvmsg with CMSG and MSG_TRUNC handling.

vadapalliravikumar commented 5 years ago

thank you @acassen . Also, please let me know if you are going to release 2.0.12 with this fix.

acassen commented 5 years ago

We will push a new release this week (tomorrow or Saturday); I was just waiting on your feedback to be sure the issue was still fixed that way.

vadapalliravikumar commented 5 years ago

cool! thank you @acassen and @pqarmitage for fixing this issue.

pqarmitage commented 5 years ago

I have done some further testing around EINTR. I have written a program that sends 100 SIGUSR1 signals to keepalived in 1 second, and ran it against keepalived configured with only 1 vrrp instance with an advert interval of 1 second. With this setup, almost all the signals will be received while keepalived is waiting for epoll_wait() to return, but epoll_wait() never returned an EINTR error.

I also made a test modification so that the vrrp receive socket did not have O_NONBLOCK set and vrrp_dispatcher_read() loops on recvmsg(), so that most of the time, when the vrrp instance is in backup state, keepalived is waiting for recvmsg() to return; again, with 100 SIGUSR1 signals sent in 1 second, recvmsg() never returned EINTR.

From the signal(7) man page, EINTR is returned if a signal handler is invoked. However, since keepalived uses a signalfd for notification of signals, and all the signals that keepalived handles are blocked by sigprocmask(), no signal handlers are ever invoked; this explains why EINTR is never returned.

acassen commented 5 years ago

Just released 2.0.12. We can close the issue and reopen it if anything new comes up. Thanks Quentin for the deep analysis and @vadapalliravikumar for your time testing.

tik-stbuehler commented 4 years ago

Hi. Sorry for resurrecting this, but I just hit this with debian stable (1:2.0.10-1; debian oldstable 1:1.3.2-1 seems to be working fine):

# ss -tulpenmw | egrep '^Netid|keepalived'
Netid State  Recv-Q  Send-Q             Local Address:Port    Peer Address:Port                                                                                 
raw   UNCONN 0       0                        0.0.0.0:112          0.0.0.0:*     users:(("keepalived",pid=10376,fd=13))                                         
raw   UNCONN 213248  0                        0.0.0.0:112          0.0.0.0:*     users:(("keepalived",pid=10376,fd=12))                                         
raw   UNCONN 0       0                              *:112                *:*     users:(("keepalived",pid=10376,fd=15))                                         
raw   UNCONN 213248  0                              *:112                *:*     users:(("keepalived",pid=10376,fd=14))                                         

As this seems to be a rather severe bug ("you had one job"...), it would be nice if you could specify which commits one would need to backport to fix this in debian stable (or even a minimal bug fix - apparently there was just some error handling missing?); also see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=942182.

Right now this is running in our staging environment, so I might even be able to test a patch.

pqarmitage commented 4 years ago

@tik-stbuehler From reading the logs and matching them against the commits (particularly those dated Jan 15), it looks as though commit 3996844 was the one that finally resolved the problem. However, for that commit to apply you will certainly need to apply commit 068c7f9 first. It is possible you will need some or all of the commits in the range a7cd5e6..5ccf9c9. You might also want to apply commits 6b36e3a and 7630e72.

Please bear in mind that we never managed to reproduce this problem, but the combination of changes that we made seemed to resolve the problem for @vadapalliravikumar; the changes were made on the basis that the code was clearly not correct in this area, in particular when receive errors occurred. As a consequence cherry-picking a number of these commits may not work for you - we never really understood what the problem was so far as I can remember.

plantroon commented 3 years ago

Could this happen (filling up of the Recv-Q) when there are too many VRRP hosts on the same network?

pqarmitage commented 3 years ago

@plantroon The recv-q should never fill up, and if it does then that indicates a bug in keepalived. In theory there should be no limit to the number of VRRP hosts on the same network; I'm not sure what you mean by "too many VRRP hosts".

As is written above, we never understood the cause of the problem, and so it is quite possible that scenarios exist that we haven't catered for which cause the recv-q to fill up.

What I think you need to do is open a new issue, referencing this issue, and provide full details of your scenario and problem. This should include the output of keepalived -v and the configuration from all hosts and a description of your network. Also a detailed description of the problem, and the output of ss -tulpenmw | egrep '^Netid|keepalived' or netstat -anp and any other details you think would be helpful.