
Keepalived_healthcheckers[62205]: A thread timer expired 5.167032 seconds ago #2271

Closed qd19zzx closed 1 month ago

qd19zzx commented 1 year ago

Describe the issue
We're using keepalived to perform health checks on the real servers. We've noticed that the interval between TCP health checks is inconsistent with the delay_loop setting: although delay_loop is set to 3 seconds, the actual interval is 7 seconds. We also see a large number of 'Keepalived_healthcheckers[62205]: A thread timer expired 5.167316 seconds ago' messages in our logs, and I suspect the inconsistent interval is related to them. The comment in the code suggests this may be related to high load, but we have plenty of spare CPU and memory. What other factors could cause 'A thread timer expired' and lead to inaccurate intervals?

top - 15:23:10 up 2 days, 1:44, 3 users, load average: 5.00, 4.92, 4.56
Tasks: 809 total, 1 running, 808 sleeping, 0 stopped, 0 zombie
%Cpu0  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6  : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7  : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu8  : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9  : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu19 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu22 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu24 : 1.3 us, 2.0 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu25 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 99.7 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu27 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

[root@localhost l4lb-agent]# free -h
              total        used        free      shared  buff/cache   available
Mem:          472Gi       237Gi       230Gi       244Mi       4.3Gi       231Gi
Swap:         4.0Gi       0.0Ki       4.0Gi

Keepalived version (output of keepalived -v):

Keepalived v2.2.7 (03/21,2023), git commit +

Copyright(C) 2001-2023 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-240.el8.x86_64 #1 SMP Mon Jan 4 11:55:39 UTC 2021
Distro: CentOS Linux 8

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exusr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-vrrp PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXXsr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro

Config options: LVS OLD_CHKSUM_COMPAT INIT=systemd SYSTEMD_NOTIFY

System options: VSYSLOG MEMFD_CREATE LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CFRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_C


Configuration file:

global_defs {
    lvs_notify_fifo /tmp/lvs_rs_update
    checker_process_name lvs_checker
}

virtual_server 177.17.7.3 4040 {
    delay_loop 6
    persistence_timeout 50
    protocol tcp
    real_server 177.17.6.5 4045 {
        weight 100
        inhibit_on_failure
        TCP_CHECK {
            connect_timeout 10
            retry 3
            delay_before_retry 3
            delay_loop 3
            connect_port 4041
            log_all_failures on
        }
    }
    real_server 177.17.6.4 4044 {
        weight 100
        inhibit_on_failure
        TCP_CHECK {
            connect_timeout 10
            retry 3
            delay_before_retry 3
            delay_loop 3
            connect_port 4041
            log_all_failures on
        }
    }
    real_server 177.17.6.6 4046 {
        weight 100
        inhibit_on_failure
        TCP_CHECK {
            connect_timeout 10
            retry 3
            delay_before_retry 3
            delay_loop 3
            connect_port 4041
            log_all_failures on
        }
    }
}

System log entries:

Apr 02 15:21:11 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.162731 seconds ago
Apr 02 15:21:18 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165175 seconds ago
Apr 02 15:21:25 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.166563 seconds ago
Apr 02 15:21:32 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.163723 seconds ago
Apr 02 15:21:39 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165628 seconds ago
Apr 02 15:21:47 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167099 seconds ago
Apr 02 15:21:54 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167093 seconds ago
Apr 02 15:22:01 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.165756 seconds ago
Apr 02 15:22:08 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.167222 seconds ago
Apr 02 15:22:15 localhost.localdomain Keepalived_healthcheckers[62205]: A thread timer expired 5.166807 seconds ago

pqarmitage commented 1 year ago

I notice that CPUs 25, 26 and 27 are running at 100% and also the hardware interrupt rate is not insignificant. Could the processes running on CPUs 25, 26 and 27 be using some other (i.e. non memory/CPU) system resources that stop keepalived being scheduled?

Is /tmp local, and are the keepalived logs being written to local filestore? We have experienced similar delays in the past when keepalived is reading/writing files that are on network attached devices.

The thread timer expired times are all 5.16... seconds, which is extremely strange.

Can you capture the network traffic for the TCP_CHECKs? It might give some indication as to why they are running at 7 second intervals, although the thread timer expired messages are likely to be the reason why the checkers are running at 7 or 8 second intervals.

With the information provided above I cannot see any other reason for why the kernel is not scheduling keepalived to run as needed.

qd19zzx commented 1 year ago

Thank you for your reply. In fact, we have tried setting CPU affinity and binding keepalived to an idle CPU core, but we still encountered thread timer expired logs. We also do not use network storage. The attachment is the TCP check packet capture. tcp_checks.zip

qd19zzx commented 1 year ago

Hi @pqarmitage, I enabled the epoll_debug flag and noticed that every "A thread timer expired" message is generated when calling epoll_wait on fd 1. Here is the log: all four lines share the same timestamp, yet a timeout of 5.16 seconds is reported. It seems the timer is inaccurate, and the TCP check interval is likely inaccurate for the same reason. What could cause the timer inaccuracy?

Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: Setting timer_fd 0.000000001
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: calling epoll_wait
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: epoll_wait returned 1 fds
Apr 3 09:45:41 localhost Keepalived_healthcheckers[73631]: A thread timer expired 5.166934 seconds ago
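As I understand it, the loop behind these messages looks roughly like the sketch below (my reconstruction from the debug output, not the actual keepalived source; here the 1ns value from the log is hardcoded for illustration):

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <unistd.h>

/* Simplified checker event loop: timer_fd is a timerfd registered with
 * epoll_fd (assumed registered with data.fd = timer_fd).
 * "Setting timer_fd 0.000000001" corresponds to timerfd_settime(), and
 * "A thread timer expired X seconds ago" is logged after epoll_wait()
 * returns, when a thread's deadline turns out to be X seconds in the past. */
void event_loop(int epoll_fd, int timer_fd)
{
    struct epoll_event ev;
    uint64_t ticks;

    for (;;) {
        /* One-shot timeout until the next due keepalived thread; the 1ns
         * from the log above means "already due, fire immediately". */
        struct itimerspec its = { .it_value = { .tv_sec = 0, .tv_nsec = 1 } };
        timerfd_settime(timer_fd, 0, &its, NULL);

        if (epoll_wait(epoll_fd, &ev, 1, -1) == 1 && ev.data.fd == timer_fd)
            read(timer_fd, &ticks, sizeof(ticks));   /* drain the timerfd */

        /* ... dispatch expired threads, comparing deadlines with now ... */
    }
}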

pqarmitage commented 1 year ago

@qd19zzx Are you still experiencing this problem?

The only way I can think of to debug this is to run with --debug=DcEc, which will produce a large number of log entries. To enable these options you need to specify the configure options --enable-epoll-thread-dump --enable-epoll-debug. I would also suggest --enable-log-file so that log entries can be written to a file rather than via syslog.

Run keepalived with the following options: --debug=DcEc -g/tmp/epoll-debug.log. You will get various log files, but the interesting one should be /tmp/epoll_debug_check.log. It will show all the thread details and we should be able to see which thread timer is expiring.

domel5 commented 2 months ago

I have the same problem and I found that it is related to the kernel version, at least in my case. The last kernel where threads are not delayed is 5.17.8; if I switch to 5.17.9 (or any other kernel released after 2022-05-15) I can see these delays. I've used kernels from https://kernel.ubuntu.com/mainline/. The keepalived hosts are running on KVM VMs, but the hosts have enough capacity to let them work freely. No keepalived setting related to the scheduler, priority, etc. helps.

5sec delay on strace (delay_loop set to 2s):

12:22:11 timerfd_settime(6, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=874307000}}, NULL) = 0
12:22:11 epoll_wait(5, [{events=EPOLLIN, data={u32=2187289856, u64=94650381591808}}], 4, -1) = 1
12:22:12 read(6, "\1\0\0\0\0\0\0\0", 8) = 8
12:22:12 epoll_ctl(5, EPOLL_CTL_DEL, 10, NULL) = 0
12:22:12 close(10)                      = 0
12:22:17 timerfd_settime(6, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
12:22:17 epoll_wait(5, [{events=EPOLLIN, data={u32=2187289856, u64=94650381591808}}], 3, -1) = 1

FDs of the above calls:

keepalive 68210 root    5u  a_inode               0,15        0      65 [eventpoll:6,7,11]
keepalive 68210 root    6u  a_inode               0,15        0      65 [timerfd]

My setup:

root@front-lb1:~# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04 LTS"
root@front-lb1:~# keepalived -v
Keepalived v2.2.8 (04/04,2023), git commit v2.2.7-154-g292b299e+

Copyright(C) 2001-2023 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 6.8.0
Running on Linux 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024
Distro: Ubuntu 24.04 LTS

configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-json --enable-bfd --enable-regex --with-init=systemd build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer  -flto=auto -ffat-lto-objects -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -fdebug-prefix-map=/build/keepalived-F53vvG/keepalived-2.2.8=/usr/src/keepalived-1:2.2.8-1build2 LDFLAGS=-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=3

Config options:  NFTABLES LVS REGEX VRRP VRRP_AUTH VRRP_VMAC JSON BFD OLD_CHKSUM_COMPAT SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS INIT=systemd SYSTEMD_NOTIFY

System options:  VSYSLOG MEMFD_CREATE IPV6_MULTICAST_ALL IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK

I see the same behaviour on Ubuntu 22.04.5 LTS, Linux 5.15.0-122-generic and Keepalived v2.2.4.

pqarmitage commented 2 months ago

@domel5 This is very interesting. Do you know what fd 10 is? It looks as though it could be the close(10) that is causing the delay.

Could you please provide a copy of your configuration so that we can get a better idea of what is happening?

domel5 commented 1 month ago

Following the 10th descriptor path was a good idea. It was an HTTP_GET check's socket. Let me explain what was happening in my case. Kernel commit https://github.com/torvalds/linux/commit/190cc82489f46f9d88e73c81a47e14f80a791e1a introduced source port randomization. Previously the client took source ports incrementally, which made it hard to hit the same 4-tuple within 120 seconds (net.netfilter.nf_conntrack_tcp_timeout_fin_wait) while issuing checks to the real server. With source port randomization enabled, it turns out that the same source port used to connect to the real server can be selected more than once within 120s (I see it in tcpdump). This wouldn't be a problem with a default real server network configuration, because a second connection with the same 4-tuple can be opened while the first one, in the TIME_WAIT state, still exists in conntrack. The problem occurs when we turn on iptables synproxy on the real server. Synproxy establishes the second connection, but it doesn't pass it down to the system, so the application never starts talking with the client (the keepalived check). The LVS host sees this connection as established, because synproxy completed the three-way handshake, but the real server no longer replies to any packets on this connection, and here we are: thread timer expired.

This is how it looks on a real server with synproxy enabled; you can see 10.0.0.1 retrying its ACK without success:

root@front1:~# tcpdump -nni bond1 port 33333
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:11:37.450919 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [S], seq 3004026282, win 64240, options [mss 1460,sackOK,TS val 3112537041 ecr 0,nop,wscale 7], length 0
11:11:37.450987 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [S.], seq 1444798319, ack 3004026283, win 0, options [mss 1460,sackOK,TS val 4146581719 ecr 3112537041,nop,wscale 10], length 0
11:11:37.451064 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112537042 ecr 4146581719], length 0
11:11:37.451101 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 1, win 63, options [nop,nop,TS val 4146581719 ecr 3112537042], length 0
11:11:39.469178 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 1, win 64, options [nop,nop,TS val 4146583737 ecr 3112537042], length 0
11:11:39.469350 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112539060 ecr 4146581719], length 0
11:11:39.862859 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [P.], seq 1:3, ack 1, win 502, options [nop,nop,TS val 3112539453 ecr 4146581719], length 2: HTTP
11:11:39.862894 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 3, win 64, options [nop,nop,TS val 4146584130 ecr 3112539453], length 0
11:11:39.863256 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [P.], seq 1:5205, ack 3, win 64, options [nop,nop,TS val 4146584131 ecr 3112539453], length 5204: HTTP: HTTP/1.1 404 Not Found
11:11:39.863273 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [F.], seq 5205, ack 3, win 64, options [nop,nop,TS val 4146584131 ecr 3112539453], length 0
11:11:39.863339 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 4345, win 501, options [nop,nop,TS val 3112539454 ecr 4146584131], length 0
11:11:39.863365 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5205, win 501, options [nop,nop,TS val 3112539454 ecr 4146584131], length 0
11:11:39.903983 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5206, win 501, options [nop,nop,TS val 3112539495 ecr 4146584131], length 0
11:11:41.909115 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 3, win 64, options [nop,nop,TS val 4146586177 ecr 3112539495], length 0
11:11:41.909262 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5206, win 501, options [nop,nop,TS val 3112541500 ecr 4146584131], length 0
11:11:43.949198 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 3, win 64, options [nop,nop,TS val 4146588217 ecr 3112541500], length 0
11:11:43.949352 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5206, win 501, options [nop,nop,TS val 3112543540 ecr 4146584131], length 0
11:11:45.209125 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [F.], seq 3, ack 5206, win 501, options [nop,nop,TS val 3112544800 ecr 4146584131], length 0
11:11:45.209197 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 4, win 64, options [nop,nop,TS val 4146589477 ecr 3112544800], length 0
^C
19 packets captured
19 packets received by filter
0 packets dropped by kernel
root@front1:~# conntrack -L | grep 33333
tcp      6 102 TIME_WAIT src=10.0.0.1 dst=10.0.0.65 sport=33333 dport=80 src=10.0.0.65 dst=10.0.0.1 sport=80 dport=33333 [ASSURED] mark=0 delta-time=25 use=1
conntrack v1.4.8 (conntrack-tools): 2404 flow entries have been shown.
root@front1:~# tcpdump -nni bond1 port 33333
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:12:11.189763 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [S], seq 3531195415, win 64240, options [mss 1460,sackOK,TS val 3112570780 ecr 0,nop,wscale 7], length 0
11:12:11.189838 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [S.], seq 524590409, ack 3531195416, win 0, options [mss 1460,sackOK,TS val 4043713595 ecr 3112570780,nop,wscale 10], length 0
11:12:11.189925 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112570781 ecr 4043713595], length 0
11:12:14.945002 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112574536 ecr 4043713595], length 0
11:12:15.353039 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112574944 ecr 4043713595], length 0
11:12:16.177025 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112575768 ecr 4043713595], length 0
11:12:17.841029 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112577432 ecr 4043713595], length 0
11:12:21.105039 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112580696 ecr 4043713595], length 0
11:12:27.569022 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112587160 ecr 4043713595], length 0
11:12:40.881037 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3112600472 ecr 4043713595], length 0

And without synproxy:

root@front1:~# tcpdump -nni bond1 port 33333
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:19:46.143415 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [S], seq 247496449, win 64240, options [mss 1460,sackOK,TS val 3134625734 ecr 0,nop,wscale 7], length 0
17:19:46.143497 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [S.], seq 1103776857, ack 247496450, win 65160, options [mss 1460,sackOK,TS val 2284064786 ecr 3134625734,nop,wscale 10], length 0
17:19:46.143570 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3134625734 ecr 2284064786], length 0
17:19:47.928719 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [P.], seq 1:3, ack 1, win 502, options [nop,nop,TS val 3134627519 ecr 2284064786], length 2: HTTP
17:19:47.928799 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 3, win 64, options [nop,nop,TS val 2284066571 ecr 3134627519], length 0
17:19:47.929371 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [P.], seq 1:5205, ack 3, win 64, options [nop,nop,TS val 2284066572 ecr 3134627519], length 5204: HTTP: HTTP/1.1 404 Not Found
17:19:47.929391 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [F.], seq 5205, ack 3, win 64, options [nop,nop,TS val 2284066572 ecr 3134627519], length 0
17:19:47.929444 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 4345, win 501, options [nop,nop,TS val 3134627520 ecr 2284066572], length 0
17:19:47.929463 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5205, win 501, options [nop,nop,TS val 3134627520 ecr 2284066572], length 0
17:19:47.969912 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5206, win 501, options [nop,nop,TS val 3134627561 ecr 2284066572], length 0
17:19:49.616462 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [F.], seq 3, ack 5206, win 501, options [nop,nop,TS val 3134629207 ecr 2284066572], length 0
17:19:49.616521 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 4, win 64, options [nop,nop,TS val 2284068259 ecr 3134629207], length 0
^C
12 packets captured
12 packets received by filter
0 packets dropped by kernel
root@front1:~# conntrack -L | grep 33333
tcp      6 114 TIME_WAIT src=10.0.0.1 dst=10.0.0.65 sport=33333 dport=80 src=10.0.0.65 dst=10.0.0.1 sport=80 dport=33333 [ASSURED] mark=0 delta-time=9 use=1
conntrack v1.4.8 (conntrack-tools): 2221 flow entries have been shown.
root@front1:~# tcpdump -nni bond1 port 33333
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:20:15.281920 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [S], seq 702785545, win 64240, options [mss 1460,sackOK,TS val 3134654873 ecr 0,nop,wscale 7], length 0
17:20:15.282016 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [S.], seq 1559066263, ack 702785546, win 65160, options [mss 1460,sackOK,TS val 2284093924 ecr 3134654873,nop,wscale 10], length 0
17:20:15.282102 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3134654873 ecr 2284093924], length 0
17:20:16.306950 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [P.], seq 1:3, ack 1, win 502, options [nop,nop,TS val 3134655898 ecr 2284093924], length 2: HTTP
17:20:16.307016 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 3, win 64, options [nop,nop,TS val 2284094949 ecr 3134655898], length 0
17:20:16.307416 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [P.], seq 1:5205, ack 3, win 64, options [nop,nop,TS val 2284094950 ecr 3134655898], length 5204: HTTP: HTTP/1.1 404 Not Found
17:20:16.307439 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [F.], seq 5205, ack 3, win 64, options [nop,nop,TS val 2284094950 ecr 3134655898], length 0
17:20:16.307478 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5205, win 501, options [nop,nop,TS val 3134655898 ecr 2284094950], length 0
17:20:16.347939 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [.], ack 5206, win 501, options [nop,nop,TS val 3134655939 ecr 2284094950], length 0
17:20:17.585193 IP 10.0.0.1.33333 > 10.0.0.65.80: Flags [F.], seq 3, ack 5206, win 501, options [nop,nop,TS val 3134657176 ecr 2284094950], length 0
17:20:17.585258 IP 10.0.0.65.80 > 10.0.0.1.33333: Flags [.], ack 4, win 64, options [nop,nop,TS val 2284096228 ecr 3134657176], length 0
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel
root@front1:~# conntrack -L | grep 33333
tcp      6 115 TIME_WAIT src=10.0.0.1 dst=10.0.0.65 sport=33333 dport=80 src=10.0.0.65 dst=10.0.0.1 sport=80 dport=33333 [ASSURED] mark=0 delta-time=7 use=1
conntrack v1.4.8 (conntrack-tools): 2290 flow entries have been shown.

I used nc to reuse the same client port quickly.
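The same port reuse can also be done from C if nc is not convenient. This is a hypothetical helper (connect_from_port is my name, error checks omitted): bind the client socket to a fixed source port before connecting, so the same 4-tuple is reused immediately.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

int connect_from_port(const char *dst_ip, uint16_t dst_port, uint16_t src_port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Allow promptly rebinding the fixed source port between runs. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in src = { .sin_family = AF_INET,
                               .sin_port = htons(src_port),
                               .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&src, sizeof(src));

    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port = htons(dst_port) };
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));
    return fd;
}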

I don't think it's a problem with the keepalived check itself, but maybe there should be some additional timeout setting (besides connect_timeout) to abort a stalled check connection. A way to work around the problem with synproxy enabled is to set a lower value for nf_conntrack_tcp_timeout_fin_wait, to reduce the likelihood of a second-connection collision. I will file a bug report with netfilter, because IMO the problem lies somewhere there.

Below is my config for reference:

global_defs{
  enable_script_security
  script_user root
  lvs_sync_daemon bond1 inst vrid_50
}
virtual_server fwmark 1 {
  delay_loop 2
  lb_algo mh
  lb_kind DR
  persistence_timeout 10
  persistence_granularity 255.255.255.255
  real_server 10.0.0.65 {
    weight 10000
    HTTP_GET {
      connect_port 80
      url {
        path /nginx_status
        status_code 200
      }
      connect_timeout 2
    }
  }
  # + other reals
}

pqarmitage commented 1 month ago

@domel5 Many thanks for that detailed analysis; I certainly would never have managed to work that one out. It would be really helpful if you could post a link to your netfilter bug report here.

So if the range of available source ports is 32768 - 60999 (/proc/sys/net/ipv4/ip_local_port_range), the interval between runs of the HTTP_GET checker is 3 seconds (the default), and there are no other TCP connections from the system to 10.0.0.65 port 80, then the number of HTTP_GET checker runs before the probability of avoiding a duplicate-port problem drops below 0.5 is 510, i.e. 25.5 minutes. The probability of a duplicate-port problem occurring within 1 hour is 0.812, so this problem is going to be seen regularly in the scenario you have described.
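These figures can be sanity-checked with a quick model (a sketch of my arithmetic, assuming each check draws its source port uniformly from the 28232-port default range and can collide with the ports of the previous 120s / 3s = 40 checks still within the fin_wait window):

#include <stdio.h>

int main(void)
{
    const double ports = 60999 - 32768 + 1;  /* 28232 ephemeral ports */
    const int window = 120 / 3;              /* checks inside the fin_wait window */
    double p_no_dup = 1.0;
    int crossed = 0;

    for (int run = 1; run <= 1200; run++) {  /* 1200 runs = 1 hour at 3s intervals */
        int exposed = (run - 1 < window) ? run - 1 : window;
        p_no_dup *= 1.0 - exposed / ports;
        if (!crossed && p_no_dup < 0.5) {
            printf("P(no duplicate) drops below 0.5 at run %d (%.1f minutes)\n",
                   run, run * 3 / 60.0);
            crossed = 1;
        }
    }
    printf("P(at least one duplicate in 1 hour) = %.3f\n", 1.0 - p_no_dup);
    return 0;
}

Under those assumptions this should print run 510 and 0.812, matching the figures above.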

I do think, however, that there is a keepalived issue that needs to be resolved, if my understanding is correct. Looking at the strace output:

12:22:11 timerfd_settime(6, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=874307000}}, NULL) = 0
12:22:11 epoll_wait(5, [{events=EPOLLIN, data={u32=2187289856, u64=94650381591808}}], 4, -1) = 1
12:22:12 read(6, "\1\0\0\0\0\0\0\0", 8) = 8
12:22:12 epoll_ctl(5, EPOLL_CTL_DEL, 10, NULL) = 0
12:22:12 close(10)                      = 0
12:22:17 timerfd_settime(6, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
12:22:17 epoll_wait(5, [{events=EPOLLIN, data={u32=2187289856, u64=94650381591808}}], 3, -1) = 1

As I am sure you are aware, keepalived uses a timerfd to generate epoll_wait() timeouts, since we need better resolution than the 1 msec that epoll_wait() provides. Before the first epoll_wait() call above, the timeout is set to 0.874307s. I am guessing that it is the timerfd triggering that causes the epoll_wait() to return. This appears to be a timeout for the thread (these are keepalived threads) handling the HTTP_GET (or it could be the response to sending the HTTP_GET, but that would appear not to be the case here). In any event, keepalived then closes the HTTP_GET fd, but it appears that the close call takes 5 seconds to complete. After that the timerfd timeout is set to 1ns (this means that the next keepalived thread timeout is in the past; it makes it easier in the code to call epoll_wait() again in order to run the next thread, and 1ns is used rather than 0 since setting the timeout to 0 disarms the timer).
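In code terms, the arming logic described above amounts to something like this (a sketch based on my reading of timerfd_settime(2), not the actual keepalived source): an all-zero it_value disarms a timerfd, so a deadline already in the past has to be rounded up to 1ns, the smallest value that still fires.

#include <sys/timerfd.h>

/* Arm timer_fd for a deadline ns_until_deadline nanoseconds away. */
void arm_timer(int timer_fd, long ns_until_deadline)
{
    struct itimerspec its = { .it_value = { 0, 0 } };

    if (ns_until_deadline <= 0) {
        /* Deadline already passed: 0 would disarm the timerfd, so use
         * 1ns, which makes epoll_wait() return immediately instead. */
        its.it_value.tv_nsec = 1;
    } else {
        its.it_value.tv_sec  = ns_until_deadline / 1000000000L;
        its.it_value.tv_nsec = ns_until_deadline % 1000000000L;
    }
    timerfd_settime(timer_fd, 0, &its, NULL);
}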

The problem here is that keepalived opens sockets in non-blocking mode, but the close() appears to be blocking, which is not what I expected, and from the documentation I can find close() should return immediately on a non-blocking socket. Is this potentially another kernel issue?

I need to try and reproduce this so that I can add some debugging code to keepalived to better understand what is happening. I suspect that I could reduce /proc/sys/net/ipv4/ip_local_port_range to increase the probability of it happening (a range of only 50 or 60 ports should make it happen very quickly).

Even if the problem is a kernel issue(s) we need to find a workaround for it so that existing systems do not experience the thread timeout problem. At the moment my thoughts are that we might need to have a (pthread type) thread to handle the closing of TCP sockets.
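Something along these lines, purely a sketch of the idea with hypothetical names: hand any fd that might linger to a detached pthread, so only that thread blocks in close().

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

static void *closer(void *arg)
{
    close((int)(intptr_t)arg);  /* may block on SO_LINGER; only this thread waits */
    return NULL;
}

/* Close fd without risking a stall in the main event loop. */
void close_async(int fd)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, closer, (void *)(intptr_t)fd) == 0)
        pthread_detach(tid);
    else
        close(fd);              /* fall back to a synchronous close */
}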

pqarmitage commented 1 month ago

I have now reproduced this problem. When using synproxy, executing echo 40000 40099 >/proc/sys/net/ipv4/ip_local_port_range on the system where keepalived is running makes the problem occur much more quickly.

An even easier way to reproduce the problem is to add the following iptables entry: iptables -I INPUT -m state --state ESTABLISHED -i IF -p tcp --dport PORT --tcp-flags FIN,ACK,SYN,RST FIN,ACK -j DROP

My concern in relation to this issue is why there is a 5 second delay before the call to close() returns, and what keepalived can do about that. It transpires that the delay is caused by the SO_LINGER socket option, which keepalived sets when it opens stream sockets. This is despite Richard Stevens' book "Unix Network Programming, 2nd edition, Volume 1" explicitly stating in the section on SO_LINGER:

If the socket has been set nonblocking, it will not wait for close to complete, even if the linger time is nonzero.

This is wrong, certainly in a Linux environment. It appears that the kernel waits for the first of:

- all queued data on the socket being successfully sent, or
- the linger timeout expiring

before close() returns. This is explained in the socket(7) man page, although it does not make any reference to whether a socket is non-blocking or not.
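For illustration, the combination in question looks like this (a minimal sketch, not keepalived's actual code; the 5 second l_linger merely mirrors the observed delay, I have not checked the value keepalived actually used):

#include <netinet/in.h>
#include <sys/socket.h>

int open_stream_socket(void)
{
    /* non-blocking TCP socket, as the checkers use */
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);

    /* With l_onoff set, close() waits (up to l_linger seconds) for queued
     * data to be sent - on Linux even on a non-blocking socket, contrary
     * to Stevens. */
    struct linger lin = { .l_onoff = 1, .l_linger = 5 };
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));

    return fd;
}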

Commit b7cb693 removes setting SO_LINGER on TCP sockets. This only affects TCP_CHECK, HTTP_GET, SSL_GET, SMTP_CHECK and sending notify emails.

@domel5 Many thanks for your help in identifying the cause of this issue.

domel5 commented 1 month ago

Thanks @pqarmitage. I've now filed the synproxy bug here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2083830