Closed louis-6wind closed 3 years ago
On last master Keepalived v2.1.5 (11/22,2020), git commit v2.1.5-217-g2ba09972 root@ha1:~/keepalived# systemctl status keepalived ● keepalived.service - LVS and VRRP High Availability Monitor Loaded: loaded (/lib/systemd/system/keepalived.service; disabled; vendor preset: enabled) Active: activating (start) since Tue 2020-11-24 10:39:19 UTC; 1min 24s ago Main PID: 22601 (keepalived) Tasks: 2 (limit: 2353) CGroup: /system.slice/keepalived.service ├─22601 /usr/sbin/keepalived -D └─22602 /usr/sbin/keepalived -D
Nov 24 10:39:19 ha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. Nov 24 10:39:19 ha1 systemd[1]: keepalived.service: Found left-over process 22602 (keepalived) in control group while starting unit. Ignoring. Nov 24 10:39:19 ha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Abort of process
Program terminated with signal SIGABRT, Aborted.
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt
stack_end=0x7ffc7dfec8b8) at ../csu/libc-start.c:310
@louis-oui I am fairly certain that the problem is caused by having the startup delay, and then the bond0 interface going from down to up before the delay expires. This is indicated by the sock=0x0 in the call to thread_compute_timer() meaning that the sockpool hasn't yet been set up.
This is certainly something that needs to be fixed, but as a workaround could you try adding dynamic_interfaces
to the global_defs section of your configuration, and remove vrrp_startup_delay 30
from the same section.
Last master (v2.1.5 (11/28,2020), git commit v2.1.5-226-g2499caeb) has now the same behaviour than v2.1.5
Dec 02 17:16:11 ha1 systemd-coredump[1133]: Process 1107 (keepalived) of user 0 dumped core.
Stack trace of thread 1107:
#0 0x000055f04b60bce6 vrrp_compute_timer (keepalived)
#1 0x000055f04b5f49af process_if_status_change (keepalived)
#2 0x000055f04b5f4a40 update_interface_flags (keepalived)
#3 0x000055f04b5f4e8c netlink_link_filter (keepalived)
#4 0x000055f04b5f55cb netlink_broadcast_filter (keepalived)
#5 0x000055f04b5f20cf netlink_parse_info (keepalived)
#6 0x000055f04b5f2681 kernel_netlink (keepalived)
#7 0x000055f04b624768 thread_call (keepalived)
#8 0x000055f04b5f9f6d start_vrrp_child (keepalived)
#9 0x000055f04b5eb0b4 start_keepalived (keepalived)
#10 0x000055f04b624768 thread_call (keepalived)
#11 0x000055f04b5ecf67 keepalived_main (keepalived)
#12 0x00007f8ebfe15b97 __libc_start_main (libc.so.6)
#13 0x000055f04b5ea4da _start (keepalived)
It seems the root cause of https://github.com/acassen/keepalived/issues/1143 is that bonding implementation with LACP on linux does not take into account the LACP partner state. Condition to send data is only the local state (actor in LACP terminology) is OK. The partner is not ready to send data (not collecting/distributing state in LACP). Fix on kernel would be appreciated so that bond netlink state is up only if one slave interface is ready (actor and parter is collecting/distributing).
I tested with bonding active-active without LACP (aka. 802.3ad) mode xor and I don't have the issue.
vrrp_startup_delay is a workaround to this problem.
A working workaround is replacing vrrp_startup_delay parameter by a additional parameter in systemctl keepalived.service
ExecStartPre=/bin/sleep 30
@louis-oui It would be really helpful to have the gdb backtraces, since they give us line numbers and parameter values. Would it be possible for you to generate those and post them for this issue and also issues #1798 and #1799.
Did you try the suggested dynamic_interfaces
without vrrp_startup_delay
to see if that worked?
Tag v2.1.5 coredump
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055d0ff4f6cbc in vrrp_compute_timer (sock=0x0) at vrrp_scheduler.c:344
344 vrrp = rb_entry(rb_first_cached(&sock->rb_sands), vrrp_t, rb_sands);
(gdb) bt
#0 0x000055d0ff4f6cbc in vrrp_compute_timer (sock=0x0) at vrrp_scheduler.c:344
#1 0x000055d0ff4f6cfd in vrrp_thread_requeue_read (vrrp=0x55d101365310) at vrrp_scheduler.c:351
#2 0x000055d0ff4f7b39 in try_up_instance (vrrp=0x55d101365310, leaving_init=false) at vrrp_scheduler.c:690
#3 0x000055d0ff4d852a in process_if_status_change (ifp=0x55d1013625a0) at keepalived_netlink.c:1546
#4 0x000055d0ff4d862e in update_interface_flags (ifp=0x55d1013625a0, ifi_flags=69699) at keepalived_netlink.c:1576
#5 0x000055d0ff4d9e38 in netlink_link_filter (snl=0x7fff6832b434, h=0x55d101364bf0) at keepalived_netlink.c:2172
#6 0x000055d0ff4da2f4 in netlink_broadcast_filter (snl=0x7fff6832b434, h=0x55d101364bf0) at keepalived_netlink.c:2316
#7 0x000055d0ff4d7f57 in netlink_parse_info (filter=0x55d0ff4da28c <netlink_broadcast_filter>, nl=0x55d0ff7483b0 <nl_kernel>, n=0x0, read_all=true) at keepalived_netlink.c:1369
#8 0x000055d0ff4da3a3 in kernel_netlink (thread=0x55d101363d70) at keepalived_netlink.c:2346
#9 0x000055d0ff51d0d0 in thread_call (thread=0x55d101363d70) at scheduler.c:1876
#10 0x000055d0ff51d208 in process_threads (m=0x55d10135a530) at scheduler.c:1940
#11 0x000055d0ff51d5a9 in launch_thread_scheduler (m=0x55d10135a530) at scheduler.c:2047
#12 0x000055d0ff4df94d in start_vrrp_child () at vrrp_daemon.c:1044
#13 0x000055d0ff4cc248 in start_keepalived (thread=0x55d10135eba0) at main.c:530
#14 0x000055d0ff51d0d0 in thread_call (thread=0x55d10135eba0) at scheduler.c:1876
#15 0x000055d0ff51d208 in process_threads (m=0x55d10135a530) at scheduler.c:1940
#16 0x000055d0ff51d5a9 in launch_thread_scheduler (m=0x55d10135a530) at scheduler.c:2047
#17 0x000055d0ff4cec3b in keepalived_main (argc=2, argv=0x7fff6832ba68) at main.c:2384
#18 0x000055d0ff4cbc5a in main (argc=2, argv=0x7fff6832ba68) at main.c:29
Did you try the suggested
dynamic_interfaces
withoutvrrp_startup_delay
to see if that worked?
There is no crash in that case and the same scenario
This issue is not directly related to bonding, but to the bond0 (or any) interface changing state while the vrrp_startup_delay is in progress.
Commit 1d92217 resolves the segfault. Commit 942ae08 resolves an issue whereby the IPv6 link local address of the underlying interface of a VMAC was not copied to the vrrp instance, meaning that keepalived attempted to send VRRP adverts with no source address.
If you need the bonding issue with LACP resolving, you will need to report that to the kernel maintainers.
Describe the bug Process crash at startup
To Reproduce Two machines ha1 and ha2. Linked together by ens4 and ens5.
Observed behavior coredump
Expected behavior No coredump
Keepalived version Keepalived v2.1.5 (07/13,2020)
Copyright(C) 2001-2020 Alexandre Cassen, acassen@gmail.com
Built with kernel headers for Linux 4.15.18 Running on Linux 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020
configure options: --prefix=/usr --sysconfdir=/etc --with-extra-cflags=-I/usr/include/libnl3 --with-extra-ldflags= --with-extra-libs=-lnl-genl-3 --disable-lvs --with-init=systemd --host=x86_64-linux-gnu host_alias=x86_64-linux-gnu
Config options: VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING
System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK
Distro (please complete the following information):
Details of any containerisation or hosted service (e.g. AWS) no container Configuration file:
cat>/etc/keepalived/keepalived.conf <<\EOF global_defs { router_id router enable_script_security script_user root dynamic_interfaces vrrp_startup_delay 30 }
vrrp_sync_group group15 { group { vrrp } }
vrrp_instance vrrp { version 2 state BACKUP interface bond0
use_vmac vrrp
track_file { }
garp_master_delay 5
virtual_router_id 15
priority 150 advert_int 1.0
virtual_ipaddress { 192.168.15.38/28 }
preempt_delay 30 } EOF
Notify and track scripts NA
System Log entries -- Logs begin at Fri 2020-10-23 11:17:41 UTC, end at Tue 2020-11-24 10:18:53 UTC. -- Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Registering Kernel netlink reflector Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Registering Kernel netlink command channel Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Opening file '/etc/keepalived/keepalived.conf'. Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Sync group group15 has only 1 virtual router(s) - this probably isn't what you want - removing Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Assigned address 192.168.15.41 for interface bond0 Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: (vrrp): Success creating VMAC interface vrrp Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: (vrrp): entering FAULT state (interface vrrp down) Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: (vrrp) entering FAULT state Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Registering gratuitous ARP shared channel Nov 24 10:14:49 ubuntu Keepalived_vrrp[15685]: Delaying startup for 30 seconds Nov 24 10:14:50 ubuntu Keepalived_vrrp[15685]: Netlink reports bond0 down Nov 24 10:14:51 ubuntu Keepalived_vrrp[15685]: Netlink reports bond0 up Nov 24 10:14:51 ubuntu Keepalived_vrrp[15685]: Netlink reports vrrp up Nov 24 10:14:51 ubuntu Keepalived_vrrp[15685]: (vrrp) Entering BACKUP STATE
Did keepalived coredump? (gdb) bt
0 vrrp_compute_timer (sock=0x0) at vrrp_scheduler.c:345
1 vrrp_thread_requeue_read (vrrp=0x55f2d2f53fa0) at vrrp_scheduler.c:351
2 try_up_instance (vrrp=0x55f2d2f53fa0, leaving_init=leaving_init@entry=false) at vrrp_scheduler.c:690
3 0x000055f2d0e9180f in process_if_status_change (ifp=ifp@entry=0x55f2d2f51ae0) at keepalived_netlink.c:1546
4 0x000055f2d0e918a0 in update_interface_flags (ifp=ifp@entry=0x55f2d2f51ae0, ifi_flags=) at keepalived_netlink.c:1576
5 0x000055f2d0e91cb4 in netlink_link_filter (h=0x55f2d2f52690, snl=) at keepalived_netlink.c:2172
6 0x000055f2d0e9241b in netlink_broadcast_filter (snl=, h=) at keepalived_netlink.c:2316
7 0x000055f2d0e8eeff in netlink_parse_info (filter=filter@entry=0x55f2d0e921c0, nl=nl@entry=0x55f2d10e4350 , n=n@entry=0x0,
8 0x000055f2d0e8f4b1 in kernel_netlink (thread=) at keepalived_netlink.c:2346
9 0x000055f2d0ebf7e8 in thread_call (thread=0x55f2d2f53f20) at scheduler.c:1876
10 process_threads (m=0x55f2d2f4c0e0) at scheduler.c:1940
11 0x000055f2d0ebfdf1 in launch_thread_scheduler (m=) at scheduler.c:2047
12 0x000055f2d0e966e8 in start_vrrp_child () at vrrp_daemon.c:1044
13 0x000055f2d0e88d84 in start_keepalived (thread=) at main.c:530
14 0x000055f2d0ebf7e8 in thread_call (thread=0x55f2d2f4cb80) at scheduler.c:1876
15 process_threads (m=0x55f2d2f4ace0) at scheduler.c:1940
16 0x000055f2d0ebfdf1 in launch_thread_scheduler (m=) at scheduler.c:2047
17 0x000055f2d0e8a371 in keepalived_main (argc=2, argv=0x7fff39379de8) at main.c:2384
18 0x00007fee4d036b97 in __libc_start_main (main=0x55f2d0e88270, argc=2, argv=0x7fff39379de8, init=, fini=, rtld_fini=,
19 0x000055f2d0e882aa in _start ()
Additional context NA