acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

VRRP using snmp and notify fifo script, fifo receives invalid notify change when snmpd is restarted #1570

Closed dpajin closed 4 years ago

dpajin commented 4 years ago

Describe the bug

I am using SNMP with keepalived, VRRP, and a notify FIFO script. When snmpd is restarted, the keepalived VRRP process (Keepalived_vrrp) is restarted as well, but after the restart the VRRP notify FIFO script does not receive the usual "INSTANCE " message. Instead it receives junk, such as: fifo received: NTNE"pnt"BCU 4

Here is the log:

May 12 17:04:51 RMIMH05S snmpd[25547]: Received TERM or STOP signal...  shutting down...
May 12 17:04:51 RMIMH05S systemd[1]: Stopping Simple Network Management Protocol (SNMP) Daemon....
May 12 17:04:51 RMIMH05S Keepalived_vrrp[25687]: AgentX master disconnected us, reconnecting in 15
May 12 17:04:51 RMIMH05S Keepalived_vrrp[25687]: scheduler: There is already read event 0x55cbc5789a10 (read 0x55cbc5789750) registered on fd [16]
May 12 17:04:51 RMIMH05S systemd[1]: Stopped Simple Network Management Protocol (SNMP) Daemon..
May 12 17:04:51 RMIMH05S systemd[1]: Starting Simple Network Management Protocol (SNMP) Daemon....
May 12 17:04:51 RMIMH05S systemd[1]: Started Simple Network Management Protocol (SNMP) Daemon..
May 12 17:04:51 RMIMH05S snmpd[26755]: Turning on AgentX master support.
May 12 17:04:51 RMIMH05S snmpd[26755]: NET-SNMP version 5.7.3
May 12 17:05:00 RMIMH05S snmpd[26755]: Connection from UDP: [127.0.0.1]:33743->[127.0.0.1]:161
May 12 17:05:05 RMIMH05S opennti keepalived_check_opennti.sh: Status OK, node is running.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[25687]: NET-SNMP version 5.7.3 AgentX subagent connected
May 12 17:05:06 RMIMH05S kernel: [ 5743.540611] traps: keepalived[25687] general protection fault ip:55cbc3cd349d sp:7ffd3788ba28 error:0 in keepalived[55cbc3c61000+9b000]
May 12 17:05:06 RMIMH05S Keepalived[25684]: Keepalived_vrrp exited due to segmentation fault (SIGSEGV).
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Please report a bug at https://github.com/acassen/keepalived/issues
May 12 17:05:06 RMIMH05S Keepalived[25684]:   and include this log from when keepalived started, a description
May 12 17:05:06 RMIMH05S Keepalived[25684]:   of what happened before the crash, your configuration file and the details below.
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Also provide the output of keepalived -v, what Linux distro and version
May 12 17:05:06 RMIMH05S Keepalived[25684]:   you are running on, and whether keepalived is being run in a container or VM.
May 12 17:05:06 RMIMH05S Keepalived[25684]:   A failure to provide all this information may mean the crash cannot be investigated.
May 12 17:05:06 RMIMH05S Keepalived[25684]:   If you are able to provide a stack backtrace with gdb that would really help.
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Source version 2.0.20
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Built with kernel headers for Linux 4.15.18
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Running on Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Command line: '/usr/local/sbin/keepalived' '--log-detail' '--vrrp' '--snmp'
May 12 17:05:06 RMIMH05S Keepalived[25684]:   configure options: --enable-snmp
May 12 17:05:06 RMIMH05S Keepalived[25684]:   Config options: LIBIPTC LIBIPSET_DYNAMIC NFTABLES LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   SNMP_VRRP SNMP_CHECKER
May 12 17:05:06 RMIMH05S Keepalived[25684]:   System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME RTA_TTL_PROPAGATE
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   LIBIPSET_PRE_V7 NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK
May 12 17:05:06 RMIMH05S Keepalived[25684]:                   SCHED_RESET_ON_FORK
May 12 17:05:06 RMIMH05S Keepalived[25684]: VRRP child process(25687) died: Respawning
May 12 17:05:06 RMIMH05S Keepalived[25684]: Starting VRRP child process, pid=26850
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Registering Kernel netlink reflector
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Registering Kernel netlink command channel
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Opening file '/etc/keepalived/keepalived.conf'.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Starting SNMP subagent
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: NET-SNMP version 5.7.3 AgentX subagent connected
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Unsafe permissions found for script '/etc/keepalived/vrrp/check_opennti.sh'.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Unsafe permissions found for script '/etc/keepalived/vrrp/notify_vrrp_fifo.sh'.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: SECURITY VIOLATION - scripts are being executed but script_security not enabled. There are insecure scripts.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Assigned address 10.223.30.230 for interface bond3
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Assigned address fe80::2c2a:13ff:fea4:fd0d for interface bond3
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: (opennti) Changing effective priority from 250 to 247
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Registering gratuitous ARP shared channel
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: (opennti) removing VIPs.
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: VRRP sockpool: [ifindex(11), family(IPv4), proto(112), unicast(0), fd(20,21)]
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: Missed 1 messages on CPU 8
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: VRRP_Script(opennti_check) succeeded
May 12 17:05:06 RMIMH05S Keepalived_vrrp[26850]: (opennti) Entering BACKUP STATE
May 12 17:05:06 RMIMH05S opennti vrrp_notify: fifo received: NTNE"pnt"BCU 4
May 12 17:05:06 RMIMH05S vrrp_notify_console: (BACKUP -> ) OpenNTI Node in unknown state
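
One observation about the junk line, offered as a sketch rather than a diagnosis. Assuming the FIFO message would normally have been `INSTANCE "opennti" BACKUP 247` (the instance name and priority are taken from the log above; the exact message layout is an assumption), the received string is exactly every second character of it. That pattern would be consistent with two readers draining the same FIFO one byte at a time, for example a notify script left over from the crashed process alongside the new one, although nothing in this report confirms that. The function name `every_second_char` is hypothetical.

```shell
#!/bin/sh
# Assumed normal FIFO message (name and priority taken from the log above).
expected='INSTANCE "opennti" BACKUP 247'
# What the FIFO script actually logged.
received='NTNE"pnt"BCU 4'

# Keep only the characters at even positions (2nd, 4th, ...).
every_second_char() {
    rest=$1
    out=''
    while [ "${#rest}" -ge 2 ]; do
        rest=${rest#?}               # drop the odd-position character
        out=$out${rest%"${rest#?}"}  # keep the first remaining character
        rest=${rest#?}               # and remove it from the input
    done
    printf '%s\n' "$out"
}

if [ "$(every_second_char "$expected")" = "$received" ]; then
    echo "junk matches every second character of the expected line"
else
    echo "no match"
fi
```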

Keepalived version: 2.0.20

Configuration file:

# Global definitions configuration block
global_defs {
    enable_snmp_vrrp
    enable_snmp_checker
    vrrp_version 3
    script_user root
    dynamic_interfaces
    umask 0666
    # VRRP notify using FIFO
    vrrp_notify_fifo "/tmp/vrrp_notify_fifo"
    vrrp_notify_fifo_script "/etc/keepalived/vrrp/notify_vrrp_fifo.sh"
}

vrrp_script opennti_check {
    script       "/etc/keepalived/vrrp/check_opennti.sh"
    interval 5   # in seconds
    timeout 15
    fall 1       # require 1 failures for KO
    rise 1       # require 1 successes for OK
}

vrrp_track_file opennti_track_file {
    file "/etc/keepalived/vrrp/opennti_track_file"
    weight -1
    init_file 0
}

...

pqarmitage commented 4 years ago

The log above shows that the keepalived vrrp process segfaulted. Are you able to provide a stack backtrace using gdb so that we can see where the problem occurred?

dpajin commented 4 years ago

Unfortunately, I don't know how to do that. Could you suggest how? I tried the following, but either I don't get a stack trace or I don't know where to look for it:

$ sudo gdb -q -batch -ex run -ex backtrace -ex 'thread apply all backtrace' --args /usr/local/sbin/keepalived --log-detail --vrrp --snmp
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Inferior 1 (process 28050) exited normally]
No stack.

$ sudo gdb -q -batch -ex run -ex 'thread apply all backtrace' --args /usr/local/sbin/keepalived --log-detail --vrrp --snmp
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Inferior 1 (process 28193) exited normally]

pqarmitage commented 4 years ago

After the segfault occurs, a coredump should be produced. The configuration files:

/etc/systemd/coredump.conf
/etc/systemd/coredump.conf.d/*.conf
/run/systemd/coredump.conf.d/*.conf
/usr/lib/systemd/coredump.conf.d/*.conf

should help identify the location of the corefile.

Once you have located the coredump file, run gdb <PATH_TO_KEEPALIVED> <PATH_TO_COREDUMP> and then at the gdb prompt, type bt. The output of that should be the stack backtrace.
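
For a non-interactive variant of the steps above, here is a minimal sketch. The function name `backtrace_from_core` is hypothetical; the gdb options (`-q`, `-batch`, `-ex`) are the same ones used earlier in this thread. On systems where systemd-coredump stores the core files, `coredumpctl list` and `coredumpctl gdb` may be a shorter route, but whether that applies depends on how the distro is configured.

```shell
#!/bin/sh
# Print a stack backtrace from a core file without an interactive gdb session.
# Usage: backtrace_from_core <PATH_TO_KEEPALIVED> <PATH_TO_COREDUMP>
backtrace_from_core() {
    binary=$1
    core=$2
    # Fail early if either file is missing, so gdb's output is not misread.
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: backtrace_from_core <binary> <core file>" >&2
        return 2
    fi
    # 'bt' prints the backtrace of the crashing thread; the second command
    # covers all threads in case there is more than one.
    gdb -q -batch -ex bt -ex 'thread apply all bt' "$binary" "$core"
}
```

Call it with the keepalived binary that produced the crash and the core file located via the configuration files listed above.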

dpajin commented 4 years ago

In the meantime, I noticed a few more things worth mentioning:

Version:

$ keepalived --version
Keepalived v2.0.20 (01/22,2020)

Copyright(C) 2001-2020 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.15.18
Running on Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020

configure options: --enable-snmp

Config options:  LIBIPTC LIBIPSET_DYNAMIC NFTABLES LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_VRRP SNMP_CHECKER

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC LIBIPSET_PRE_V7 NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

araujorm commented 4 years ago

@dpajin You have snmp enabled, if you don't need it just remove the lines enable_snmp_vrrp and enable_snmp_checker.

@pqarmitage I'm having the same issue, with just enable_snmp_vrrp. I need SNMP, however.

Fresh compilation of keepalived v2.0.20 on CentOS 8 (built the latest source in an RPM and compiled with the same options as the EPEL package, as follows):

Keepalived v2.0.20 (01/22,2020)

Copyright(C) 2001-2020 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Thu Apr 9 13:49:54 UTC 2020

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-snmp --enable-snmp-rfc --enable-sha1 --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options:  LIBIPTC LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPTC NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

This also happens with the EPEL release, which is too old (2.0.10) for us to care right now I guess, so I was hoping the latest version would have fixed it, but no luck.

I'm going to try to generate the core dump you asked for and send the info.

araujorm commented 4 years ago

OK, I was able to load the core dump in gdb. I also installed the debuginfo packages, which should make this easier (gdb still complains that some are missing, although they are installed). So far I have this:

Core was generated by `/usr/sbin/keepalived -D'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __rb_erase_augmented (augment=<optimized out>, leftmost=0x55a419686308, root=0x55a419686300, node=0x55a419713258) at rbtree_augmented.h:205
205         tmp = child->rb_left;
Missing separate debuginfos, use: dnf debuginfo-install audit-libs-3.0-0.13.20190507gitf58ec40.el8.x86_64 openssl-libs-1.1.1c-2.el8_1.1.x86_64 rpm-libs-4.14.2-26.el8_1.x86_64 sssd-client-2.2.0-19.el8_1.1.x86_64
(gdb) bt
#0  __rb_erase_augmented (augment=<optimized out>, leftmost=0x55a419686308, root=0x55a419686300, node=0x55a419713258) at rbtree_augmented.h:205
#1  rb_erase_cached (node=node@entry=0x55a419713258, root=root@entry=0x55a419686300) at rbtree.c:479
#2  0x000055a418ca3c95 in thread_move_ready (type=16, thread=0x55a419713210, root=0x55a419686300, m=0x55a419686300) at scheduler.c:1718
#3  thread_fetch_next_queue (m=0x55a419686300) at scheduler.c:1718
#4  process_threads (m=0x55a419686300) at scheduler.c:1790
#5  0x000055a418ca42a5 in launch_thread_scheduler (m=<optimized out>) at scheduler.c:1942
#6  0x000055a418c6ebeb in start_vrrp_child () at vrrp_daemon.c:1047
#7  start_vrrp_child () at vrrp_daemon.c:917
#8  0x000055a418c6ec36 in vrrp_respawn_thread (thread=<optimized out>) at vrrp_daemon.c:859
#9  0x000055a418ca3996 in thread_call (thread=0x55a41968d1c0) at scheduler.c:1834
#10 process_threads (m=0x55a419686f20) at scheduler.c:1834
#11 0x000055a418ca42a5 in launch_thread_scheduler (m=<optimized out>) at scheduler.c:1942
#12 0x000055a418c4b7c4 in keepalived_main (argc=2, argv=<optimized out>) at main.c:2220
#13 0x00007efe2b65c873 in __libc_start_main (main=0x55a418c498b0 <main>, argc=2, argv=0x7fff19ca7d58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fff19ca7d48) at ../csu/libc-start.c:308
#14 0x000055a418c498ee in _start ()

Is there anything else I can provide to help fix this?

Thanks in advance.

pqarmitage commented 4 years ago

@araujorm Do you reload the keepalived configuration before the segfault? (@dpajin indicates that he has done a reload.)

The stack backtrace, and the symptoms described, look very much like issue #1561. Can you please try using keepalived at commit 1344729 and see if that resolves the problem?

araujorm commented 4 years ago

No, it happens when I start keepalived with the enable_snmp_vrrp option. In version 2.0.10 it happened every time that option was active; now it only happens if keepalived is master (and it then loops, crashing and restarting, and all hell breaks loose because the notify scripts are launched multiple times in parallel).
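
Independently of the crash itself, a notify script can be guarded against overlapping runs. The following is a minimal sketch, assuming `flock` from util-linux is installed; the function name `run_exclusive`, the lock path in the usage note, and the exit code 75 are illustrative choices, not something keepalived provides.

```shell
#!/bin/sh
# Run a command only if no other instance currently holds the lock file.
# Usage: run_exclusive <lockfile> <command> [args...]
run_exclusive() {
    lockfile=$1
    shift
    (
        # Take an exclusive, non-blocking lock on fd 9; give up at once
        # if another run already holds it.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lockfile"
}
```

Inside a notify script this could be used as `run_exclusive /run/keepalived-change.lock systemctl restart snmpd` (the lock path is hypothetical). The lock is released when the subshell exits and fd 9 is closed.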

Will try the commit you mentioned tomorrow.

dpajin commented 4 years ago

> @dpajin You have snmp enabled, if you don't need it just remove the lines enable_snmp_vrrp and enable_snmp_checker.

@araujorm, good catch, my bad! Thanks!

araujorm commented 4 years ago

Hello.

I updated to commit 134472979b602128302112c69a5be0be98c36f58, but the issue still persists exactly as before. According to gdb, it crashes in the exact same spot:

#0  __rb_erase_augmented (augment=<optimized out>, leftmost=0x55cbcda172d8, root=0x55cbcda172d0, node=0x55cbcdaa4968) at rbtree_augmented.h:205
205         tmp = child->rb_left;

(gdb) print child
$1 = (struct rb_node *) 0x322e38392e353831

(gdb) bt
#0  __rb_erase_augmented (augment=<optimized out>, leftmost=0x55cbcda172d8, root=0x55cbcda172d0, node=0x55cbcdaa4968) at rbtree_augmented.h:205
#1  rb_erase_cached (node=node@entry=0x55cbcdaa4968, root=root@entry=0x55cbcda172d0) at rbtree.c:479
#2  0x000055cbcc8afa14 in thread_move_ready (type=16, thread=0x55cbcdaa4920, root=0x55cbcda172d0, m=0x55cbcda172d0) at scheduler.c:1762
#3  thread_fetch_next_queue (m=0x55cbcda172d0) at scheduler.c:1762
#4  process_threads (m=0x55cbcda172d0) at scheduler.c:1834
#5  0x000055cbcc8b0035 in launch_thread_scheduler (m=<optimized out>) at scheduler.c:1989
#6  0x000055cbcc879c91 in start_vrrp_child () at vrrp_daemon.c:1120
#7  start_vrrp_child () at vrrp_daemon.c:990
#8  0x000055cbcc850c92 in start_keepalived (thread=<optimized out>) at main.c:530
#9  0x000055cbcc8af6ee in thread_call (thread=0x55cbcda1ce50) at scheduler.c:1882
#10 process_threads (m=0x55cbcda17ed0) at scheduler.c:1882
#11 0x000055cbcc8b0035 in launch_thread_scheduler (m=<optimized out>) at scheduler.c:1989
#12 0x000055cbcc85304f in keepalived_main (argc=2, argv=<optimized out>) at main.c:2392
#13 0x00007f3e59911873 in __libc_start_main (main=0x55cbcc850b10 <main>, argc=2, argv=0x7ffce2a36568, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffce2a36558) at ../csu/libc-start.c:308
#14 0x000055cbcc850b4e in _start ()

It happens every time I use enable_snmp_vrrp.

I think I know when this is happening: we have a notify script that restarts some services when the state changes. One of those is the SNMP service (to ensure Cacti and the like don't start misbehaving when the OpenVPN tunnels are also restarted). This used to work fine in older versions of keepalived such as 1.3.5, which recovered the AgentX master connection without trouble; in recent versions, however, the keepalived VRRP process just segfaults.

Excerpt from the log when the crash occurs (the notify script sends its output to syslog with the tag keepalived-change):

May 16 19:51:12 machine Keepalived_vrrp[6107]: Sending gratuitous ARP on <secure_cut_iface> for <secure_cut_ip>
May 16 19:51:12 machine keepalived-change[6416]: Redirecting to /bin/systemctl restart snmpd.service
May 16 19:51:12 machine systemd[1]: Stopping Simple Network Management Protocol (SNMP) Daemon....
May 16 19:51:12 machine snmpd[4393]: Received TERM or STOP signal...  shutting down...
May 16 19:51:12 machine Keepalived_vrrp[6107]: AgentX master disconnected us, reconnecting in 15
May 16 19:51:12 machine Keepalived_vrrp[6107]: scheduler: There is already read event 0x55cbcdaa4380 (read 0x55cbcdaa4280) registered on fd [16]
May 16 19:51:12 machine kernel: traps: keepalived[6107] general protection fault ip:55cbcc8b63e1 sp:7ffce2a358d8 error:0 in keepalived[55cbcc846000+9b000]
May 16 19:51:12 machine systemd[1]: Stopped Simple Network Management Protocol (SNMP) Daemon..
May 16 19:51:12 machine systemd[1]: Starting Simple Network Management Protocol (SNMP) Daemon....
May 16 19:51:12 machine systemd[1]: Started Process Core Dump (PID 6427/UID 0).
May 16 19:51:12 machine snmpd[6430]: Turning on AgentX master support.
May 16 19:51:12 machine snmpd[6430]: Turning on AgentX master support.
May 16 19:51:12 machine snmpd[6430]: NET-SNMP version 5.8
May 16 19:51:12 machine systemd[1]: Started Simple Network Management Protocol (SNMP) Daemon..
(...)
May 16 19:51:13 machine Keepalived[6106]: pid 6107 exited due to segmentation fault (SIGSEGV).
May 16 19:51:13 machine Keepalived[6106]:  Please report a bug at https://github.com/acassen/keepalived/issues
May 16 19:51:13 machine Keepalived[6106]:  and include this log from when keepalived started, a description
May 16 19:51:13 machine Keepalived[6106]:  of what happened before the crash, your configuration file and the details below.
May 16 19:51:13 machine Keepalived[6106]:  Also provide the output of keepalived -v, what Linux distro and version
May 16 19:51:13 machine Keepalived[6106]:  you are running on, and whether keepalived is being run in a container or VM.
May 16 19:51:13 machine Keepalived[6106]:  A failure to provide all this information may mean the crash cannot be investigated.
May 16 19:51:13 machine Keepalived[6106]:  If you are able to provide a stack backtrace with gdb that would really help.
May 16 19:51:13 machine Keepalived[6106]:  Source version 2.0.20
May 16 19:51:13 machine Keepalived[6106]:  Built with kernel headers for Linux 4.18.0
May 16 19:51:13 machine Keepalived[6106]:  Running on Linux 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Thu Apr 9 13:49:54 UTC 2020
May 16 19:51:13 machine Keepalived[6106]:  Command line: '/usr/sbin/keepalived' '-D'
May 16 19:51:13 machine Keepalived[6106]:  configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix=
May 16 19:51:13 machine Keepalived[6106]:                     --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
May 16 19:51:13 machine Keepalived[6106]:                     --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
May 16 19:51:13 machine Keepalived[6106]:                     --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec
May 16 19:51:13 machine Keepalived[6106]:                     --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
May 16 19:51:13 machine Keepalived[6106]:                     --infodir=/usr/share/info --enable-snmp --enable-snmp-rfc --enable-sha1
May 16 19:51:13 machine Keepalived[6106]:                     --with-init=systemd build_alias=x86_64-redhat-linux-gnu
May 16 19:51:13 machine Keepalived[6106]:                     host_alias=x86_64-redhat-linux-gnu
May 16 19:51:13 machine Keepalived[6106]:                     PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe
May 16 19:51:13 machine Keepalived[6106]:                     -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
May 16 19:51:13 machine Keepalived[6106]:                     -fexceptions -fstack-protector-strong -grecord-gcc-switches
May 16 19:51:13 machine Keepalived[6106]:                     -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
May 16 19:51:13 machine Keepalived[6106]:                     -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
May 16 19:51:13 machine Keepalived[6106]:                     -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection
May 16 19:51:13 machine Keepalived[6106]:                     LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
May 16 19:51:13 machine Keepalived[6106]:  Config options: LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2
May 16 19:51:13 machine Keepalived[6106]:                  SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3
May 16 19:51:13 machine Keepalived[6106]:  System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API
May 16 19:51:13 machine Keepalived[6106]:                  LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN
May 16 19:51:13 machine Keepalived[6106]:                  FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS
May 16 19:51:13 machine Keepalived[6106]:                  FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL
May 16 19:51:13 machine Keepalived[6106]:                  FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS
May 16 19:51:13 machine Keepalived[6106]:                  IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA IPTABLES
May 16 19:51:13 machine Keepalived[6106]:                  NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY
May 16 19:51:13 machine Keepalived[6106]:                  IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC
May 16 19:51:13 machine Keepalived[6106]:                  SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK
May 16 19:51:13 machine Keepalived[6106]:                  SCHED_RESET_ON_FORK
May 16 19:51:13 machine Keepalived[6106]: VRRP child process(6107) died: Respawning

Note the "There is already read event" message, followed immediately by the segfault. I think that is where the problem lies.
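
To check whether another crash matches this signature, a log can be scanned for the scheduler warning that follows an AgentX disconnect from the same process. A minimal sketch, assuming the log lines look like the excerpt above; the function name `find_agentx_race` and the log path in the usage note are hypothetical.

```shell
#!/bin/sh
# Print each "already read event" scheduler warning that follows an
# "AgentX master disconnected" message from the same logging process.
find_agentx_race() {
    awk '
        /AgentX master disconnected us/ {
            # Remember which process (e.g. Keepalived_vrrp[6107]:) lost AgentX.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^Keepalived_vrrp\[[0-9]+\]:$/) seen[$i] = 1
            next
        }
        /scheduler: There is already read event/ {
            # Report the warning only if that process was seen disconnecting.
            for (i = 1; i <= NF; i++)
                if (($i) in seen) { print; break }
        }
    ' "$@"
}
```

For example, `find_agentx_race /var/log/messages` (adjust the path to wherever syslog output lands on your distro).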

araujorm commented 4 years ago

Hello.

I've created a simple way to always reproduce this issue, with the following config files.

vrrp_instance vrrp1 {
    state BACKUP
    interface enp1s0 # <-- CHANGE HERE TO MEET YOUR INTERFACE
    virtual_router_id 20
    advert_int 1
    authentication {
        auth_type AH
        auth_pass somthing
    }
    notify /etc/keepalived/keepalived-change.sh
}


* `/etc/keepalived/keepalived-change.sh`:

#!/bin/bash

systemctl restart snmpd
exit 0

(don't forget to `chmod +x /etc/keepalived/keepalived-change.sh`)

Next, ensure you have snmpd installed (the net-snmp package on RHEL-based OSes), add `master agentx` to `/etc/snmp/snmpd.conf` (or the equivalent on your distro), and start snmpd (e.g. `systemctl start snmpd`). Then start keepalived in the foreground with:

keepalived -D -n -l


Results of the crash loop, as posted before, should be almost instantaneous.

Reproducible on all versions of keepalived (including master) at least since version 2.0.10. Confirmed on fresh CentOS 8 installation, fully updated, with keepalived that comes with epel, also one built from commit 134472979b602128302112c69a5be0be98c36f58 and also one built from master HEAD (currently ab568a70c5d36c8cfe7b23b24a1891540ed479fa), all same result.
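
Before running the reproduction, it may help to confirm that snmpd is actually configured as an AgentX master. A minimal sketch; the function name `has_agentx_master` is hypothetical, and the path is the one mentioned above.

```shell
#!/bin/sh
# Succeed if the given snmpd.conf contains an uncommented "master agentx" line.
has_agentx_master() {
    grep -Eq '^[[:space:]]*master[[:space:]]+agentx([[:space:]]|$)' "$1"
}

# Example check against the location used in this thread.
if has_agentx_master /etc/snmp/snmpd.conf 2>/dev/null; then
    echo "AgentX master support is configured"
else
    echo "add 'master agentx' to snmpd.conf and restart snmpd"
fi
```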

pqarmitage commented 4 years ago

@araujorm Many thanks for the info. I will have a look at this in the next few days.

pqarmitage commented 4 years ago

Commit 616ad32 resolves this issue.