Keepalived: scheduler: epoll_wait error: 4

VigneshSP94 commented 3 years ago

Describe the bug

Keepalived fails to update the IPVS.

This has already been reported by my colleague in this ticket . But this time, the epoll error code is different.

We have upgraded the keepalived to 2.2.1 and had it running for a few weeks now. We sometimes set the weights of real servers to 0 to stop distributing traffic to a particular server. We noticed the traffic distribution didn't stop despite of changing the weights.

When checking the logs, we see the epoll_err to be happening again.

The service reloads and healthcheck starts,

Mar 26 04:25:48 hostname systemd[1]: Reloading Keepalive Daemon (LVS).
Mar 26 04:25:48 hostname systemd[1]: Reloaded Keepalive Daemon (LVS).
Mar 26 04:25:48 hostname Keepalived[21974]: Reloading ...
Mar 26 04:25:48 hostname Keepalived[21974]: Configuration file confile.conf
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Reloading
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Got SIGHUP, reloading checker configuration
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Reinitializing ipvs
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Activating healthchecker for service [x.x.x.x]:tcp:80 for VS [x.x.x.x]:tcp:80
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Activating healthchecker for service [x.x.x.y]:tcp:80 for VS [x.x.x.x]:tcp:80
#health checks for all real servers
.
.
Mar 26 04:25:48 hostname Keepalived_healthcheckers[21976]: Activating healthchecker for service [100.97.80.5]:tcp:80 for VS [100.97.126.57]:tcp:80
Mar 26 04:25:4 host nane Keepalived_healthcheckers[21976]: Activating healthchecker for service [100.97.80.6]:tcp:80 for VS [100.97.126.57]:tcp:80
Mar 26 04:28:56 hostname Keepalived[21974]: scheduler: epoll_wait error: 4 (Interrupted system call)

Healthcheck for all real servers activated, now we expect to see a succeed message for VS, but the epoll happens and change to IPVS are not propagated.

virtual_server 100.97.126.117 80 {
        delay_loop 3
        lvs_sched mh
        mh-port
        mh-fallback
        lvs_method TUN
        protocol TCP
        real_server 100.97.80.4 80 {
                weight 0
                HTTP_GET {
                        url {
                                path "/ping"
                                status_code 200

root@ipvshost:~# ipvsadm -Ln | grep 100.97.126.117 -A3
TCP  100.97.126.117:80 mh (mh-fallback,mh-port)
  -> 100.97.80.4:80               Tunnel  10     0          0
  -> 100.97.80.5:80               Tunnel  10     0          0
  -> 100.97.80.6:80               Tunnel  10     0          0

We noticed some fixes to the server weights in this commit and tried v2.2.2, but noticed the issue in this version as well.

To Reproduce

The issue is completely sporadic.

Expected behavior We expect the keepalived to update ipvs tables.

Keepalived version

root@ipvshost:~# keepalived -v
Keepalived v2.2.1 (01/17,2021)

Copyright(C) 2001-2021 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.15.18
Running on Linux 5.4.0-54-generic #60~18.04.1-Ubuntu SMP Fri Nov 6 17:25:16 UTC 2020
Distro: Ubuntu 18.04.1 LTS

configure options: --prefix=/

Config options:  LIBIPSET_DYNAMIC NFTABLES LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 MEMFD_CREATE IPV4_DEVCONF IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPSET_PRE_V7 IPTABLES NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

Distro (please complete the following information):

Name: Ubuntu
Version: 18.04
Architecture: x86_64

Details of any containerisation or hosted service (e.g. AWS) If keepalived is being run in a container or on a hosted service, provide full details

Configuration file:

virtual_server 192.0.2.1 80 {
        delay_loop 3
        lvs_sched mh
        mh-port
        mh-fallback
        lvs_method TUN
        protocol TCP
        real_server 198.51.100.1 80 {
                weight 10
                HTTP_GET {
                        url {
                                path "/ping"
                                status_code 200
                        }
                        connect_ip 198.51.100.1
                        connect_port 9001
                        connect_timeout 2
                }
        }
        real_server 198.51.100.2 80 {
                weight 10
                HTTP_GET {
                        url {
                                path "/ping"
                                status_code 200
                        }
                        connect_ip 198.51.100.2
                        connect_port 9001
                        connect_timeout 2
                }
        }
        [... and more ...]
}
[... and more ...]

Notify and track scripts None

System Log entries

Mar 26 04:28:56 hostnane Keepalived[21974]: scheduler: epoll_wait error: 4 (Interrupted system call)

Did keepalived coredump? NO

Additional context No

pqarmitage commented 3 years ago

I note that you are building keepalived with Linux 4.15.18 headers, whereas you appear to be running it on a 5.4.54 kernel. It would be sensible to build keepalived with the kernel headers for the system on which you intend to run keepalived; although it shouldn't cause a problem running on a newer kernel, you would benefit from the kernel features that have been added between 4.15.18 and 5.4.54.

This appears to be a different issue from what was reported in issue #1809. Issue #1809 related to an invalid file descriptor being passed to epoll_wait() in the checker process, causing an EINVAL error. This issue, not only is it occurring in the parent process as opposed to the checker process, but it is due to EINTR being returned.

epoll_wait() errno 4 is EINTR which means that an asynchronous signal handler has run. _UPDATE: Actually this is wrong. On Linux EINTR can be returned to epollwait() if the process is sent SIGSTOP, and this cannot run an asynchronous signal handler. See first comment on 22 July below.

In /proc/PID/status for the keepalived parent process, the relevant entries that I get are:

SigBlk: c000000800014a03
SigIgn: 3ffffff67ffab034
SigCgt: 0000000180000000

and it would be helpful if you could confirm that you have the same entries.

Most signals are ignored, signals HUP, INT, USR1, USR2, TERM, CHLD, 63, 64 are blocked, signals 32 and 33 are caught, and ILL, BUS, FPE, KILL, SEGV, STOP have their default action, which is terminate for KILL, stop the process for STOP, and terminate and dump core for the rest.

The blocked signals are blocked because the signals are received via a signalfd, and so will not cause EINTR to be returned from epoll_wait(). Likewise the ignored signals will not cause EINTR to be returned, nor will the signals that have their default action. The only signals that are handled asynchronously, and hence can cause EINTR to be returned are signals 32 and 33. These signals are used by the NPTL pthread library (see NPTL(7) man page for more details), and although keepalived does not use Linux threads, linking with the crypto, ssl or nl-3 libraries causes the pthread library to also be linked, and this causes signals 32 and 33 to be caught.

The code surrounding epoll_wait() in lib/scheduler.c handling epoll_wait() returning EINTR appears to work correctly, even with check_EINTR defined to do nothing. It would be worth running configure with the --enable-eintr-debug=check option, since that will enable checking for EINTR after epoll_wait() is called, and in other relevant places.

I have run the following script to send all possible signals other than those that cause a specific action to the keepalived parent process:

PID=$(cat /run/checkers.pid)
for i in 3 5 6 13 14 16 18 $(seq 20 31) $(seq 34 62); do kill -$i $PID ; sleep 0.1 ; done

This sends each signal the keepalived does not use except ILL, BUS, FPE, KILL, SEGV, STOP to the keepalived process and epoll_wait() does not return an error for any of them. If signal 32 or signal 33 is send to the keepalived process, then I get the scheduler: epoll_wait error: 4 (Interrupted system call) log message.

The challenge is discovering why signal 32 or signal 33 is being sent to the keepalived parent process. However, this should have no effect whatsoever on the checker child process, although if the parent process has stopped responding properly, it will not be able to propagate a HUP signal to its child processes to initiate a reload. For me, after sending signal 32 and/or signal 33 to the parent process does not cause any problem to the parent process and it continues propagating signals to its child processes.

VigneshSP94 commented 3 years ago

Thanks for the detailed review.

Here is what we see in the PID's status file

SigBlk: 4000000000014a03
SigIgn: bffffffe7ffab034
SigCgt: 0000000180000000

We have compiled the keepalived with EINTR debug option, will report as soon as we see an err.

We suspect, as written in the previous bug report

Following the epoll_wait error, and assuming that keepalived continues to get the same error every time it calls epoll_wait (it only logs the first error to avoid filling up log files), keepalived will just be sitting in a loop calling epoll_wait and sleeping for 1 second. The healthchecker process will therefore not respond to the signal from its parent process to reload, since it receives the signal via a signalfd, and that is notified via the epoll_wait call. I can certainly add some code to make the keepalived process exit if the epoll_wait error is persistent, and it will then be reloaded by the parent process; this would provide some form of recovery.

The ipvs tables doesn't get updated as soon as we change weights, We suspect that it may come back to life when the keepalived parent process reloads/restarts the child healthcheck process

Not sure, if this information is relevant, we use fork type unit file in systemd rather than notify

[Unit]
Description = LBaaS Keepalive Daemon (LVS)
After = syslog.target network-online.target
Wants = network-online.target
ConditionFileNotEmpty = /keepalived.conf

[Service]
LimitNOFILE = 8192
Type = forking
ExecStart = /sbin/keepalived
ExecReload = /bin/kill -HUP $MAINPID
KillMode = process
Restart = always

[Install]
WantedBy = multi-user.target

pqarmitage commented 3 years ago

The two differences in signals is that you are ignoring signal 36, whereas for me it is blocked (this is the signal for dumping the configuration in JSON format and I assume that you do not have JSON support enabled in your build of keepalived), and signal 64 which likewise you are ignoring and I have blocked (this is a signal used for debugging purposes so it is right that you should be ignoring it).

This comes back to the question of why is keepalived receiving signal 32 to signal 33.

If you send signal 32 or signal 33 to the parent keepalived process, can you reproduce the problem? With the testing I have done with these signals I cannot make the parent process stop responding to signals and reloads continue working.

With regard to your quote: Following the epoll_wait error ..., this description is no longer current.

The current situation is that keepalived only logs the first occurrence of a specific errno returned from epoll_wait(). If a different errno is returned then that will be logged. If the errno is one of EBADF, EFAULT or EINVAL, then keepalived will sleep for one second. If the same error occurs 5 times in succession, and it is not the parent keepalived process that is experiencing the error, then the child process will exit, and the parent process will restart it. See the handling if if (ret < 0) after the call of epoll_wait() in lib/scheduler.c for more details.

You state The ipvs tables doesn't get updated as soon as we change weights. I assume you are using a reload to change the weights. What command are you executing to do that? What to the logs show when the command is executed.

If you can provide a copy of your complete configuration I will test that myself and see if I can work out what is happening. I am not sure how weight 0 will work without inhibit_on_failure being set.

Have you considered using FILE_CHECKs to set the weight of the real servers to 0, without having to reload the configuration?

VigneshSP94 commented 3 years ago

No, sending signal 32/33 didn't have any effects. So, it is not reproducible at will.

Unfortunately, ever since I compiled with debug EINTR, I didn't see this error. Just giving it more time.

The config is the same as below, but with ~20 VIPs and 3 real servers each VIP.

virtual_server 192.0.2.1 80 {
        delay_loop 3
        lvs_sched mh
        mh-port
        mh-fallback
        lvs_method TUN
        protocol TCP
        real_server 198.51.100.1 80 {
                weight 10
                HTTP_GET {
                        url {
                                path "/ping"
                                status_code 200
                        }
                        connect_ip 198.51.100.1
                        connect_port 9001
                        connect_timeout 2
                }
        }
        real_server 198.51.100.2 80 {
                weight 10
                HTTP_GET {
                        url {
                                path "/ping"
                                status_code 200
                        }
                        connect_ip 198.51.100.2
                        connect_port 9001
                        connect_timeout 2
                }
        }
        [... and more ...]
}
[... and more ...]

I will add the full config if the above isn't sufficient for testing.

About the FILE_CHECKs, we were unaware of this option to set weights, we will PoC this feature.

pqarmitage commented 3 years ago

@VigneshSP94 Compiling with --enable-eintr-debug=check will stop the scheduler: epoll_wait error: 4 (Interrupted system call) messages being logged. It should make no difference whether eintr-debug is enabled or not, other that the logged message. The only possibility is that epoll_wait() is continuously returning EINTR, which would be highly unexpected, and I will produce a patch to test for this.

Since all the signals that keepalived handles are blocked and handled via signalfd, I had expected that it would not be possible to EINTR to be returned by any system call, and hence by default the code does not check for EINTR. However, I had added the option to check for it with the configure option above, so that it could be tested for debugging purposes. I hadn't expected that signals 32 or 33 would ever be raised, since they are used by the pthreads library, and keepalived does not use pthreads multi-threading.

I will have a look and see if I can produce a patch that will allow keepalived to log which process has raised signal 32 or 33, but this will require circumventing some of the glibc code to stop processes doing anything with signals 32 and 33. So far as I can see glibc/pthreads uses signals 32 and 33 to signal between threads when either UIDs/GIDs are changed or for thread cancellation and POSIX timers. Since keepalived is not multi-threaded in this sense, then I do not see why the signals should be raised (if anyone can explain why that might be happening it would be very helpful).

It would be very helpful to see the full configuration so that I can test exactly the same as you are running, and it might be that we can see something in the configuration that could be causing this.

VigneshSP94 commented 3 years ago

@pqarmitage Sure, I've attached the full config file in this comment, just replaced the ip addresses with x.x.x.x keepalived.txt

Also, here is our systemd file if that would give some information.

[Unit]
Description = Keepalive Daemon (LVS)
After = syslog.target network-online.target
Wants = network-online.target
ConditionFileNotEmpty = /etc/keepalived.conf

[Service]
LimitNOFILE = 8192
Type = forking
ExecStart = /sbin/keepalived 
ExecReload = /bin/kill -HUP $MAINPID
KillMode = process
Restart = always

[Install]
WantedBy = multi-user.target

VigneshSP94 commented 3 years ago

@pqarmitage we see this issue on few newly provisioned nodes, unfortunately, these nodes are not compiled with debug EINTR option. Were you able to reproduce this by any chance?

Thank you !

pqarmitage commented 3 years ago

I haven't been able to reproduce the problem yet. I still need to produce a patch to attempt to identify what is sending signal 32 or signal 33 to the keepalived parent process.

In general I would not expect this problem to occur even without compiling with the --enable-eintr-debug=check option. However, given that you are experiencing the problem I would expect compiling with --enable-eintr-debug=check to resolve the problem.

@VigneshSP94 Do you have any idea what might be causing signal 32 or 33 to be being sent to the keepalived parent process?

pqarmitage commented 3 years ago

@VigneshSP94 I think the only way we are going to be able to work out what is happening is to add some diagnostics for when signal 32 or signal 33 is received, to attempt to determine what is originating the signal.

The attached patch makes keepalived hook into signals 32 and 33 and will log some information, such as:

May 03 16:37:14 hostnane Keepalived[21974]: Signal 32 received, code 0 pid 411023 uid 0, NOT from NPTL
May 03 16:37:14 hostnane Keepalived[21974]:   program was ../utils/code_test/sigtest_kill 32 21974 1

sigtest_kill is a test program which sends a signal to the specified process and waits for a second so that keepalived can read the process name (I have also attached sigtest_kill.c in case you want to test it).

The patch has to make some syscalls directly rather than using glibc for some of the signal related calls, because glibc attempts to stop a program doing anything with signals 32 and 33.

The patch chains some new code into the signal handling, and then calls the original signal handler, so the overall function should not be affected. However, since the patch uses a signalfd to receive the signals, it means that I expect you will not get any of the EINTR error messages that you have been experiencing, and you do NOT need to compile with the --enable-eintr-debug=check configure option.

What you will need to do is to run `grep -E "(Signal 3[23] received|program was )" on your log files to check for any occurrences of either of the signals being received, and then with some luck we should be able to identify where the signals are coming from. If you can post any such output in this issue it would be very helpful.

To run sigtest_kill, the parameters are signum, pid and wait time. So if you run sigtest_kill 32 PID 1 where PID is the pid of the parent keepalived process, you should see log entries like the example above.

Patch to add diagnostics for signals 32 and 33 sigtest_kill.c

The only reason at the moment that I can think of regarding why you are experiencing the problem of the parent process stopping responding to SIGHUP is that a library call other than epoll_wait() is returning with EINTR, and that is not being handled properly.

As mentioned above, with that attached patch applied you should not get any EINTR returns, since signals 32 and 33 will no longer be handled asynchronously, but the signals will be delivered via the signalfd. Once the patch is applied, if you look at the SigCgt entry in /proc/PID/status it should be 0000000000000000, meaning that no signals can be delivered asynchronously.

I look forward to receiving your feedback with, I hope, some log entries telling us what processes are generating these signals.

VigneshSP94 commented 3 years ago

Thank you so much for the patch, I'm applying it today. I will update the issue as soon as I see some logs.

pqarmitage commented 3 years ago

@VigneshSP94 Do you have any update following applying the patch?

VigneshSP94 commented 3 years ago

@pqarmitage I applied the patch and checking the logs every now and then, but no luck yet.

pqarmitage commented 3 years ago

@VigneshSP94 Do you have any update on this?

VigneshSP94 commented 3 years ago

@pqarmitage no, we did not catch it. Its been 2 months now. We are deciding to upgrade to 2.2.2 !

pqarmitage commented 3 years ago

@VigneshSP94 Can you confirm how you have built keepalived. In particular, did you use --enable-eintr-debug=check option, and did you apply my patch? Which version of keepalived have you been using for your testing?

VigneshSP94 commented 3 years ago

@pqarmitage I applied the patch on v2.2.2 and still watching it

~# keepalived -vv
Keepalived v2.2.2 (03/05,2021)

Copyright(C) 2001-2021 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.15.18
Running on Linux 5.4.0-54-generic #60~18.04.1-Ubuntu SMP Fri Nov 6 17:25:16 UTC 2020
Distro: Ubuntu 18.04.1 LTS

configure options: --prefix=/usr/local/src/keepalived-2.2.2 --enable-eintr-debug=check

pqarmitage commented 3 years ago

@VigneshSP94 I think I may have discovered today, entirely by accident, what might have been causing epoll_wait() to return EINTR. I accidentally suspended (i.e. the system stopped but preserved the contents of RAM) my laptop while keepalived was running and I was monitoring the log output. After I resumed the system I noticed a couple of scheduler: epoll_wait error: 4 (Interrupted system call) log entries. I rebuilt keepalived with the patch I provided above, and this still occurred. I checked /proc/PID/status and it showed SigCgt: 0000000000000000, so no signal handlers should have been invoked.

After some further research I discovered in the signal(7) man page

Interruption of system calls and library functions by stop signals
       On Linux, even in the absence of signal handlers, certain blocking interfaces can fail with the error EINTR after the process is stopped by one of the stop signals  and
       then resumed via SIGCONT.  This behavior is not sanctioned by POSIX.1, and doesn't occur on other systems.

and one of those blocking interfaces is epoll_wait(). So on suspend (and presumably hibernate) a stop signal must be being sent to keepalived (and presumably every other process).

My understanding had been that EINTR could only be returned if a signal handler was run. We now have the situation that EINTR can be returned from some system calls even if no signal handler is installed, and from the signal(7) man page the only one of those system calls that keepalived uses is epoll_wait().

While I am not suggesting that SIGSTOP was the cause of the problem, it is now evident that we have to handle EINTR being returned to epoll_wait(), and handling that should resolve your problem.

I will have to consider what is the best way to deal with signals 32 and 33. I recollect from when I looked at the glibc NPTL code that one of those signal handlers was installed with SA_RESTART and the other wasn't, so there could be a number of system calls for which we need to handle EINTR.

pqarmitage commented 3 years ago

Checking for EINTR being returned by epoll_wait() was added in commit 93a17a0 three months ago. Since there have been no reports of epoll_wait errors I am closing this issue.

acassen / keepalived

Keepalived: scheduler: epoll_wait error: 4 #1883