Regular load average spikes occurring with keepalived v1.3.5

solarmon commented 1 month ago

Hi,

Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2

Copyright(C) 2001-2017 Alexandre Cassen, <acassen@gmail.com>

Build options:  PIPE2 LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_PREF RTA_VIA FRA_OIFNAME FRA_SUPPRESS_PREFIXLEN FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK LIBIPTC LIBIPSET_DYNAMIC LVS LIBIPVS_NETLINK VRRP VRRP_AUTH VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC FIB_ROUTING INET6_ADDR_GEN_MODE SNMP_V3_FOR_V2 SNMP SNMP_KEEPALIVED SNMP_CHECKER SNMP_RFC SNMP_RFCV2 SNMP_RFCV3 SO_MARK

OS:

Name: CentOS
Version: 7.9
Architecture: x86_64

We are using Nagios and SNMP to monitor (using check_snmp_load_wizard.pl) the load of the CentOS virtual machine (in VMWare) hosting keepalived.

The monitoring shows that the load average spikes every 14 to 18 hours, for about 20 to 30mins.

The resource graphs for CPU, memory and disk usage does not seem to correlate with the the load average spikes.

When we stop the keepalived service, the load average spikes do not occur - which suggests they are related somehow to the keepalived service.

I understand that keepalived v.1.3.5 is very old, released in 2017 - we are using this version is it is what is available in the CentOS 7.9 repo we are using.

I will be looking to test and compare with newer versons of keepalived if/when I can find a way of getting it installed/deployed to our CentOS 7.9 VM.

Has there been any fixes/patches since keepalived v1.3.5 that may be related to this load average spike issue that we are seeing?

Thank you.

solarmon commented 1 month ago

Hi,

I managed find a keepalived v.2.2.1 RPM package at https://github.com/hnakamur/keepalived-rpm/ and was able to install that.

Unfortunately, the average load spikes are still occurring even on this version.

To give an example of the average load spikes, here the graph that shows them:

Note that keepalived v2.2.1 was installed at about 16:00 just before that last spike.

Here is the keepalived config on this (BACKUP) node:

global_defs {
    script_user <USER>
    enable_script_security
}

vrrp_script check_heplify {
    script "/etc/keepalived/keepalived_check_heplify.sh"
    interval 5
    timeout 2
    rise 2
    fall 2
    weight -100
}

vrrp_instance VIP1 {
        state BACKUP
        interface net1
        virtual_router_id 50
        #nopreempt
        priority 199
        advert_int 1
        authentication {
              auth_type PASS
              auth_pass <password>
        }
        unicast_src_ip <IP>
        unicast_peer {
                <IP>
        }
        virtual_ipaddress {
              <IP>/32 dev net1 label net1:vip1
        }
        track_script {
                check_heplify
        }
        notify_master "/etc/keepalived/keepalived_notify.sh"
        notify_backup "/etc/keepalived/keepalived_notify.sh"
        notify_fault "/etc/keepalived/keepalived_notify.sh"
        notify_stop "/etc/keepalived/keepalived_notify.sh"
        notify "/etc/keepalived/keepalived_notify.sh"
}

Since this is the BACKUP node, it has no load on it, yet the average load spikes still occur with this regularity and pattern.

Can anybody suggest how to troubleshoot how/why keepalived seems to be causing these average load spikes.

Is it worth disabling the track and notify scripts to see whether it changes anything?

pqarmitage commented 1 month ago

It might be worth posting your /etc/keepalived/keepalived_check_heplify.sh script. The notify scripts should not be causing any problem.

What you describe about load spikes every 14-18 hours for 20-30 minutes is not a problem with keepalived that has been reported before, and there is no clear reason why that might happen.

You could try running a script like the following in the background:

#!/bin/bash

while [ 1 ]; do
    echo; echo
    date >>/tmp/top.log
    top -b -n 1 >>/tmp/top.log
    echo
    ps -ef | grep keepalived >>/tmp/top.log
    sleep 300       # 5 minutes
done

If this script is running during one of the load spikes, it might give a better idea about what is happening.

solarmon commented 1 month ago

@pqarmitage

Thank you for taking the time to respond.

During the load spike, I use top/htop/atop but there doesn't seem to be anything that stands out to me. But attached is the log file based on the script you have suggested - for when using keepalived v1.3.5.

top.log

In the check and notify scripts, we do dump the keepalived data and stats using:

kill -USR1 $(cat /var/run/keepalived.pid)       # /tmp/keepalived.data
kill -USR2 $(cat /var/run/keepalived.pid)       # /tmp/keepalived.stats

Is this recommended to to be executed here? Since, it be done every time the check script is executed - it this case every 5 seconds, so it is not that intensive? Or should it only be done adhoc when needed?

pqarmitage commented 1 month ago

From top.log it is showing that keepalived is consistently showing process 31411 using very close to 0.15 secs CPU time every 5 minutes. Over the duration which the log file spans, that keepalived process uses 2.19 seconds CPU time in 1h15m. Extrapolating that to the CPU time that would have been used since 6 July (when keepalived started) it would be 1h10m43s. This is remarkably close to the total CPU time shown in the log for the keepalived process of 1h10m39s. So this would suggest that there is no regular spike in CPU utilisation of the keepalived process.

Now considering a spike of 4% CPU utilisation every 20 minutes since 6 July. That would amount to 1h47m CPU time. The total CPU utilisation of the 3 keepalived processes over that time has only been 1h28m, so this would seem not to account for the CPU time used by those spikes, even without allowing for the steady state CPU utilisation of pid 31411 of 1h10m.

So from the above, it doesn't appear to me that it is keepalived that is causing the problem. I do note that I think you probably need to run the script I provided for a whole day or more, and then when you have identified what time a CPU spike occurs you might be able to get a better look at what is happening on your system.

I do note that you appear to have 1.2Gb of available memory, but also that you have 69Mb of used swap space. Could this suggest that at some time, and maybe even periodically, so process or processes are run that cause swapping to occur?

I wouldn't choose to produce the keepalived.data and keepalived.stats files every 5 seconds if you don't need them that often, but it doesn't appear to be causing an undue load on the keepalived processes.

solarmon commented 1 month ago

@pqarmitage

Please note that the issue observed is about the load average spike - not a CPU usage spike. Or a spike (or high usage) of other resources like memory or disk I/O - not that I can find evidence of.

This is purely about the load average, as shown in top/atop/htop and as reported by the SNMP agent to Nagios.

The common thing with our nodes that show these load spikes is that they have keepalived running on them. And when keepalived is disabled then the load spike does not occur.

I've disabled the check and notify scripts completely on one keepalived node; and also disabled the dumping of the keepalived data and stats on another node; to see if that changes anything.

Regarding memory - the example top log was from a VM that had only 2G of memory allocated to it, but is the BACKUP node so it should be quite idle. However, we do have this issue occurring on various BACKUP nodes that have 6G and 16G of memory - so I don't think it is a memory resource issue.

pqarmitage commented 1 month ago

@solarmon How many CPUs are configured for the VM which produced the top.log posted above?

solarmon commented 1 month ago

@pqarmitage

For that example test VM, it is only 1 core (Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz).

But this issue has been observed on other VMs with other CPU combinations:

4 cores (Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz)
4 cores (Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz)
4 cores (Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz)

The issue occurs on both MASTER and STANDBY nodes.

pqarmitage commented 1 month ago

I am not sure I fully understand what load means, but the clearest descriptions I have found are at https://superuser.com/questions/1402079/understanding-load-average-and-cpu-in-linux and better still https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html.

On this basis, for a single CPU VM the load averages throughout from 09:13:38 to 10:08:41 are high, with the final two samples being low. I note that at 09:23:39 42.1% of CPU time is waiting on I/O. What is happening in top.log does not appear to match what you describe as a 20-30 minute peak in load to about 4, since the load appears to have been about 3 to 4 for the best part of an hour. During that time keepalived has used negligible CPU time, so it is difficult to see that keepalived could be doing anything to contribute to the high load. keepalived is designed not to wait on IO and for everything it does to be non blocking (except for the call to epoll()). Writing keepalived.data and keepalived.stats are exceptions to this, and have been known to cause keepalived to block especially if the files are written to remote (e.g. nfs) storage.

solarmon commented 1 month ago

@pqarmitage

My very simple explanation for load is that it is based on processes waiting for any resource - which could be a combination of CPU, memory, storage, and networking.

The example top log was actually for a load spike that occurred for longer than usual - for nearly two hours.

An update on my testing from yesterday - the keepalived node that had the keepalived.data and keepalived.stats dumps disabled from the check script (running every 5 seconds), this has not caused a load spike for 24 hours.

So the issue seems to be related, somehow, to the dumping of keepalived.data and keepalived.stats.

We had a Nagios/NCPA script that was reading these files to get the status of keepalived - so I'm not sure whether this could have an impact over a period of time?

For a workaround, I will keep keepalived.data and keepalived.stats dumping out of the check script, and perform it in the Nagios/NCPA script (every 5 minutes). But it will be good to get to the root cause of how/why dumping keepalived.data and keepalived.stats could lead to a load spike.

It seem to tie in with what you are saying about dumping of keepalived.data and keepalived.stats potentially could be blocking, somehow. But how this exactly leads to the load spikes will be interesting to figure out.

solarmon commented 1 month ago

@pqarmitage

Hhhmm...as soon as I posted that last message, I just noticed a load spike occurring on the keepalived node that I said I had made those changes (disabling dumping of keepalived.data and keepalived.stats in the check script, and putting in in the Nagios/NCPA script).

I'm not sure whether the enabling of the dumping of keepalived.data and keepalived.stats, even in the Nagios/Check script just causes the same thing.

solarmon commented 1 month ago

@pqarmitage

Another node, which had the check and notify scripts disabled, was not show the load spike for more than 24 hours. Then when I enabled keepalived.data and keepalived.stats dumping in the Nagios/NCPA script earlier this morning, the load spike occurred again just now:

I assume it doesn't matter what triggers keepalived.data and keepalived.stats dumping, it is still being done by keepalived.

I don't use keepalived.stats, so maybe I can stop dumping that and see if it makes a difference.

pqarmitage commented 1 month ago

It is difficult to see how dumping keepalived.data and keepalived.stats could be the cause of the problem, since you are doing it every 5 seconds and most of the time the load is very low, so it is something of a relief that stopping doing that does not stop the problem, albeit it seems to reduce the occurrence of it.

keepalived uses vfprintf() and fprintf() to write the keepalived.data and keepalived.stats files. I think these calls must block when buffers become full and need to be written out. Is it possible that when keepalived writes to these files that some other process is doing something (every 18 hours for 20-30 minutes) that causes the writes to be delayed and hence increase the load (I did notice a high waiting for IO percentage in the top output at 09:23:39). I really cannot see how keepalived can itself be doing anything different when the high load is happening.

solarmon commented 1 month ago

@pqarmitage

I didn't intend to suggest that keepalived itself was misbehaving or doing different. Just that when keepalived is running and whatever I am doing with it, is contributing to the load calculation because something is being blocked or held back for some reason.

CentOS 7 does maintenance of stale files in the /tmp directory - https://www.ucartz.com/clients/knowledgebase/1348/How-systemd-tmpfiles-cleans-up-ortmpor-or-orvarortmp-replacement-of-tmpwatch-in-CentOS-or-RHEL-7.html#:~:text=In%20CentOS%2FRHEL%207%2C%20tmpfiles,systemd%20on%20CentOS%2FRHEL%207. - so maybe the issue is related to that - but the maintenance should only for stale files that have not been touched for a period of time. However, I will see if there is a way to exclude the keepalived files from maintenance to eliminate this possibility.

solarmon commented 3 weeks ago

@pqarmitage

An update on my progress with this load spike troubleshooting.

Basically, I can't get to the ultimate root cause. But I know it has something to do with keepalived and it calling the check script.

The check script has logic to check whether a service is running, and if a file exists. The track_process and track_file features were not available on keepalived 1.3.5 - so that is why I implemented these checks in the check script.

I have now used keepalived 2.3.1 and have used the track_process and track_file features and now don't have an external check script. Now I don't have these large load spikes.

I can't explain what/why/how this change resolves the load spike issue, but it does.

The external check script had contained such logic:

SERVICE=<service_name>
if (systemctl -q is-active $SERVICE)
then
       echo "$SERVICE service is active"
else
        echo "$SERVICE service is not running - exit 1"
        exit 1
fi

FILE=<file_path>
if test -f "$FILE"; then
        echo "$FILE exists - exit 1"
        exit 1
fi

So it is very basic logic. I'm not sure how this could have caused any load spike.

pqarmitage commented 3 weeks ago

@solarmon Many thanks for your update. It is reassuring that using keepalived v2.3.1 and track_process/track_file stops the problem occurring. Have you tried using the track_script approach with keepalived v2.3.1 to see if that still exhibits the problem?

If you are happy with the current position, it would be helpful if you would close this issue. I don't think there is anything we can do with keepalived for this.

solarmon commented 3 weeks ago

@pqarmitage

Yes, I did initially try v2.2.1 and then v2.3.1 but it did not make a difference. The only time it made a difference is when keepalived was not running, and when the check script was called not.

So I basically offloaded what the check script was doing and that seems to resolve the issue - or at least prolong it, so I would probably need to waiting longer to see if the load spike occur again.

acassen / keepalived

Regular load average spikes occurring with keepalived v1.3.5 #2480