Closed: drivera-armedia closed this issue 3 years ago
All this said, we definitely need a feature where we can up the logging and see if the Keepalived process is even receiving or reacting to any VRRP traffic (and how)...
@drivera-armedia You state:
Other interesting behavior is that keepalived is not detecting the peer and both are assuming the role of master, regardless of what I try.
The reason for this is the following configuration snippet:
@dns1 virtual_router_id 101
@dns2 virtual_router_id 102
If you want VRRP instances to communicate with each other, they must have the same virtual_router_id (VRID). The purpose of the VRID is to allow independent groups of VRRP instances to operate over the same interfaces.
Replace the above two lines with:
virtual_router_id 101
and they should start talking to each other.
You also need to change the priority setting, so that the one you prefer to be master has the higher priority, e.g.
@dns1 priority 150
@dns2 priority 100
state BACKUP
is unnecessary, so you could remove it.
The unicast_peer block
unicast_peer {
@dns1 192.168.3.252/22
@dns2 192.168.3.253/22
}
should be changed to:
unicast_peer {
@dns1 192.168.3.252
@dns2 192.168.3.253
}
since you are not specifying networks, but rather individual IP addresses.
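Putting the three changes together, the relevant part of the shared config would look something like this (a sketch, keeping your @dns1/@dns2 conditional-line syntax; the instance name and anything not shown are assumptions, not taken from your actual config):

```
vrrp_instance VI_1 {          # instance name is an assumption
    virtual_router_id 101     # same VRID on both nodes so they can talk
    @dns1 priority 150        # dns1 preferred as MASTER
    @dns2 priority 100
    unicast_peer {
        @dns1 192.168.3.252   # plain addresses, no /22 prefix
        @dns2 192.168.3.253
    }
    ...
}
```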
None of the above relates to the segfault you have experienced, but it will enable your two systems to communicate with each other, so that one is MASTER and the other BACKUP.
I'll look further at what might be causing the segfault. It is also worrying that the chk_docker script appears not to be completing and that keepalived is reporting it as still running. I will look into this further, but I thought you might find it useful to know about the configuration issues in the meantime.
Thank you for those clarifications. The main concern remains the child track_script instance hanging as defunct ... I tried calling docker directly, and only resorted to setsid as an attempt to facilitate child reaping ... no dice!
I resorted to systemd dependency trickery to get what I wanted (the server to not service the shared IP when docker wasn't running). I would have much preferred to have it happen via "normal" means (i.e. track_script).
I realize this is probably a fairly old version (2.0.10), but I'm surprised it's having such problems with child reaping ... perhaps the timeout is a bit tight? (I'm using the default)
The log messages
Tue Sep 7 13:00:45 2021: Track script chk_docker is being timed out, expect idle - skipping run
Tue Sep 7 13:00:45 2021: Child (PID 10090) failed to terminate after kill
Tue Sep 7 13:00:45 2021: Child (PID 9844) failed to terminate after kill
Tue Sep 7 13:00:49 2021: Track script chk_docker is already running, expect idle - skipping run
Tue Sep 7 13:00:53 2021: Track script chk_docker is being timed out, expect idle - skipping run
Tue Sep 7 13:00:53 2021: Child (PID 10123) failed to terminate after kill
Tue Sep 7 13:00:53 2021: Child (PID 10109) failed to terminate after kill
Tue Sep 7 13:00:57 2021: Track script chk_docker is already running, expect idle - skipping run
appear to be the trigger of the problem. It seems that the /usr/bin/docker info command is not completing, and keepalived is then not properly handling the failure of the docker info command to terminate.
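One way to keep a hung check from piling up defunct children is to bound its runtime inside the track script itself with coreutils timeout, so the script always exits even if the underlying command never returns. A sketch of the mechanism, with sleep 10 standing in for a hung docker info (not your actual script):

```shell
#!/bin/sh
# Bound a potentially-hanging check with coreutils timeout.
# "sleep 10" stands in here for a hung "docker info".
if timeout 1 sleep 10; then
    echo "check succeeded"
else
    # timeout exits with status 124 when it had to kill the command
    echo "check timed out or failed (exit $?)"
fi
```

With docker info in place of sleep, the script would report failure within a fixed bound instead of leaving keepalived to kill it.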
Unfortunately I cannot remember if there have been any commits since v2.0.10 relating to fixing scripts failing to terminate (there have been almost 1600 commits in the nearly 3 years since v2.0.10 was released).
If you have a coredump from the segfault and can provide a symbolic stack backtrace (using gdb) from it, then we might be able to find the cause of the problem. Alternatively, could you build keepalived v2.2.4 and try that to see if you still get the segfault?
I have no experience of using docker, but I have tried running docker info and that appears to start dockerd if it is not running, so I am not clear what the purpose of the docker-ping script is.
I have some Raspberry Pis running 32 bit kernels which appear to match all the software versions you have indicated. I could therefore attempt to reproduce what you are doing, but I would need to know the docker setup and what docker workload you are running.
Are you running on a Raspberry Pi? If so, it would be useful to know what Raspberry Pi model you are using, so that I can try to test on the same model (I have a variety of Raspberry Pi models).
On systems using systemd there's a docker.socket unit that gets activated, so that if anything accesses the docker socket, the daemon gets started if it's not up already. By disabling this, docker can only be brought up or down explicitly, which is what I've done. So by running "docker info" I can test whether docker is running, and with docker.socket disabled that test won't cause systemd to start the service.
I'll try to gather a coredump for you today.
It would be good to see the core dump stacktrace so that we can make sure that the segfault doesn't still exist.
An alternative that some people use to check whether a process is running is:
killall -0 dockerd
A better alternative, which avoids periodic checks entirely, is to use track_process. It monitors processes starting and stopping, so as soon as a process such as dockerd terminates, any VRRP instances tracking it will go to fault state or change their priority, depending on the configuration. Unfortunately I didn't add track_process until after v2.0.10, so to use it you would need to build a later version of keepalived.
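On a keepalived version newer than v2.0.10, a track_process setup might look something like this (a sketch; the block and instance names are illustrative, not from your config):

```
vrrp_track_process track_docker {
    process dockerd           # go to fault as soon as dockerd exits
}

vrrp_instance VI_1 {
    ...
    track_process {
        track_docker
    }
}
```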
The causes of the messages
Track script chk_docker is being timed out, expect idle - skipping run
Child (PID 10090) failed to terminate after kill
were fixed by keepalived v2.0.21.
Can you upgrade keepalived to at least v2.0.21, or better still v2.2.4, and see if that resolves your problem.
I'm trying to avoid recompiling. For now I've modified the healthcheck scripts to not use docker info as the means for health checks, and things seem to be running smoothly.
Those boxes are due to be upgraded to the latest Debian soon, so hopefully the new repos have updated packages.
Thanks!
Describe the bug: Nature unknown; reporting as directed due to log message.
To Reproduce: Unknown.
Expected behavior: The keepalived VRRP child process should not segfault...?
Keepalived version
Distro (please complete the following information):
Details of any containerisation or hosted service (e.g. AWS) N/A
Configuration file:
Notify and track scripts
System Log entries
Did keepalived coredump?
Additional context
Other interesting behavior is that keepalived is not detecting the peer and both are assuming the role of master, regardless of what I try. I've modified the firewall rules and can see the VRRP traffic being exchanged, and I've also tried flat out disabling the firewall, but they refuse to assume a master-backup stance and both insist on becoming masters.