Closed — Rtoax closed this issue 1 month ago
Could you please provide a copy of your keepalived configuration file? I'm going to need all the help I can get to sort this one out.
I think it might also help if you could post the output from the following from gdb:
frame 5
print *thread
The problem seems to have occurred when a child process of the VRRP process exited, with an exit status of 1. I presume that this is some vrrp track script or notify script.
I also note that the backtrace lists a function start_vrrp_child.isra.0. I have never seen this before - it is normally start_vrrp_child. Do you know what is causing that?
@Rtoax Is this problem repeatable or has it just occurred the once?
This problem has reappeared in versions 2.2.8 and 2.3.1, but never in version 2.2.4. To reproduce it, start keepalived with this command:
nohup /usr/sbin/keepalived -f /etc/keepalived/keepalived.conf --dont-fork --vrrp -D -S 0 &
A script then sends a SIGHUP signal to the main keepalived process every second, so that keepalived continually reloads its configuration:
#!/bin/bash
TARGET_PID=$1
while true; do
kill -SIGHUP "$TARGET_PID"
sleep 1
done
After a day or two of this, the null-pointer access recurs.
To speed up reproduction: adding the following logging to the process_child_termination function, on top of the method above, greatly improves the reproduction rate. With it, the null-pointer access can be reproduced within a few minutes of starting the program.
if (!thread_node)
return;
log_message(LOG_INFO, "%s(pid %d): rb_erase(t=%p pid=%d)\n", __func__, getpid(), thread, pid);
rb_erase(&thread->rb_data, &master->child_pid);
log_message(LOG_INFO, "%s(pid %d): rb_erase(t=%p pid=%d) done\n", __func__, getpid(), thread, pid);
thread->u.c.status = status;
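To picture why this corrupts the tree: erasing a node that is no longer validly linked into the structure rewrites pointers of unrelated nodes, and the damage only surfaces on a later lookup. Below is a minimal analogue in C, using an intrusive doubly-linked list as a stand-in for the rb-tree (the names and the self-link guard are illustrative, not keepalived's actual code):

```c
#include <stddef.h>

/* Intrusive node, embedded in a larger object the way rb_data is
 * embedded in a thread. */
struct node {
    struct node *prev, *next;
};

/* Sentinel-headed circular list. */
static void list_init(struct node *head) {
    head->prev = head->next = head;
}

static void list_add(struct node *head, struct node *n) {
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unsafe erase: run on a node that was already removed, it rewrites
 * the pointers of whatever its stale prev/next still point at,
 * silently corrupting the structure - the same class of bug as an
 * rb_erase on a node no longer validly in master->child_pid. */
static void list_erase(struct node *n) {
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Defensive erase: self-link the node on removal so a second erase
 * is a detectable no-op. Returns 1 if the node was unlinked. */
static int list_erase_safe(struct node *n) {
    if (n->next == n)          /* already unlinked */
        return 0;
    list_erase(n);
    n->next = n->prev = n;     /* mark "not on a list" */
    return 1;
}
```

The point of the sketch is only that an erase must be paired with exactly one insert; once two code paths (child termination and reload cleanup) can both remove the same node, some marker or ordering guarantee is needed.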
Configuration Files:
global_defs {
enable_script_security
script_user root
max_auto_priority -1
vrrp_garp_master_refresh 60
}
# These are separate checks to provide the following behavior:
# If the loadbalanced endpoint is responding then all is well regardless
# of what the local api status is. Both checks will return success and
# we'll have the maximum priority. This means as long as there is a node
# with a functional loadbalancer it will get the VIP.
# If all of the loadbalancers go down but the local api is still running,
# the _both check will still succeed and allow any node with a functional
# api to take the VIP. This isn't preferred because it means all api
# traffic will go through one node, but at least it keeps the api available.
vrrp_script chk_ovs_alive_1 {
script "/usr/bin/timeout 6 ps -ef"
interval 2
weight 49
rise 3
fall 3
}
vrrp_script chk_ovs_alive_2 {
script "/usr/bin/timeout 5 ll"
interval 1
weight 11
rise 4
fall 2
}
vrrp_script chk_ovs_alive_3 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 13
rise 3
fall 4
}
vrrp_script chk_ovs_alive_4 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_5 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_6 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_7 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_8 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_9 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_script chk_ovs_alive_10 {
script "/usr/bin/timeout 4.9 ping 127.0.0.1 -c 3 -i 1"
interval 1
weight 50
rise 3
fall 2
}
vrrp_instance cluster24 {
state BACKUP
interface enp1s0
virtual_router_id 2
priority 40
advert_int 1
unicast_src_ip *.*.*.*
unicast_peer {
*.*.*.*
}
authentication {
auth_type PASS
auth_pass cluster24
}
virtual_ipaddress {
*.*.*.*/32
}
track_script {
chk_ovs_alive_1
chk_ovs_alive_2
chk_ovs_alive_3
chk_ovs_alive_4
chk_ovs_alive_5
chk_ovs_alive_6
chk_ovs_alive_7
chk_ovs_alive_8
chk_ovs_alive_9
chk_ovs_alive_10
}
}
@Zhiqiang-Lin Many thanks for the information above. I have found the cause of the problem, and now just need to work out a solution. The problem is caused by a script having terminated (or timed out), and the thread for processing the termination not having been run before the thread for processing the reload is run.
I have also identified a further problem while investigating this, which is that threads queued for running scripts are still queued after the reload, but they have pointers to the old script details which have been freed during the reload.
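One way to picture this second problem, as a sketch rather than keepalived's actual fix (the queue, struct names, and flush_stale below are invented for illustration): before a reload frees the old script configuration, any queued entries still pointing at it have to be dropped, otherwise they dangle.

```c
#include <stdlib.h>

struct script_cfg { char name[32]; };

/* A queued script thread holding a pointer into configuration data;
 * after a reload frees that data, the pointer dangles. */
struct pending {
    struct pending *next;
    const struct script_cfg *cfg;
};

/* Remove every queued entry that references the configuration about
 * to be freed; returns how many entries were dropped. Running this
 * *before* freeing the old configuration is what prevents the
 * use-after-free described above. */
static int flush_stale(struct pending **q, const struct script_cfg *old_cfg) {
    int dropped = 0;
    while (*q) {
        if ((*q)->cfg == old_cfg) {
            struct pending *dead = *q;
            *q = dead->next;
            free(dead);
            dropped++;
        } else {
            q = &(*q)->next;
        }
    }
    return dropped;
}
```

The alternative design, re-pointing surviving queue entries at the corresponding new configuration objects, avoids losing in-flight script runs but requires matching old entries to new ones by name.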
Sorry for the late reply - it is a repeatable problem, I think.
This was a very difficult problem to track down. It only occurred when a track_script had timed out, the thread relating to the timeout had not yet been processed, and keepalived was signalled to reload. This caused a red-black tree to be corrupted, and subsequent use of that red-black tree could cause a segfault.
Commit 7e04261d resolves this issue.