aioloswong commented 3 years ago

Describe the bug

I have a k8s cluster with 3 masters, and I am using keepalived for failover between them:

lzz.k8s.master1 192.168.0.62
lzz.k8s.master2 192.168.0.63
lzz.k8s.master3 192.168.0.64

The VIP is 192.168.0.60, which is shared by the three masters. An EIP is bound to this VIP, so I use a notify_master script to perform SNAT so that the three masters can access the internet, and so that pods scheduled on the master currently holding the VIP can access the internet as well.

When a master acquires the VIP, the command "iptables -t nat -A POSTROUTING -s 192.168.0.0/24 ! -d 192.168.0.0/24 -o eth0 -j SNAT --to-source 192.168.0.60" in "vip_start.sh" is executed successfully, but the commands in "calico_felix_config.sh" do not work properly: the script keeps restarting itself and never finishes.

SELinux is disabled. The problem only occurs when keepalived is started with "systemctl start keepalived". If I run "/usr/local/keepalived/sbin/keepalived -D" directly in a shell, the commands in "calico_felix_config.sh" are executed successfully.

Keepalived version

Keepalived v2.1.5 (07/13,2020)

Built with kernel headers for Linux 3.10.0 Running on Linux 5.10.2-1.el7.elrepo.x86_64 #1 SMP Sun Dec 20 09:53:23 EST 2020

configure options: --prefix=/usr/local/keepalived

Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTA_VIA FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK VRRP_VMAC IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE SO_MARK SCHED_RESET_ON_FORK

I also tried versions 1.3.5 and 2.0.7. The problem exists in both of them as well.

Distro (please complete the following information):

Name [CentOS]
Version [7.6]
Architecture [x86_64]

Details of any containerisation or hosted service (e.g. AWS)

HUAWEI Cloud. Keepalived was started with the command "systemctl start keepalived".

Configuration file:

global_defs {
    router_id LVS_DEVEL
}
vrrp_instance VI_1 {
    state BACKUP
    nopreempt
    interface eth0
    virtual_router_id 80
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass xxxx
    }
    virtual_ipaddress {
        192.168.0.60
    }
    notify_master /etc/keepalived/vip_start.sh
    notify_backup /etc/keepalived/vip_stop.sh
    notify_fault /etc/keepalived/vip_stop.sh
    notify_stop /etc/keepalived/vip_stop.sh
}
virtual_server 192.168.0.60 6443 {
    delay_loop 6
    lb_algo loadbalance
    lb_kind DR
    net_mask 255.255.255.0
    persistence_timeout 0
    protocol TCP
    real_server 192.168.0.62 6443 {
        weight 1
        SSL_GET {
            url {
                path /healthz
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.0.63 6443 {
        weight 1
        SSL_GET {
            url {
                path /healthz
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.0.64 6443 {
        weight 1
        SSL_GET {
            url {
                path /healthz
                status_code 200
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}

The above configuration is for the node lzz.k8s.master1. The configurations for lzz.k8s.master2 and lzz.k8s.master3 are the same, except that the priority is 50 for lzz.k8s.master2 and 30 for lzz.k8s.master3.

Notify and track scripts

The contents of "vip_start.sh" are as follows:

#!/bin/bash

iptables -t nat -A POSTROUTING -s 192.168.0.0/24 ! -d 192.168.0.0/24 -o eth0 -j SNAT --to-source 192.168.0.60
sh /etc/keepalived/calico_felix_config.sh &

The contents of "calico_felix_config.sh" are as follows:

#!/bin/bash

/usr/local/bin/calicoctl get felixconfig >> /etc/keepalived/calico.log
exitCode=$?
if [ $exitCode != 0 ]
then
    echo "$(date);calico node not ready" >> /etc/keepalived/calico.log
    sleep 1
    sh /etc/keepalived/calico_felix_config.sh &
    exit 0
fi
/usr/local/bin/calicoctl delete felixconfig node.lzz.k8s.master1
/usr/local/bin/calicoctl delete felixconfig node.lzz.k8s.master2
/usr/local/bin/calicoctl delete felixconfig node.lzz.k8s.master3
/usr/local/bin/calicoctl apply -f /etc/keepalived/felix.yaml >> /etc/keepalived/calico.log
echo "$(date);felix configuration created" >> /etc/keepalived/calico.log

The contents of "felix.yaml" are as follows:

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: null
  name: node.lzz.k8s.master1
spec:
  bpfLogLevel: ""
  ipipEnabled: true
  logSeverityScreen: Info
  reportingInterval: 0s
  natOutgoingAddress: 192.168.0.60

The contents of "vip_stop.sh" are as follows:

#!/bin/bash

iptables -t nat -D POSTROUTING -s 192.168.0.0/24 ! -d 192.168.0.0/24 -o eth0 -j SNAT --to-source 192.168.0.60

System Log entries

Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19596]: Starting Keepalived v2.1.5 (07/13,2020)
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19596]: Running on Linux 5.10.2-1.el7.elrepo.x86_64 #1 SMP Sun Dec 20 09:53:23 EST 2020 (built for Linux 3.10.0)
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19596]: Command line: '/usr/local/keepalived/sbin/keepalived' '-D'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19596]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19597]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19597]: Starting Healthcheck child process, pid=19598
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 25) Invalid lvs_scheduler 'loadbalance' - ignoring
Dec 30 01:30:36 lzz.k8s.master1 Keepalived[19597]: Starting VRRP child process, pid=19599
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 27) Unknown keyword 'net_mask'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 28) number '0' outside range [1, 2678400]
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 28) persistence_timeout invalid
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 38) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 50) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 62) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Virtual server [192.168.0.60]:tcp:6443: no scheduler set, setting default 'wlc'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Initializing ipvs
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Gained quorum 1+0=1 <= 3 for VS [192.168.0.60]:tcp:6443
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Activating healthchecker for service [192.168.0.62]:tcp:6443 for VS [192.168.0.60]:tcp:6443
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Activating healthchecker for service [192.168.0.63]:tcp:6443 for VS [192.168.0.60]:tcp:6443
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Activating healthchecker for service [192.168.0.64]:tcp:6443 for VS [192.168.0.60]:tcp:6443
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Registering Kernel netlink reflector
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Registering Kernel netlink command channel
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Assigned address 192.168.0.62 for interface eth0
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Assigned address fe80::f816:3eff:fe5d:2f28 for interface eth0
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: Registering gratuitous ARP shared channel
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) removing VIPs.
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Entering BACKUP STATE (init)
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_vrrp[19599]: VRRP sockpool: [ifindex( 2), family(IPv4), proto(112), fd(11,12)]
Dec 30 01:30:38 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Remote Web server [192.168.0.62]:tcp:6443 succeed on service.
Dec 30 01:30:39 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Remote Web server [192.168.0.63]:tcp:6443 succeed on service.
Dec 30 01:30:42 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Remote Web server [192.168.0.64]:tcp:6443 succeed on service.
Dec 30 01:33:49 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Backup received priority 0 advertisement
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Receive advertisement timeout
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Entering MASTER STATE
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) setting VIPs.
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Sending/queueing gratuitous ARPs on eth0 for 192.168.0.60
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:50 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: (VI_1) Sending/queueing gratuitous ARPs on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60
Dec 30 01:33:55 lzz.k8s.master1 Keepalived_vrrp[19599]: Sending gratuitous ARP on eth0 for 192.168.0.60

Additional context

The contents of "calico.log" are as follows:

Failed to create Calico API client: invalid configuration: no configuration has been provided
Wed Dec 30 01:42:16 CST 2020;calico node not ready

If I replace the commands in "calico_felix_config.sh" with the following:

#!/bin/bash

kubectl get felixconfigurations.crd.projectcalico.org >> /etc/keepalived/calico.log
exitCode=$?
if [ $exitCode != 0 ]
then
    echo "$(date);calico node not ready" >> /etc/keepalived/calico.log
    sleep 1
    sh /etc/keepalived/calico_felix_config.sh &
    exit 0
fi
kubectl delete felixconfigurations.crd.projectcalico.org node.lzz.k8s.master1
kubectl delete felixconfigurations.crd.projectcalico.org node.lzz.k8s.master2
kubectl delete felixconfigurations.crd.projectcalico.org node.lzz.k8s.master3
kubectl apply -f /etc/keepalived/k8s-felix.yaml >> /etc/keepalived/calico.log
echo "$(date);felix configuration created" >> /etc/keepalived/calico.log

The contents of "k8s-felix.yaml" are as follows:

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: node.lzz.k8s.master2
spec:
  bpfLogLevel: ""
  ipipEnabled: true
  logSeverityScreen: Info
  natOutgoingAddress: 192.168.0.60
  reportingInterval: 0s

The problem still occurs.

pqarmitage commented 3 years ago

First of all, as a matter of course, you should resolve the configuration issues identified in the log before reporting an issue:

Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 27) Unknown keyword 'net_mask'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 28) number '0' outside range [1, 2678400]
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 28) persistence_timeout invalid
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 38) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 50) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: (/etc/keepalived/keepalived.conf: Line 62) nb_get_retry is deprecated - please use 'retry'
Dec 30 01:30:36 lzz.k8s.master1 Keepalived_healthcheckers[19598]: Virtual server [192.168.0.60]:tcp:6443: no scheduler set, setting default 'wlc'
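
As a rough illustration of how the warnings above could be cleared, here is a sketch of the virtual_server block with only the flagged lines changed. It assumes that 'wlc' (the default keepalived fell back to) is an acceptable scheduler and that leaving out persistence_timeout keeps persistence disabled; neither choice is confirmed by the reporter, and the remaining two real_server blocks are abbreviated.

```
virtual_server 192.168.0.60 6443 {
    delay_loop 6
    lb_algo wlc              # 'loadbalance' was rejected as an invalid lvs_scheduler
    lb_kind DR
    # net_mask removed: the log reports it as an unknown keyword
    # persistence_timeout removed: 0 is outside the accepted range [1, 2678400]
    protocol TCP
    real_server 192.168.0.62 6443 {
        weight 1
        SSL_GET {
            url {
                path /healthz
                status_code 200
            }
            connect_timeout 3
            retry 3          # replaces the deprecated nb_get_retry
            delay_before_retry 3
        }
    }
    # real_server blocks for 192.168.0.63 and 192.168.0.64 follow the same pattern
}
```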

You identify that if you run keepalived from the shell, then everything works properly, so this is clearly not a keepalived issue, but rather more likely related to the environment that keepalived is being run in; keepalived itself cannot be responsible for the environment in which it is run.

You state that the calico.log shows

Failed to create Calico API client: invalid configuration: no configuration has been provided
Wed Dec 30 01:42:16 CST 2020;calico node not ready

so there is clearly an issue relating to running calicoctl in a systemd environment. Is there, for example, a shell environment variable for the path to the calico configuration file that is set when running keepalived directly from the shell, but is not set when running keepalived using systemd? Presumably there is a similar problem when using kubectl. I think you will need to investigate the differences between the environment when keepalived is run from the shell and the environment when it is run by systemd, especially since you indicate that SELinux is not the issue. For example, comparing /proc/PID/environ when running in the different ways may help.
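
One possible way to do that comparison is sketched below. The helper names dump_env and diff_env are made up for this example, and the PIDs are whatever keepalived processes you want to compare; it only relies on /proc/PID/environ being NUL-separated on Linux.

```shell
#!/bin/bash
# Print the initial environment of a running process, one VAR=value per line.
# /proc/PID/environ is NUL-separated, so translate NULs to newlines and sort.
dump_env() {
    tr '\0' '\n' < "/proc/$1/environ" | sort
}

# Show the variables that differ between two processes.
# Lines starting with '<' come from the first PID, '>' from the second.
diff_env() {
    env_a=$(mktemp)
    env_b=$(mktemp)
    dump_env "$1" > "$env_a"
    dump_env "$2" > "$env_b"
    diff "$env_a" "$env_b"
    status=$?
    rm -f "$env_a" "$env_b"
    return "$status"
}
```

For example, start keepalived once from the shell and once via systemctl, note the PID each time, and run diff_env with the two PIDs as root. A variable such as HOME or KUBECONFIG that appears on only one side would be a likely suspect.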

Unfortunately I am not aware of any previous issue having been raised in relation to using Huawei Cloud, calicoctl or felix, and so there are no previous insights that we can point to.

Although there is nothing to indicate that this relates to the problem, I note that you appear to be running on RHEL7, which has a 3.10 kernel, whereas you are running a 5.10 kernel. Also you are running keepalived on a 5.10 kernel although it has been built using 3.10 kernel headers.

Since you indicate that keepalived runs successfully when running from the command line, this indicates that keepalived itself is not the problem, and so I am now closing this issue.

aioloswong commented 3 years ago

@pqarmitage Thanks for your reply. I noticed your suggestion to compare /proc/PID/environ when running in the different ways. I also found the following information, which is printed when joining a node to the k8s cluster:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

The HOME environment variable in my ECS is “/root”, but it was not set when keepalived was started by systemd. So I added "HOME=/root" to the keepalived environment file, which is located at "/etc/sysconfig/keepalived". After that, it works.
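
An alternative that does not depend on the service environment is to set HOME inside the notify script itself, and to replace the self-respawning retry with a bounded loop. The following is only a sketch: the retry helper is a hypothetical function written for this example, and /root as the fallback is an assumption based on the kubeconfig location described above.

```shell
#!/bin/bash
# Assumption: root's kubeconfig is /root/.kube/config, so fall back to /root
# when the service manager did not provide HOME.
export HOME="${HOME:-/root}"

# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Hypothetical usage inside calico_felix_config.sh:
# retry 30 1 /usr/local/bin/calicoctl get felixconfig >> /etc/keepalived/calico.log || exit 1
```

With a bounded loop, a missing configuration shows up as a single failed run after a fixed number of attempts instead of a script that keeps relaunching itself.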

I think issues https://github.com/acassen/keepalived/issues/1376 and https://github.com/acassen/keepalived/issues/1371 may have the same cause.

acassen / keepalived

notify_master script does not work properly #1825
