K8S resource update will cause the master/backup switching

Hey, guys, I'm back ...

In short:

When any k8s resource gets updated (svc, pod, configmap), I observe that the MASTER keepalived instance switches to BACKUP (version = 0.33). The cause appears to be the Cleanup() call in the reload logic.

Wed May 15 09:31:51 2019: (vips) removing VIPs.
Wed May 15 09:31:51 2019: (vips) Entering BACKUP STATE (init)

(1) Who will suffer?

Keep-alive connections that are open during the transition get dropped. My test: run ab -c 200 -n 10000 -k http://$VIP/ (-k enables keep-alive). During the MASTER/BACKUP transition, which lasts a few seconds, the client fails with apr_socket_recv: Connection reset by peer (104).

(2) suspicion:

I found that Cleanup() is called in Reload(). This seems unreasonable, because it removes the VIP from the NIC and causes the master/backup switch.

// Reload sends SIGHUP to keepalived to reload the configuration.
func (k *keepalived) Reload() error {
    glog.Info("Waiting for keepalived to start")
    for !k.IsRunning() {
        time.Sleep(time.Second)
    }

    k.Cleanup()
    glog.Info("reloading keepalived")
    err := syscall.Kill(k.cmd.Process.Pid, syscall.SIGHUP)

I'm also digging into the rationale for Cleanup() here, and I hope you gurus can share a quick hint.

(3) Detailed logs

The master log is below:

I0515 09:31:50.946543       8 keepalived.go:167] Waiting for keepalived to start
I0515 09:31:50.946589       8 keepalived.go:250] Cleanup: [10.6.*.112 10.6.*.101]
I0515 09:31:50.946596       8 keepalived.go:272] removing configured VIP 10.6.*.112
Wed May 15 09:31:50 2019: Netlink reflector reports IP 10.6.111.112 removed from ens192
Wed May 15 09:31:50 2019: (vips) Entering BACKUP STATE
Wed May 15 09:31:50 2019: (vips) sent 0 priority
Wed May 15 09:31:50 2019: (vips) removing VIPs.
I0515 09:31:51.146184       8 keepalived.go:272] removing configured VIP 10.6.111.101
Wed May 15 09:31:51 2019: Unknown VRID(13) received on interface(ens192). ignoring...
I0515 09:31:51.452042       8 keepalived.go:173] reloading keepalived
Wed May 15 09:31:51 2019: Reloading ...
Wed May 15 09:31:51 2019: Opening file '/etc/keepalived/keepalived.conf'.
Wed May 15 09:31:51 2019: Reloading
Wed May 15 09:31:51 2019: Got SIGHUP, reloading checker configuration
Wed May 15 09:31:51 2019: Reloading
Wed May 15 09:31:51 2019: Opening file '/etc/keepalived/keepalived.conf'.
Wed May 15 09:31:51 2019: Initializing ipvs
Wed May 15 09:31:51 2019: service [172.28.134.133]:tcp:80 no longer exist
Wed May 15 09:31:51 2019: Opening file '/etc/keepalived/keepalived.conf'.
Wed May 15 09:31:51 2019: Removing service [172.28.134.133]:tcp:80 from VS [10.6.111.112]:tcp:80
Wed May 15 09:31:51 2019: Activating healthchecker for service [172.28.51.80]:tcp:80 for VS [10.6.111.112]:tcp:80
Wed May 15 09:31:51 2019: Activating healthchecker for service [172.28.51.82]:tcp:80 for VS [10.6.111.112]:tcp:80
Wed May 15 09:31:51 2019: Activating healthchecker for service [172.28.51.85]:tcp:80 for VS [10.6.111.101]:tcp:80
Wed May 15 09:31:51 2019: Activating healthchecker for service [172.28.51.86]:tcp:80 for VS [10.6.111.101]:tcp:80
Wed May 15 09:31:51 2019: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Wed May 15 09:31:51 2019: (vips) Ignoring track_interface ens192 since own interface
Wed May 15 09:31:51 2019: Assigned address 10.6.*.43 for interface ens192
Wed May 15 09:31:51 2019: Assigned address fe80::250:56ff:feb4:3f33 for interface ens192
Wed May 15 09:31:51 2019: (vips) removing VIPs.
Wed May 15 09:31:51 2019: (vips) Entering BACKUP STATE (init)
Wed May 15 09:31:51 2019: VRRP sockpool: [ifindex(2), family(IPv4), proto(112), unicast(0), fd(11,12)]

When any k8s resource gets updated (svc, pod, configmap)

It shouldn't be triggered on every svc, pod, or configmap change in Kubernetes. It should only be triggered if the keepalived configmap itself changes, or if any svc/endpoint related to a VIP in that configmap changes. Are you seeing something different?
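To make the expected trigger condition concrete, here is a minimal sketch in Go. It is not the controller's actual code: needsReload, vipConfig, the "namespace/name" keys, and the service and configmap names are all hypothetical, and it assumes the configmap maps each VIP to the service behind it.

```go
package main

import "fmt"

// vipConfig maps each VIP to the "namespace/name" of the service behind it.
// This layout is an assumption made for the sketch.
type vipConfig map[string]string

// needsReload reports whether a change to the object identified by kind and
// key ("namespace/name") should trigger a keepalived reload.
func needsReload(kind, key, configMapKey string, vips vipConfig) bool {
	switch kind {
	case "configmap":
		// Only the keepalived configmap itself matters.
		return key == configMapKey
	case "service", "endpoints":
		// Only services (and their endpoints) that back a configured VIP matter.
		for _, svc := range vips {
			if svc == key {
				return true
			}
		}
	}
	// Pods and unrelated objects never trigger a reload directly.
	return false
}

func main() {
	// Hypothetical service names; the VIPs are taken from the logs above.
	vips := vipConfig{
		"10.6.111.112": "default/web",
		"10.6.111.101": "default/api",
	}
	cm := "kube-system/vip-configmap"

	fmt.Println(needsReload("endpoints", "default/web", cm, vips))
	fmt.Println(needsReload("configmap", "default/other", cm, vips))
	fmt.Println(needsReload("pod", "default/web-abc", cm, vips))
}
```

If reloads fire for objects that would return false here, that would point to the filtering rather than to Reload() itself.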

I'm also digging into the rationale for Cleanup() here, and I hope you gurus can share a quick hint.

The intent was to clean up VIPs on startup to fix a duplicate-VIP issue, since keepalived didn't do it properly itself. Since then, other changes have made the cleanup in reload unnecessary. Now that we have the health check, I think we can remove the cleanup in reload. The health check now triggers a shutdown if the instance is not MASTER but still holds a VIP, and shutdown calls Cleanup() to remove the duplicate VIP.

E0516 16:03:39.525379       6 main.go:464] Health check unsuccessful: BACKUP should not contain VIP 10.0.2.17
I0516 16:03:39.887737       6 main.go:325] Received SIGTERM, shutting down
I0516 16:03:39.887786       6 main.go:343] shutting down controller queues
I0516 16:03:39.887808       6 keepalived.go:252] Cleanup: [10.0.2.17]
I0516 16:03:39.887846       6 keepalived.go:274] removing configured VIP 10.0.2.17
I0516 16:03:40.061462       6 main.go:333] Exiting with 0
Thu May 16 16:03:40 2019: Stopping
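
The health check rule described above can be sketched roughly as follows. This only illustrates the rule, not the code in main.go: checkVIPs and its parameters are hypothetical names, and the set of assigned addresses is passed in rather than read from the interface.

```go
package main

import "fmt"

// checkVIPs returns an error when an instance that is not MASTER still holds
// one of the configured VIPs. assigned is the set of addresses currently
// present on the interface (supplied by the caller in this sketch).
func checkVIPs(state string, configured []string, assigned map[string]bool) error {
	if state == "MASTER" {
		// The MASTER is expected to hold the VIPs.
		return nil
	}
	for _, vip := range configured {
		if assigned[vip] {
			return fmt.Errorf("%s should not contain VIP %s", state, vip)
		}
	}
	return nil
}

func main() {
	configured := []string{"10.0.2.17"}
	assigned := map[string]bool{"10.0.2.17": true}

	if err := checkVIPs("BACKUP", configured, assigned); err != nil {
		// In the controller, a failed check leads to shutdown, which calls Cleanup().
		fmt.Println("Health check unsuccessful:", err)
	}
	fmt.Println(checkVIPs("MASTER", configured, assigned) == nil)
}
```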

I can do a PR for this.
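
A rough sketch of what the change could look like, assuming the only difference is that Reload() no longer calls Cleanup(). The struct here is a stand-in, not the real keepalived type: the function fields (isRunning, sendHUP, cleanup), hupSender, and Stop are hypothetical seams so the flow can be run without a keepalived process.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// keepalived is a stand-in for the controller's process wrapper.
type keepalived struct {
	isRunning func() bool  // reports whether the keepalived process is up
	sendHUP   func() error // delivers SIGHUP to the keepalived process
	cleanup   func()       // removes the configured VIPs from the interface
}

// hupSender returns a sendHUP implementation for a real process ID (Unix only).
func hupSender(pid int) func() error {
	return func() error { return syscall.Kill(pid, syscall.SIGHUP) }
}

// Reload asks keepalived to re-read its configuration. It deliberately does
// not call cleanup, so a MASTER keeps its VIPs across the reload.
func (k *keepalived) Reload() error {
	for !k.isRunning() {
		time.Sleep(time.Second)
	}
	if err := k.sendHUP(); err != nil {
		return fmt.Errorf("error reloading keepalived: %w", err)
	}
	return nil
}

// Stop is the only place in this sketch where VIPs are removed,
// for example after a failed health check.
func (k *keepalived) Stop() {
	k.cleanup()
}

func main() {
	cleanups := 0
	k := &keepalived{
		isRunning: func() bool { return true },
		sendHUP:   func() error { return nil }, // use hupSender(pid) against a real process
		cleanup:   func() { cleanups++ },
	}

	if err := k.Reload(); err != nil {
		fmt.Println("reload failed:", err)
	}
	fmt.Println("cleanups after reload:", cleanups)

	k.Stop()
	fmt.Println("cleanups after stop:", cleanups)
}
```

With this shape, VIP removal happens only on shutdown, so a plain configuration reload should not force the MASTER through BACKUP.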

aledbf / kube-keepalived-vip

K8S resource update will cause the master/backup switching #96