acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

duplicated routes after reload #2076

Closed mstinsky closed 2 years ago

mstinsky commented 2 years ago

Describe the bug

We are using keepalived in an openstack installation for the HA router implementation. After we upgraded keepalived from 1.3.9 to 2.2.4 we have problems with duplicated routes in the respective router namespaces after keepalived is reloaded by neutron.

Routes on a fresh keepalived start:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

After that I can force the duplicated routes the same way neutron reloads keepalived: kill -HUP ${pid}

Routes after the reload of keepalived:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

I am not observing the same behaviour with IPv6 routes. For IPv6, proto 18 is just added to the existing route.
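A quick way to spot this condition from a script is to ignore the proto field and look for routes that then appear more than once. The following is a minimal sketch, not part of keepalived or neutron; the function name find_duplicate_routes is made up for illustration, and it assumes the one-route-per-line format that ip r prints above.

```shell
# Hypothetical helper: reads `ip route` output on stdin and prints every
# route that occurs more than once when the "proto <value>" field is ignored.
find_duplicate_routes() {
    # Drop the proto field and any trailing whitespace, then keep only
    # the lines that are repeated.
    sed -E 's/ proto [^ ]+//; s/[[:space:]]+$//' | sort | uniq -d
}
```

Usage: ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r | find_duplicate_routes

Run against the IPv4 table after the reload, this should print the default route and the x.x.244.128/25 route, since those are the two entries that differ only in proto.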

IPv6 routes before keepalived reload:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip -6 r
x:x:1003::/64 dev qg-6c2ee5e0-ad proto kernel metric 256 pref medium
fe80::/64 dev ha-f64d319f-ed proto kernel metric 256 pref medium
fe80::/64 dev qr-15d63a29-8e proto kernel metric 256 pref medium
fe80::/64 dev qg-6c2ee5e0-ad proto kernel metric 256 pref medium
default via x:x:1003::fffd dev qg-6c2ee5e0-ad metric 1024 pref medium

IPv6 routes after keepalived reload:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip -6 r
x:x:1003::/64 dev qg-6c2ee5e0-ad proto kernel metric 256 pref medium
fe80::/64 dev ha-f64d319f-ed proto kernel metric 256 pref medium
fe80::/64 dev qr-15d63a29-8e proto kernel metric 256 pref medium
fe80::/64 dev qg-6c2ee5e0-ad proto kernel metric 256 pref medium
default via x:x:1003::fffd dev qg-6c2ee5e0-ad proto 18 metric 1024 pref medium

To Reproduce

Any steps necessary to reproduce the behaviour: n/a

Expected behavior

No duplicated routes with proto 18 after a keepalived reload.

Keepalived version

keepalived -v
Keepalived v2.2.4 (08/21,2021)

Copyright(C) 2001-2021 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 4.15.18
Running on Linux 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021
Distro: Ubuntu 18.04.6 LTS

configure options:

Config options:  LIBIPSET_DYNAMIC LVS VRRP VRRP_AUTH VRRP_VMAC OLD_CHKSUM_COMPAT INIT=SYSV

System options:  VSYSLOG MEMFD_CREATE IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LIBIPSET_PRE_V7 IPTABLES NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK

Distro (please complete the following information):

Details of any containerisation or hosted service (e.g. AWS)

We are running keepalived inside containers. In our case the base container is ubuntu:bionic.

Configuration file:

global_defs {
    notification_email_from neutron@openstack.local
    router_id neutron
}

vrrp_script ha_health_check_13 {
    script "/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/ha_check_script_13.sh"
    interval 40
    fall 2
    rise 2
}

vrrp_instance VR_13 {
    state BACKUP
    interface ha-f64d319f-ed
    virtual_router_id 13
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-f64d319f-ed
    }
    virtual_ipaddress {
        169.254.0.13/24 dev ha-f64d319f-ed
    }
    virtual_ipaddress_excluded {
        10.0.0.10/24 dev qr-15d63a29-8e no_track
        x.x.244.116/26 dev qg-6c2ee5e0-ad no_track
        x:x:1003::22b/64 dev qg-6c2ee5e0-ad no_track
        fe80::f816:3eff:fea9:cc49/64 dev qr-15d63a29-8e scope link no_track
        fe80::f816:3eff:fed2:6cc7/64 dev qg-6c2ee5e0-ad scope link no_track
    }
    virtual_routes {
        ::/0 via x:x:1003::fffd dev qg-6c2ee5e0-ad no_track
        0.0.0.0/0 via x.x.244.67 dev qg-6c2ee5e0-ad no_track
        x.x.244.128/25 dev qg-6c2ee5e0-ad scope link no_track
    }
    track_script {
        ha_health_check_13
    }
}

Notify and track scripts

#!/bin/bash -eu
ip a | grep fe80::f816:3eff:fea9:cc49 || exit 0
ping6 -c 1 -w 1 x:x:1003::fffd 1>/dev/null || exit 1
ping -c 1 -w 1 x.x.244.67 1>/dev/null || exit 1

System Log entries

Sat Jan  8 12:32:44 2022: Starting Keepalived v2.2.4 (08/21,2021)
Sat Jan  8 12:32:44 2022: Running on Linux 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 (built for Linux 4.15.18)
Sat Jan  8 12:32:44 2022: Command line: '/usr/sbin/keepalived' '-n' '-l' '-D' '-P' '-f'
Sat Jan  8 12:32:44 2022:               '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf' '-p'
Sat Jan  8 12:32:44 2022:               '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49.pid.keepalived' '-r'
Sat Jan  8 12:32:44 2022:               '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49.pid.keepalived-vrrp'
Sat Jan  8 12:32:44 2022: Opening file '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf'.
Sat Jan  8 12:32:44 2022: Configuration file /var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf
Sat Jan  8 12:32:44 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Sat Jan  8 12:32:44 2022: Starting VRRP child process, pid=3235467
Sat Jan  8 12:32:44 2022: Registering Kernel netlink reflector
Sat Jan  8 12:32:44 2022: Registering Kernel netlink command channel
Sat Jan  8 12:32:44 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Sat Jan  8 12:32:44 2022: (/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf: Line 31) Cannot specify scope for IPv6 addresses (fe80::f816:3eff:fea9:cc49/64) - ignoring scope
Sat Jan  8 12:32:44 2022: (/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf: Line 32) Cannot specify scope for IPv6 addresses (fe80::f816:3eff:fed2:6cc7/64) - ignoring scope
Sat Jan  8 12:32:44 2022: Startup complete
Sat Jan  8 12:32:44 2022: Unsafe permissions found for script '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/ha_check_script_13.sh'.
Sat Jan  8 12:32:44 2022: SECURITY VIOLATION - scripts are being executed but script_security not enabled. There are insecure scripts.
Sat Jan  8 12:32:44 2022: (VR_13) Ignoring track_interface ha-f64d319f-ed since own interface
Sat Jan  8 12:32:44 2022: Assigned address 169.254.197.12 for interface ha-f64d319f-ed
Sat Jan  8 12:32:44 2022: Assigned address fe80::f816:3eff:fe56:3f59 for interface ha-f64d319f-ed
Sat Jan  8 12:32:44 2022: Registering gratuitous ARP shared channel
Sat Jan  8 12:32:44 2022: Registering gratuitous NDISC shared channel
Sat Jan  8 12:32:44 2022: (VR_13) removing Virtual Routes
Sat Jan  8 12:32:44 2022: (VR_13) removing VIPs.
Sat Jan  8 12:32:44 2022: (VR_13) removing E-VIPs.
Sat Jan  8 12:32:44 2022: (VR_13) removing Virtual Routes
Sat Jan  8 12:32:44 2022: VRRP sockpool: [ifindex(1755), family(IPv4), proto(112), fd(14,15)]
Sat Jan  8 12:32:44 2022: VRRP_Script(ha_health_check_13) succeeded
Sat Jan  8 12:32:44 2022: (VR_13) Entering BACKUP STATE
Sat Jan  8 12:32:51 2022: (VR_13) Receive advertisement timeout
Sat Jan  8 12:32:51 2022: (VR_13) Entering MASTER STATE
Sat Jan  8 12:32:51 2022: (VR_13) setting VIPs.
Sat Jan  8 12:32:51 2022: (VR_13) setting E-VIPs.
Sat Jan  8 12:32:51 2022: (VR_13) setting Virtual Routes
Sat Jan  8 12:32:51 2022: Netlink: error: No route to host(113), type=RTM_NEWROUTE(24), seq=1641645180, pid=0
Sat Jan  8 12:32:51 2022: Netlink: error: Network is unreachable(101), type=RTM_NEWROUTE(24), seq=1641645181, pid=0
Sat Jan  8 12:32:51 2022: Netlink: error: Network is down(100), type=RTM_NEWROUTE(24), seq=1641645182, pid=0
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing gratuitous ARPs on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing gratuitous ARPs on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing gratuitous ARPs on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Error 100 (Network is down) sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Error 100 (Network is down) sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Error 100 (Network is down) sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Error 100 (Network is down) sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:32:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Error 100 (Network is down) sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:32:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:32:51 2022: Error 100 sending ndisc unsolicited neighbour advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:24 2022: Script `ha_health_check_13` now returning 1
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing gratuitous ARPs on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing gratuitous ARPs on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing gratuitous ARPs on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: (VR_13) Sending/queueing Unsolicited Neighbour Adverts on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on ha-f64d319f-ed for 169.254.0.13
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qr-15d63a29-8e for 10.0.0.10
Sat Jan  8 12:33:51 2022: Sending gratuitous ARP on qg-6c2ee5e0-ad for x.x.244.116
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for x:x:1003::22b
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qr-15d63a29-8e for fe80::f816:3eff:fea9:cc49
Sat Jan  8 12:33:51 2022: Sending unsolicited Neighbour Advert on qg-6c2ee5e0-ad for fe80::f816:3eff:fed2:6cc7
Sat Jan  8 12:34:04 2022: Script `ha_health_check_13` now returning 0
Sat Jan  8 12:35:56 2022: Reloading
Sat Jan  8 12:35:56 2022: (/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf: Line 31) Cannot specify scope for IPv6 addresses (fe80::f816:3eff:fea9:cc49/64) - ignoring scope
Sat Jan  8 12:35:56 2022: (/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/keepalived.conf: Line 32) Cannot specify scope for IPv6 addresses (fe80::f816:3eff:fed2:6cc7/64) - ignoring scope
Sat Jan  8 12:35:56 2022: VRRP_Script(ha_health_check_13) considered successful on reload
Sat Jan  8 12:35:56 2022: Unsafe permissions found for script '/var/lib/neutron/ha_confs/76f69b0d-c9ac-4a98-851a-f74b23b2de49/ha_check_script_13.sh'.
Sat Jan  8 12:35:56 2022: SECURITY VIOLATION - scripts are being executed but script_security not enabled. There are insecure scripts.
Sat Jan  8 12:35:56 2022: (VR_13) Ignoring track_interface ha-f64d319f-ed since own interface
Sat Jan  8 12:35:56 2022: read eventfd count 1, num_reloading 0
Sat Jan  8 12:35:56 2022: Assigned address 169.254.197.12 for interface ha-f64d319f-ed
Sat Jan  8 12:35:56 2022: Assigned address fe80::f816:3eff:fe56:3f59 for interface ha-f64d319f-ed
Sat Jan  8 12:35:56 2022: (VR_13) setting VIPs.
Sat Jan  8 12:35:56 2022: (VR_13) setting E-VIPs.
Sat Jan  8 12:35:56 2022: (VR_13) setting Virtual Routes
Sat Jan  8 12:35:56 2022: VRRP sockpool: [ifindex(1755), family(IPv4), proto(112), fd(14,15)]

Did keepalived coredump?

n/a

Additional context

I am not 100% sure whether this is a keepalived issue or something related to openstack neutron working together with keepalived 2.2.4. But as this happens on a manual keepalived reload, I have the feeling that something might be going wrong in keepalived. Any help on this is highly appreciated.

pqarmitage commented 2 years ago

I am confused about what is actually happening here. In the logs it shows that qg-6c2ee5e0-ad is down, whereas the ip r output appears to show that it is not (for me it shows linkdown against routes whose interface is down). Also the logs show

Sat Jan  8 12:32:51 2022: Netlink: error: No route to host(113), type=RTM_NEWROUTE(24), seq=1641645180, pid=0
Sat Jan  8 12:32:51 2022: Netlink: error: Network is unreachable(101), type=RTM_NEWROUTE(24), seq=1641645181, pid=0
Sat Jan  8 12:32:51 2022: Netlink: error: Network is down(100), type=RTM_NEWROUTE(24), seq=1641645182, pid=0

which means that keepalived has not succeeded in adding the 3 virtual routes when VR_13 transitions to master, for the reasons stated.
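For longer logs it can help to collapse these messages into one line per distinct failure. This is only a sketch that assumes the exact "Netlink: error: <text>(<errno>), type=<TYPE>(...)" layout shown in the log lines above; the function name summarise_netlink_errors is hypothetical.

```shell
# Hypothetical helper: reads a keepalived log on stdin and prints
# "<count> <message type> errno=<n> <error text>" for each distinct error.
summarise_netlink_errors() {
    sed -n 's/.*Netlink: error: \(.*\)(\([0-9]*\)), type=\([A-Z_]*\).*/\3 errno=\2 \1/p' |
        sort | uniq -c | sed 's/^ *//'
}
```

Usage: summarise_netlink_errors < keepalived.log (the log file name here is just a placeholder).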

From my testing, in order to get the Netlink: error:s above and the Error 100 (Network is down) sending gratuitous ARP messages, the link has to be administratively down (i.e. ip link set qg-6c2ee5e0-ad down).

It appears that some time between 12:32:51 and 12:33:24 interface qg-6c2ee5e0-ad was set up, and the reload then occurred after that.

What is really confusing is that if qg-6c2ee5e0-ad is set down administratively, the routes

default via x.x.244.67 dev qg-6c2ee5e0-ad
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

are deleted. I therefore cannot see how the ip r output above can match with the keepalived logs.

I have tried your configuration, albeit not using openstack/neutron, and a simple load/reload of keepalived v2.2.4 does not exhibit the problem.

The routes added by keepalived v2.2.4 all have proto 18 (this is proto keepalived if you have an up-to-date iproute2 package), and you can see this in the ip r output after the reload. keepalived v1.3.9 does not add the proto 18 to the routes.
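To see only the routes that carry this marker, iproute2 can filter on the protocol directly (ip route show proto 18). When working from saved ip r output instead, a small filter does the same job. The sketch below is an illustration only; the function name keepalived_routes is made up, and it matches both the numeric and the named form of the protocol.

```shell
# Hypothetical helper: reads `ip route` output on stdin and keeps only the
# routes tagged with protocol 18, shown as "keepalived" by newer iproute2.
keepalived_routes() {
    grep -E ' proto (18|keepalived)( |$)'
}
```

Usage: ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r | keepalived_routes

Any route that keepalived is expected to manage but that does not appear in this output was installed by something else.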

It looks to me as though the

default via x.x.244.67 dev qg-6c2ee5e0-ad
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

routes were left over when keepalived v1.3.9 terminated, and therefore the eVIP x.x.244.116/26 must also have remained configured. I am guessing that keepalived v1.3.9 was killed with SIGKILL rather than SIGTERM.

So my theory is that keepalived v1.3.9 was killed with SIGKILL, causing it to leave the IP addresses and routes configured. keepalived v2.2.4 was then started. qg-6c2ee5e0-ad was in some state such that the old routes configured on it were not deleted, but it was down, hence the log messages and keepalived v2.2.4 not adding the routes. qg-6c2ee5e0-ad came up at some time between 12:32:51 and 12:33:24, keepalived reloaded its configuration at 12:35:56, and it was then able to add the routes (which specify proto 18).

I have tried your configuration, albeit not using openstack/neutron, and a simple load/reload of keepalived v2.2.4 does not exhibit the problem; indeed the routes added by keepalived always have proto 18. I have also tested it by manually creating the routes that I believe were left over by keepalived v1.3.9 and then started keepalived v2.2.4, and I can see the duplicate routes that you have observed.

If keepalived v2.2.4 is killed with SIGKILL and then started again, at startup it removes the VIPs and routes that were left over, and then when the instance becomes master again it adds them back. The reason that the keepalived v1.3.9 routes are not removed is that keepalived tries to remove routes with proto 18, and they don't exist.

While there might be a bug in keepalived v1.3.9 that caused the routes and IP addresses not to be removed when keepalived terminated (although I suspect it more likely that they were left because the VRRP keepalived process was sent SIGKILL), I don't see anything wrong with what keepalived v2.2.4 is doing. It could be argued that keepalived v2.2.4 should attempt to remove the routes without proto 18 as well, but I think the circumstances of what is happening here are so specific (i.e. keepalived is abnormally terminated, and then a new version of keepalived is run) that I think it is not worth adding the code to handle that.

If you can shed any light on the state of qg-6c2ee5e0-ad so that it appears to be down but the old routes are not deleted, then I would be very interested to understand that.

pqarmitage commented 2 years ago

I have noticed that keepalived is reporting:

Built with kernel headers for Linux 4.15.18
Running on Linux 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021
Distro: Ubuntu 18.04.6 LTS

The kernel headers that you have built keepalived against (v4.15.18) do not match the kernel you are running on (5.4). It shouldn't cause a problem running keepalived built against kernel headers for an older kernel than it is running on, but it can certainly cause problems the other way around, since keepalived detects at build time what kernel features are available, based on the kernel headers that are installed.

While I doubt that the problem you have experienced is anything related to the mismatch of kernel/kernel headers, it is certainly something we will never test or support.
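A check like this can be scripted against the version output. The following is a rough sketch under stated assumptions: it relies on the "Built with kernel headers for Linux" and "Running on Linux" lines having the form quoted above, it treats equal major.minor numbers as a match (an arbitrary choice made for this example), and the function name check_kernel_headers is hypothetical.

```shell
# Hypothetical helper: reads `keepalived -v` output on stdin and reports
# whether the build-time kernel headers and the running kernel agree on
# major.minor version.
check_kernel_headers() {
    awk '
        /Built with kernel headers for Linux/ { built = $NF }
        /^[ \t]*Running on Linux/ { split($4, r, "-"); running = r[1] }
        END {
            split(built, b, "."); split(running, k, ".")
            status = (b[1] == k[1] && b[2] == k[2]) ? "match" : "mismatch"
            printf "headers=%s running=%s %s\n", built, running, status
        }
    '
}
```

Usage: keepalived -v 2>&1 | check_kernel_headers (the 2>&1 is there in case the version text is written to stderr).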

mstinsky commented 2 years ago

Thanks for taking a look into this, really appreciated!

Sorry for not being more detailed in what I posted. The logs are from the keepalived instance that I am failing over to. neutron sets the qg- interface to the down state on every keepalived backup node, and it takes a couple of moments after keepalived is elected master for neutron to bring the qg- interface up. That's the reason the log shows the Netlink: error: messages.

The test was done on a fresh keepalived 2.2.4 without routes from 1.3.9 around.

Here are some states from backup to master and then a keepalived reload:

keepalived instance in a backup state:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12

After a failover to this node:

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet 169.254.0.13/24 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 scope global qr-15d63a29-8e
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea9:cc49/64 scope link nodad
       valid_lft forever preferred_lft forever
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff
    inet x.x.244.116/26 scope global qg-6c2ee5e0-ad
       valid_lft forever preferred_lft forever
    inet6 x:x:1003::22b/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed2:6cc7/64 scope link nodad
       valid_lft forever preferred_lft forever

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

And then after I do a kill -HUP ${pid} against the master keepalived process: (interfaces stay the same)

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link

So I get into this state without any upgrade process. It's the same keepalived 2.2.4 process, which never terminates during the whole sequence I described.

One question regarding proto 18/keepalived: I suspect this came with keepalived 2.0.1 (https://github.com/acassen/keepalived/blob/master/ChangeLog#L2827)? Maybe this is not 100% supported by openstack neutron, so I will most likely create a report on the neutron project to see if someone knows about it. The only thing that confuses me a bit is that proto keepalived is not set on the initial failover but only after I do a reload of the keepalived process.

mstinsky commented 2 years ago

Okay, I think I now understand what is happening here. As you said, keepalived is not able to create the virtual routes because the interface is down at the moment it is elected master. I assume there is no retry after the failure. After taking a look at the neutron code I found that neutron also creates the same routes after bringing the qg- port up. Before keepalived implemented proto 18 this was most likely not a problem because the routes were identical. Now, when I or neutron reload keepalived, keepalived is able to create the routes with proto 18 because the qg- interface is up at the time of the reload, and we end up with duplicated routes.

I will create a neutron report to verify if my assumption is correct.

Is there any option to disable the functionality to set proto 18 on the routes for keepalived?

pqarmitage commented 2 years ago

I think what you have observed raises a number of questions.

  1. Why is neutron adding the routes? Are the routes in the neutron configuration, or has it somehow learned them from what keepalived was doing?
  2. Why is qg-6c2ee5e0-ad not up all the time? Is neutron somehow tracking what keepalived is doing and only ups the interface when it sees keepalived transition the VRRP instance to master state? If so, why?

The problem appears to be that keepalived and neutron are both trying to do the same thing, and that is always going to cause problems. From my perspective I would say stop neutron adding the VIPs and virtual routes that keepalived manages, and also stop neutron upping and downing qg-6c2ee5e0-ad, then keepalived should behave as you want. Alternatively, if you can't stop neutron adding the VIPs/routes (but I would then suggest that is a bug in neutron), then remove that from the keepalived configuration.

I have noticed that you have no_track configured on the virtual_ipaddress_excluded and the virtual routes. This is the reason that keepalived is failing to add the routes. If no_track were not specified, then the VRRP instance would be in fault state while qg-6c2ee5e0-ad is down, and only once qg-6c2ee5e0-ad came up would it transition to BACKUP and then MASTER state. I suspect no_track has been added as a workaround to neutron upping and downing the interface.

I have had a look at neutron keepalived configuration and the template configuration looks similar to yours. It certainly has the same configuration error as your configuration - specifying track_interface for the interface that the VRRP instance is using; keepalived automatically tracks the interface of the VRRP instance. The neutron template also specifies state SLAVE which is invalid, it should be state BACKUP as you have. But in any event, it is far better not to specify state at all and let keepalived deal with it, based on the priorities.

You ask if keepalived can be configured not to add proto 18. It can, but I don't think that is the right solution. I have suggested sorting out neutron so that it isn't adding the routes etc; if you can't do that, can neutron add the routes with proto 18? keepalived uses the protocol value to identify routes that it has added, and if that is changed it will not be able to manage them properly. In the keepalived configuration you could specify proto 0 or proto unspec against each of the virtual_routes, but I suspect in the long run you will be creating even more problems for yourself.

mstinsky commented 2 years ago

Thanks again for the detailed answer, all the insight helps a lot.

Regarding your questions:

  1. I don't know why neutron is trying to add the same routes as keepalived.
  2. The reason here if I remember correctly is that there is some potential traffic leak which breaks mac tables in the fabric. It was implemented here (https://review.opendev.org/c/openstack/neutron/+/707406/).

In the end I am not a neutron developer and do not have much insight into the reasoning behind the implementation. But I agree with you that the issue I am seeing seems to be an incompatibility between newer keepalived versions and neutron, so I created a report in the neutron project to see if someone has more insight and knows how to fix this on the neutron side. (https://bugs.launchpad.net/neutron/+bug/1956846)

I will test a patch for neutron in my test environment to set proto 0 on the routes as a workaround until there is a better solution.

And again thanks for your time looking into this and giving all the information! I will close this as this is not a keepalived issue.

pqarmitage commented 2 years ago

@mstinsky I have just merged commit aa48c33 which allows specifying add, prepend and append for routes.

The ip (iproute2) utility by default adds routes, which does not allow duplicate routes. This means that the same route with different protocols cannot both be installed. The kernel by default prepends routes, and this is what keepalived does (since the issue of add/prepend/append was not considered). The iproute2 utility can prepend and append routes using ip route prepend (this appears to be undocumented) and ip route append (which is documented).

When you specify a route in your keepalived configuration, you can now specify add or prepend (the default) or append.

Once you have commit aa48c33, if you change your virtual_route configuration to:

    virtual_routes {
        add ::/0 via x:x:1003::fffd dev qg-6c2ee5e0-ad no_track
        add 0.0.0.0/0 via x.x.244.67 dev qg-6c2ee5e0-ad no_track
        add x.x.244.128/25 dev qg-6c2ee5e0-ad scope link no_track
    }

(where add is specified on the line does not matter) then keepalived will fail to add the routes if they already exist, which I believe is what you want.