Kernel routes go missing after interface flaps

devicenull commented 3 years ago

This is with 30a276f3c144b3f1b2eba25d9b2bdc0a1cd6049a on CentOS 7 (running 3.10.0-1127.19.1.el7.x86_64)

[root@vultr local]# ip link add type veth peer veth2
[root@vultr local]# ip link set veth0 up
[root@vultr local]# ip link set veth1 up
[root@vultr local]# ip addr add 10.20.0.1/24 dev veth0
[root@vultr local]# ip addr add 10.20.0.2/24 dev veth0
[root@vultr local]# ip route add 10.20.20.0/24 via 10.20.0.2
[root@vultr local]# ip -4 route
default via 107.191.36.129 dev eth0
10.20.0.0/24 dev veth0 proto kernel scope link src 10.20.0.1
10.20.20.0/24 via 10.20.0.2 dev veth0
107.191.36.128/25 dev eth0 proto kernel scope link src 107.191.36.188
169.254.169.254 via 107.191.36.129 dev eth0 proto static
[root@vultr local]# vtysh

Hello, this is FRRouting (version 7.6-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

vultr.guest# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 107.191.36.129, eth0, 00:08:42
C>* 10.20.0.0/24 is directly connected, veth0, 00:02:51
K>* 10.20.20.0/24 [0/0] via 10.20.0.2, veth0, 00:02:37
C>* 107.191.36.128/25 is directly connected, eth0, 00:08:42
K>* 169.254.169.254/32 [0/0] via 107.191.36.129, eth0, 00:08:42
vultr.guest#
[root@vultr local]# ip link set veth1 down
[root@vultr local]# vtysh

Hello, this is FRRouting (version 7.6-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

vultr.guest# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 107.191.36.129, eth0, 00:08:59
C>* 10.0.0.0/24 is directly connected, eth1, 00:04:09
C>* 107.191.36.128/25 is directly connected, eth0, 00:08:59
K>* 169.254.169.254/32 [0/0] via 107.191.36.129, eth0, 00:08:59
vultr.guest#
[root@vultr local]# ip link set veth1 up
[root@vultr local]# ip -4 route
default via 107.191.36.129 dev eth0
10.20.0.0/24 dev veth0 proto kernel scope link src 10.20.0.1
10.20.20.0/24 via 10.20.0.2 dev veth0
107.191.36.128/25 dev eth0 proto kernel scope link src 107.191.36.188
169.254.169.254 via 107.191.36.129 dev eth0 proto static
[root@vultr local]# vtysh

Hello, this is FRRouting (version 7.6-dev).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

vultr.guest# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/0] via 107.191.36.129, eth0, 00:09:14
C>* 10.20.0.0/24 is directly connected, veth0, 00:00:07
C>* 107.191.36.128/25 is directly connected, eth0, 00:09:14
K>* 169.254.169.254/32 [0/0] via 107.191.36.129, eth0, 00:09:14
vultr.guest#

You'll note the 10.20.20.0/24 route goes missing from FRR, even though the kernel still knows about it.

2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWLINK(16), len=1268, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: RTM_NEWLINK update for veth1(5) sl_type 0 master 0 flags 0x1002
2020/10/13 18:47:00 ZEBRA: Intf veth1(5) has gone DOWN
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_DOWN veth1(0)
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE fe80::8881:50ff:fe6b:57b2/64 on veth1(0)
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):fe80::/64: rn 0x10ef590, re 0x10ef690, removing
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):fe80::/64 (MRIB): rn 0x10ef720, re 0x125f910, removing
2020/10/13 18:47:00 ZEBRA: rib_update: Scheduled VRF (ALL), event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=88, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=88, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=88, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELROUTE(25), len=116, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: RTM_DELROUTE ipv6 unicast proto kernel NS 0
2020/10/13 18:47:00 ZEBRA: rib_update_handle_vrf_all: Handling VRF (ALL) event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELROUTE(25), len=116, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: RTM_DELROUTE ipv6 unicast proto boot NS 0
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELADDR(21), len=72, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: netlink_interface_addr RTM_DELADDR veth1 flags 0x80:
2020/10/13 18:47:00 ZEBRA:   IFA_ADDRESS   fe80::8881:50ff:fe6b:57b2/64
2020/10/13 18:47:00 ZEBRA:   IFA_CACHEINFO pref -1, valid -1
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE fe80::8881:50ff:fe6b:57b2/64 on veth1(0)
2020/10/13 18:47:00 ZEBRA: rib_delete: (0:254):fe80::/64: via :: ifindex 5 type 2 doesn't exist in rib
2020/10/13 18:47:00 ZEBRA: rib_delete: (0:254):fe80::/64 (MRIB): via :: ifindex 5 type 2 doesn't exist in rib
2020/10/13 18:47:00 ZEBRA: rib_update: Scheduled VRF (ALL), event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELROUTE(25), len=116, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: RTM_DELROUTE ipv6 local proto  NS 0
2020/10/13 18:47:00 ZEBRA: Route rtm_type: local(2) intentionally ignoring
2020/10/13 18:47:00 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWLINK(16), len=1268, seq=0, pid=0
2020/10/13 18:47:00 ZEBRA: RTM_NEWLINK update for veth0(4) sl_type 0 master 0 flags 0x1003
2020/10/13 18:47:00 ZEBRA: Intf veth0(4) has gone DOWN
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_DOWN veth0(0)
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE fe80::5837:f1ff:fee7:2ac2/64 on veth0(0)
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):fe80::/64: rn 0x10ef590, re 0x12600e0, removing
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):fe80::/64 (MRIB): rn 0x10ef720, re 0x12602c0, removing
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE 10.20.0.1/24 on veth0(0)
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):10.20.0.0/24: rn 0x1260430, re 0x125feb0, removing
2020/10/13 18:47:00 ZEBRA: rib_delnode: (0:254):10.20.0.0/24 (MRIB): rn 0x125f7b0, re 0x125f8b0, removing
2020/10/13 18:47:00 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_DELETE 10.20.0.2/24 on veth0(0)
2020/10/13 18:47:00 ZEBRA: rib_delete: (0:254):10.20.0.0/24: via 0.0.0.0 ifindex 4 type 2 doesn't exist in rib
2020/10/13 18:47:00 ZEBRA: rib_delete: (0:254):10.20.0.0/24 (MRIB): via 0.0.0.0 ifindex 4 type 2 doesn't exist in rib
2020/10/13 18:47:00 ZEBRA: rib_update_handle_vrf_all: Handling VRF (ALL) event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):fe80::/64: rn 0x10ef590, removing re 0x10ef690
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):fe80::/64: rn 0x10ef590, re 0x10ef690
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):fe80::/64: rn 0x10ef590, removing re 0x12600e0
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):fe80::/64: rn 0x10ef590, re 0x12600e0
2020/10/13 18:47:00 ZEBRA: default(0:254):fe80::/64: Adding route rn 0x10ef720, re 0x10ef3e0 (connected)
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):fe80::/64 (MRIB): rn 0x10ef720, removing re 0x125f910
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):fe80::/64 (MRIB): rn 0x10ef720, re 0x125f910
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):fe80::/64 (MRIB): rn 0x10ef720, removing re 0x12602c0
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):fe80::/64 (MRIB): rn 0x10ef720, re 0x12602c0
2020/10/13 18:47:00 ZEBRA: default(0:254):10.20.0.0/24: Deleting route rn 0x1260430, re 0x125feb0 (connected)
2020/10/13 18:47:00 ZEBRA: 0:10.20.0.0/24: Redist del: re 0x125feb0 (connected), new re 0x0 (None)
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):10.20.0.0/24: rn 0x1260430, removing re 0x125feb0
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):10.20.0.0/24: rn 0x1260430, re 0x125feb0
2020/10/13 18:47:00 ZEBRA: rib_gc_dest: (0:0):10.20.0.0/24: removing dest from table
2020/10/13 18:47:00 ZEBRA: 0:10.20.0.0/24: Redist del: re 0x125f8b0 (connected), new re 0x0 (None)
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):10.20.0.0/24 (MRIB): rn 0x125f7b0, removing re 0x125f8b0
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):10.20.0.0/24 (MRIB): rn 0x125f7b0, re 0x125f8b0
2020/10/13 18:47:00 ZEBRA: rib_gc_dest: (0:0):10.20.0.0/24 (MRIB): removing dest from table
2020/10/13 18:47:00 ZEBRA: default(0:254):10.20.20.0/24: Deleting route rn 0x1260cf0, re 0x1260ad0 (kernel)
2020/10/13 18:47:00 ZEBRA: 0:10.20.20.0/24: Redist del: re 0x1260ad0 (kernel), new re 0x0 (None)
2020/10/13 18:47:00 ZEBRA: rib_process: (0:254):10.20.20.0/24: rn 0x1260cf0, removing re 0x1260ad0
2020/10/13 18:47:00 ZEBRA: rib_unlink: (0:254):10.20.20.0/24: rn 0x1260cf0, re 0x1260ad0
2020/10/13 18:47:00 ZEBRA: rib_gc_dest: (0:0):10.20.20.0/24: removing dest from table
2020/10/13 18:47:00 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):fe80::/64: Redist update re 0x10ef360 (connected), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: Redist update filter prefix fe80::/64
2020/10/13 18:47:00 ZEBRA: default(0:254):0.0.0.0/0 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):0.0.0.0/0 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):0.0.0.0/0: Redist update re 0x10efa90 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):169.254.169.254/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):169.254.169.254/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):169.254.169.254/32: Redist update re 0x10f0050 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: Redist update filter prefix 169.254.169.254/32
2020/10/13 18:47:00 ZEBRA: default(0:254):::/96 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):::/96 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):::/96: Redist update re 0x10f01c0 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):::ffff:0:0/96 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):::ffff:0:0/96 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):::ffff:0:0/96: Redist update re 0x10f03c0 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:a00::/24 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:a00::/24 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:a00::/24: Redist update re 0x10f06a0 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:7f00::/24 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:7f00::/24 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:7f00::/24: Redist update re 0x10f0890 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:a9fe::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:a9fe::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:a9fe::/32: Redist update re 0x10f0a80 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:ac10::/28 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:ac10::/28 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:ac10::/28: Redist update re 0x10f0d00 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:c0a8::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:c0a8::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:c0a8::/32: Redist update re 0x10f0ef0 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:e000::/19 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):2002:e000::/19 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):2002:e000::/19: Redist update re 0x10f10e0 (kernel), old 0x0 (None)
2020/10/13 18:47:00 ZEBRA: default(0:254):3ffe:ffff::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:00 ZEBRA: default(0:254):3ffe:ffff::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:00 ZEBRA: (0:254):3ffe:ffff::/32: Redist update re 0x10f12d0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWLINK(16), len=1268, seq=0, pid=0
2020/10/13 18:47:07 ZEBRA: RTM_NEWLINK update for veth1(5) sl_type 0 master 0 flags 0x11003
2020/10/13 18:47:07 ZEBRA: Intf veth1(5) has gone DOWN
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_DOWN veth1(0)
2020/10/13 18:47:07 ZEBRA: rib_update: Scheduled VRF (ALL), event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWROUTE(24), len=116, seq=0, pid=0
2020/10/13 18:47:07 ZEBRA: RTM_NEWROUTE ipv6 unicast proto boot NS 0
2020/10/13 18:47:07 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWROUTE(24), len=116, seq=0, pid=0
2020/10/13 18:47:07 ZEBRA: RTM_NEWROUTE ipv6 unicast proto kernel NS 0
2020/10/13 18:47:07 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWLINK(16), len=1268, seq=0, pid=0
2020/10/13 18:47:07 ZEBRA: RTM_NEWLINK update for veth1(5) sl_type 0 master 0 flags 0x11043
2020/10/13 18:47:07 ZEBRA: Intf veth1(5) has come UP
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UP veth1(0)
2020/10/13 18:47:07 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWLINK(16), len=1268, seq=0, pid=0
2020/10/13 18:47:07 ZEBRA: RTM_NEWLINK update for veth0(4) sl_type 0 master 0 flags 0x11043
2020/10/13 18:47:07 ZEBRA: Intf veth0(4) has come UP
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_UP veth0(0)
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_ADD fe80::5837:f1ff:fee7:2ac2/64 on veth0(0)
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:254):fe80::/64: Inserting route rn 0x10ef590, re 0x12061a0 (connected) existing (nil)
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:254):fe80::/64 (MRIB): Inserting route rn 0x10ef720, re 0x12602c0 (connected) existing (nil)
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_ADD 10.20.0.1/24 on veth0(0)
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:0):10.20.0.0/24: Inserting route rn 0x1260430, re 0x125f6e0 (connected) existing (nil)
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:0):10.20.0.0/24 (MRIB): Inserting route rn 0x12607c0, re 0x125fff0 (connected) existing (nil)
2020/10/13 18:47:07 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_ADD 10.20.0.2/24 on veth0(0)
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:254):10.20.0.0/24: Inserting route rn 0x1260430, re 0x12608a0 (connected) existing 0x125f6e0
2020/10/13 18:47:07 ZEBRA: rib_delnode: (0:254):10.20.0.0/24: rn 0x1260430, re 0x125f6e0, removing
2020/10/13 18:47:07 ZEBRA: rib_add_multipath_nhe: (0:254):10.20.0.0/24 (MRIB): Inserting route rn 0x12607c0, re 0x10ef690 (connected) existing 0x125fff0
2020/10/13 18:47:07 ZEBRA: rib_delnode: (0:254):10.20.0.0/24 (MRIB): rn 0x12607c0, re 0x125fff0, removing
2020/10/13 18:47:07 ZEBRA: rib_update_handle_vrf_all: Handling VRF (ALL) event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: rib_update_table: IPv4 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: rib_update_table: IPv6 VRF default Table 254 event RIB_UPDATE_KERNEL
2020/10/13 18:47:07 ZEBRA: default(0:254):fe80::/64: Adding route rn 0x10ef720, re 0x10ef3e0 (connected)
2020/10/13 18:47:07 ZEBRA: default(0:254):10.20.0.0/24: Adding route rn 0x1260430, re 0x12608a0 (connected)
2020/10/13 18:47:07 ZEBRA: rib_process: (0:254):10.20.0.0/24: rn 0x1260430, removing re 0x125f6e0
2020/10/13 18:47:07 ZEBRA: rib_unlink: (0:254):10.20.0.0/24: rn 0x1260430, re 0x125f6e0
2020/10/13 18:47:07 ZEBRA: default(0:254):10.20.0.0/24: Adding route rn 0x12607c0, re 0x10ef690 (connected)
2020/10/13 18:47:07 ZEBRA: rib_process: (0:254):10.20.0.0/24 (MRIB): rn 0x12607c0, removing re 0x125fff0
2020/10/13 18:47:07 ZEBRA: rib_unlink: (0:254):10.20.0.0/24 (MRIB): rn 0x12607c0, re 0x125fff0
2020/10/13 18:47:07 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):fe80::/64: Redist update re 0x10ef360 (connected), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: Redist update filter prefix fe80::/64
2020/10/13 18:47:07 ZEBRA: default(0:254):10.20.0.0/24 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):10.20.0.0/24 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):10.20.0.0/24: Redist update re 0x12608a0 (connected), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):0.0.0.0/0 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):0.0.0.0/0 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):0.0.0.0/0: Redist update re 0x10efa90 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):169.254.169.254/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):169.254.169.254/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):169.254.169.254/32: Redist update re 0x10f0050 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: Redist update filter prefix 169.254.169.254/32
2020/10/13 18:47:07 ZEBRA: default(0:254):::/96 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):::/96 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):::/96: Redist update re 0x10f01c0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):::ffff:0:0/96 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):::ffff:0:0/96 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):::ffff:0:0/96: Redist update re 0x10f03c0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:a00::/24 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:a00::/24 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:a00::/24: Redist update re 0x10f06a0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:7f00::/24 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:7f00::/24 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:7f00::/24: Redist update re 0x10f0890 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:a9fe::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:a9fe::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:a9fe::/32: Redist update re 0x10f0a80 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:ac10::/28 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:ac10::/28 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:ac10::/28: Redist update re 0x10f0d00 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:c0a8::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:c0a8::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:c0a8::/32: Redist update re 0x10f0ef0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:e000::/19 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):2002:e000::/19 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):2002:e000::/19: Redist update re 0x10f10e0 (kernel), old 0x0 (None)
2020/10/13 18:47:07 ZEBRA: default(0:254):3ffe:ffff::/32 update_from_ctx(): no fib nhg
2020/10/13 18:47:07 ZEBRA: default(0:254):3ffe:ffff::/32 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:07 ZEBRA: (0:254):3ffe:ffff::/32: Redist update re 0x10f12d0 (kernel), old 0x0 (None)
2020/10/13 18:47:09 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWADDR(20), len=72, seq=0, pid=0
2020/10/13 18:47:09 ZEBRA: netlink_interface_addr RTM_NEWADDR veth1 flags 0x80:
2020/10/13 18:47:09 ZEBRA:   IFA_ADDRESS   fe80::8881:50ff:fe6b:57b2/64
2020/10/13 18:47:09 ZEBRA:   IFA_CACHEINFO pref -1, valid -1
2020/10/13 18:47:09 ZEBRA: MESSAGE: ZEBRA_INTERFACE_ADDRESS_ADD fe80::8881:50ff:fe6b:57b2/64 on veth1(0)
2020/10/13 18:47:09 ZEBRA: rib_add_multipath_nhe: (0:254):fe80::/64: Inserting route rn 0x10ef590, re 0x125ff60 (connected) existing (nil)
2020/10/13 18:47:09 ZEBRA: rib_add_multipath_nhe: (0:254):fe80::/64 (MRIB): Inserting route rn 0x10ef720, re 0x12601a0 (connected) existing (nil)
2020/10/13 18:47:09 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWROUTE(24), len=116, seq=0, pid=0
2020/10/13 18:47:09 ZEBRA: RTM_NEWROUTE ipv6 local proto  NS 0
2020/10/13 18:47:09 ZEBRA: Route rtm_type: local(2) intentionally ignoring
2020/10/13 18:47:09 ZEBRA: default(0:254):fe80::/64: Adding route rn 0x10ef720, re 0x10ef3e0 (connected)
2020/10/13 18:47:09 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): no fib nhg
2020/10/13 18:47:09 ZEBRA: default(0:254):fe80::/64 update_from_ctx(): rib nhg matched, changed 'false'
2020/10/13 18:47:09 ZEBRA: (0:254):fe80::/64: Redist update re 0x10ef360 (connected), old 0x0 (None)
2020/10/13 18:47:09 ZEBRA: Redist update filter prefix fe80::/64

sworleys commented 3 years ago

to be clear this is a kernel route we are losing right, not staticd configured route?

sworleys commented 3 years ago

if so, could you change the title to "kernel" route

devicenull commented 3 years ago

Yea, it's a kernel route (not staticd)

devicenull commented 3 years ago

I bisected this down to 212df1de28ae2a21145389c01cc3a1952f95c1a8 - The comment there isn't super clear to me, but removing routes entirely when the interface goes down sounds like the wrong behavior. Any thoughts @pguibert6WIND

sworleys commented 3 years ago

@devicenull this commit would cause the route to be removed due to the interface down event. We need to know where it broke from coming back. Is it also this commit?

From reading the comment history, this commit was used to address the kernel not sending ROUTE_DELETE on ifdown anymore that has been a thorn in our side for a while as you know... it shouldn't break the behavior for ifup.

devicenull commented 3 years ago

Yes - prior to that commit the route came back just fine.

I was bisecting with this test case:

vtysh -c 'sh ip route 10.20.20.0/24'
ip link set veth1 down
sleep 3
vtysh -c 'sh ip route 10.20.20.0/24'
ip link set veth1 up
sleep 3
vtysh -c 'sh ip route 10.20.20.0/24'

Prior to that commit, I'd see:

Routing entry for 10.20.20.0/24
  Known via "kernel", distance 0, metric 0, best
  Last update 00:00:04 ago
  * 10.20.0.2, via veth0

Routing entry for 10.20.20.0/24
  Known via "kernel", distance 0, metric 0
  Last update 00:00:07 ago
  * 10.20.0.2, via veth0 inactive

Routing entry for 10.20.20.0/24
  Known via "kernel", distance 0, metric 0, best
  Last update 00:00:10 ago
  * 10.20.0.2, via veth0

sworleys commented 3 years ago

well thats flabbergasting. out of curiosity were you able to repro this on a new kernel? kinda smells like a kernel bug.

Maybe I am missing it in the logs, but i dont ever see a NEWROUTE message for that route from the kernel? Do you see it?

sworleys commented 3 years ago

you could use ip monitor route to verify

devicenull commented 3 years ago

This is on 3.10.0-1127.19.1.el7.x86_64 (default cent7 kernel)

[root@vultr ~]# ip mon
5: veth1@veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 8a:81:50:6b:57:b2 brd ff:ff:ff:ff:ff:ff
Deleted ff02::16 dev veth1 lladdr 33:33:00:00:00:16 NOARP
Deleted ff02::2 dev veth1 lladdr 33:33:00:00:00:02 NOARP
Deleted ff02::1:ff6b:57b2 dev veth1 lladdr 33:33:ff:6b:57:b2 NOARP
Deleted fe80::/64 dev veth1 proto kernel metric 256 pref medium
Deleted ff00::/8 dev veth1 table local metric 256 pref medium
Deleted 5: veth1    inet6 fe80::8881:50ff:fe6b:57b2/64 scope link
       valid_lft forever preferred_lft forever
Deleted local fe80::8881:50ff:fe6b:57b2 dev lo table local proto unspec metric 0 pref medium
4: veth0@veth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 5a:37:f1:e7:2a:c2 brd ff:ff:ff:ff:ff:ff
5: veth1@veth0: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 8a:81:50:6b:57:b2 brd ff:ff:ff:ff:ff:ff
ff00::/8 dev veth1 table local metric 256 pref medium
fe80::/64 dev veth1 proto kernel metric 256 pref medium
5: veth1@veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 8a:81:50:6b:57:b2 brd ff:ff:ff:ff:ff:ff
4: veth0@veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 5a:37:f1:e7:2a:c2 brd ff:ff:ff:ff:ff:ff
5: veth1    inet6 fe80::8881:50ff:fe6b:57b2/64 scope link
       valid_lft forever preferred_lft forever
local fe80::8881:50ff:fe6b:57b2 dev lo table local proto unspec metric 0 pref medium

Let me grab a newer kernel

devicenull commented 3 years ago

This is on 5.9.0-1.el7.elrepo.x86_64

[root@vultr ~]# ip monitor route
Deleted fe80::/64 dev veth1 proto kernel metric 256 pref medium
Deleted local fe80::d4b6:56ff:fea7:f05e dev veth1 table local proto kernel metric 0 pref medium
Deleted ff00::/8 dev veth1 table local metric 256 pref medium
ff00::/8 dev veth1 table local metric 256 pref medium
fe80::/64 dev veth1 proto kernel metric 256 pref medium
local fe80::d4b6:56ff:fea7:f05e dev veth1 table local proto kernel metric 0 pref medium

That's while running the test case script above. Corresponding zebra output:

Routing entry for 10.20.20.0/24
  Known via "kernel", distance 0, metric 0, best
  Last update 00:00:06 ago
  * 10.20.0.2, via veth0

% Network not in table
% Network not in table

sworleys commented 3 years ago

This one is going to require some deeper investigation... I was able to repro just a moment ago. I think we may be making some incorrect assumptions somewhere about how kernel routes and interface events along with what messages we should be receiving from the kernel.

Specifically, it looks like the kernel doesn't remove the route for link going down on the other side but FRR does. And I think thats the fix we need probably. To be more semantic with that. We assume they are proto down events and the kernel withdrew the route.

fdomain commented 3 years ago

I ran into the same issue (almost the same conditions as #7539).

If the link goes down, routes are marked linkdown in the kernel, removed from Zebra, but only the directly connected route comes back once the link is re-established.

Before the link goes down

default via 10.189.10.1 dev enp96s0f0 
10.189.10.0/25 dev enp96s0f0 proto kernel scope link src 10.189.10.10

routes in FRR

K>* 0.0.0.0/0 [0/0] via 10.189.10.1, enp96s0f0, 00:00:24
C>* 10.189.10.0/25 is directly connected, enp96s0f0, 00:00:24

Once the link is down

default via 10.189.10.1 dev enp96s0f0 linkdown 
10.189.10.0/25 dev enp96s0f0 proto kernel scope link src 10.189.10.10 linkdown

no routes in FRR

Once the link is re-established

default via 10.189.10.1 dev enp96s0f0 
10.189.10.0/25 dev enp96s0f0 proto kernel scope link src 10.189.10.10

routes in FRR (default route is missing)

C>* 10.189.10.0/25 is directly connected, enp96s0f0, 00:00:04

FRR Version : 7.5.1 OS: CentOS 7 Kernel: 5.4.119

Let me know if you think it could be useful to share the configuration and full log (or open a new issue).

coofercat commented 2 years ago

I've been tracking a problem that looks the same as this for a while. I have a handful of servers at a cloud provider who gives their servers interface information via DHCP. For various reasons, two servers have been configured with static IPs (after being handed them by DHCP) - these two servers never seem to show the problem of routes disappearing.

The remaining (DHCP) servers all show the problem at sporadic intervals. It's not practical to statically configure them, so I need a fix. It seems that whatever happens to the interface (presumably a flap?) causes the system to remove all the routes on it. The system goes on to put back the route to the local network, but FRR doesn't put the BGP routes back.

As with others experiences above, FRR still knows the routes, and still thinks all is well, so there's no problem at the BGP level. It would appear that a signal to FRR is missing, and so it is just unaware that routes it has previously added have been removed.

I found there's a possible workaround that is to run a script when the interface is (re)configured. If you have networkd-dispatcher, then it'll run scripts in /etc/networkd-dispatcher/routable.d when the interface is reconfigured. Such a script could restart FRR to prompt it to re-add the BGP routes.

I found this information here: https://unix.stackexchange.com/a/572302/176919 - but the test case provided (to run dhclient to refresh the IP) doesn't cause the problem - if you run dhclient as suggested, then FRR works correctly. It's only that FRR doesn't notice. That's a much harder test case for me to simulate, and comes with almost no logging that I can see. I'll throw a script in the right place and see if it turns anything up and report back - that could take hours or days though.

FWIW, I'm running FRR 7.2.1, Ubuntu 20.04, kernel 5.4.0-88-generic.

coofercat commented 2 years ago

I've caught a DHCP issue, and my workaround "fixed" it. Here's the log around that time:

Dec 12 07:01:15 myserver systemd-networkd[615]: ens3: DHCP lease lost
Dec 12 07:01:15 myserver dbus-daemon[726]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.2' (uid=100 pid=615 comm="/lib/systemd/systemd-networkd " label="unconfined")
Dec 12 07:01:15 myserver systemd[1]: Starting Hostname Service...
Dec 12 07:01:15 myserver systemd-networkd[615]: ens3: DHCPv4 address 10.0.16.135/24
Dec 12 07:01:15 myserver systemd-networkd[615]: ens3: DHCP: No gateway received from DHCP server.
Dec 12 07:01:15 myserver dbus-daemon[726]: [system] Successfully activated service 'org.freedesktop.hostname1'
Dec 12 07:01:15 myserver systemd[1]: Started Hostname Service.
Dec 12 07:01:15 myserver systemd-hostnamed[59306]: Changed host name to 'host-10-0-16-135'
Dec 12 07:01:15 myserver frr_workaround: /etc/networkd-dispatcher/routable.d/workaround.sh IFACE=ens3 AdministrativeState=configured - restarting FRR
Dec 12 07:01:15 myserver systemd[1]: Stopping FRRouting...
Dec 12 07:01:15 myserver frrinit.sh[59333]:  * Stopped watchfrr
Dec 12 07:01:15 myserver staticd[53684]: Terminating on signal
Dec 12 07:01:15 myserver frrinit.sh[59345]:  * Stopped staticd
Dec 12 07:01:15 myserver frrinit.sh[59347]:  * Stopped zebra
Dec 12 07:01:15 myserver frrinit.sh[59346]:  * Stopped bgpd
Dec 12 07:01:15 myserver systemd[1]: frr.service: Succeeded.
Dec 12 07:01:15 myserver systemd[1]: Stopped FRRouting.
Dec 12 07:01:15 myserver systemd[1]: Starting FRRouting...
Dec 12 07:01:15 myserver watchfrr.sh: Cannot stop zebra: pid file not found
Dec 12 07:01:15 myserver watchfrr.sh: Cannot stop bgpd: pid file not found
Dec 12 07:01:15 myserver watchfrr.sh: Cannot stop staticd: pid file not found
Dec 12 07:01:15 myserver frrinit.sh[59357]:  * Started watchfrr
Dec 12 07:01:15 myserver systemd[1]: Started FRRouting.
Dec 12 07:01:45 myserver systemd[1]: systemd-hostnamed.service: Succeeded.

My workaround script (/etc/networkd-dispatcher/routable.d/workaround.sh) looks like this:

#!/bin/bash

# Workaround FRR problem: https://github.com/FRRouting/frr/issues/7299

echo "$0 IFACE=${IFACE} AdministrativeState=$AdministrativeState - restarting FRR" | logger -t frr_workaround

systemctl status frr > /dev/null 2>&1
FRR=$?

if [ "${FRR}" = "0" ]; then
        systemctl restart frr > /dev/null 2>&1
fi

(remember to chmod 755 it!)

I guess a slight refinement might be to try to see if one of the routes is there, and if not then to restart FRR. That might avoid unnecessary disruption, although would take slightly longer to run when it needs to. Either way, this workaround has so far prevented three longer-term problems, but is of course a "sledgehammer to crack a nut".

scratchy commented 2 years ago

Somehow related to https://bugs.launchpad.net/ubuntu/+source/frr/+bug/1968550 . At least in both cases routes are missing which should be there.

leonshaw commented 2 years ago

Specifically, it looks like the kernel doesn't remove the route for link going down on the other side but FRR does.

@sworleys I ran into this issue and did some investigation. The kernel marks the route linkdown without RTM_DELROUTE events, thus no RTM_NEWROUTE when the interface comes up. Another tricky thing is /proc/sys/net/ipv?/conf/*/ignore_routes_with_linkdown (see also ip netconf and nlattr NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN). If it's 0 — the default, the route is NOT ignored and is effectively a black hole.

FRRouting / frr

Kernel routes go missing after interface flaps #7299