sysoleg commented 12 months ago

I can reliably reproduce the situation where zebra won't automatically recover after a failed RTM_NEWNEXTHOP due to a Carrier for nexthop device is down error. I'm using a virtual device here, but we've seen the same problem on a production system after a link flap on a physical NIC.

Here is the latest FRR configuration sufficient to reproduce the behaviour:

interface dum0
 ip address 192.168.0.1/30
exit

ip route 192.168.55.55/32 192.168.0.2

Here is the script to reproduce the bug:

#!/bin/sh

# Assume the initial operstate of dum0 is UP
# dum0 added with `ip link add dum0 type dummy` and configured with FRR

echo 0 > /sys/class/net/dum0/carrier
sleep 0.5

echo 1 > /sys/class/net/dum0/carrier
logger "dum0 carrier is up"

echo 0 > /sys/class/net/dum0/carrier
logger "dum0 carrier is down"
sleep 0.2

#sleep 1 # To prevent the bug from reproducing, uncomment this line

echo 1 > /sys/class/net/dum0/carrier

After running the script I don't see the 192.168.55.55/32 route in the kernel table.

Log from zebra with debugging enabled: https://gist.github.com/sysoleg/c171e06b7cde67c2c4d06810fcae300e

The reason why RTM_NEWNEXTHOP failed is because netif_carrier_ok(dev) is false (see rtm_to_nh_config() in the Linux kernel source). The NIC driver calls netif_carrier_off(dev) and this immediately sets the __LINK_STATE_NOCARRIER device state flag. The RTM_NEWLINK message is not even sent at this stage, only possibly scheduled (see below). Since there is almost no delay between the echo 1 and echo 0, the flag is set (again) before zebra starts to install nexthops into the kernel. At that moment, zebra thinks that it's OK to install the nexthop object, but it's not, and it doesn't know it.

Now to the reason why automatic recovery fails.

Running the script without the last sleep (sleep 1) gives us three RTM_NEWLINK events (you can monitor these with the ip monitor coomand):

At 09:46:30.857392 with operstate=DOWN, after writing 0 to the carrier file
At 09:46:31.359557 with operstate=UP, after writing 1 to the carrier file
At 09:46:31.865158 the same as the previous event (note 1s and not ~0.7s delay as the script suggests, see explanation below)

So only three events (DOWN, UP, UP). If there were four (DOWN, UP, DOWN, UP) - automatic recovery would be possible after the last (fourth) UP. But in the three-event scenario zebra won't try to reinstall nexthops after the third UP because it thinks the interface is already UP (having already processed the second UP with problems installing objects into the kernel).

The reason why we only get three events is in the Linux kernel. Looking into linkwatch_fire_event(), which is called by netif_carrier_off(dev) or netif_carrier_on(dev), we can see that the work, that leads to the operstate update and eventually the RTM_NETLINK being sent, is only scheduled immediately if the event is "urgent" (link is up in our case) or if there's no linkwatch event currently pending and we're "wrapped around" (see linkwatch_schedule_work()). Otherwise, work is delayed by up to HZ jiffies (1 second) to (quote) "runaway driver does not cause a storm of messages on the netlink socket".

If you uncomment the sleep 1 line in the mentioned script, you'll get four events, because in this case you're delaying the last carrier "on", which allows delayed work that was created after the previous carrier "off" to update the device link status and eventually send an event while the carrier is still "off". No nexthop problems after the last UP in this case.

So the kernel is working as it should: if not urgent - no more than one event per second, but as we have two consecutive UPs with no DOWN in between, the last UP is ignored by the FRR.

Any thoughts on this?

Jafaral commented 11 months ago

@riw777 any PRs you know of?

riw777 commented 11 months ago

12239, #14515, #14102 ?? These all seem to be in the same area?

sysoleg commented 10 months ago

The problem is still reproducible in the latest master. Before running the script (see issue description above):

# ip next show dev dum0
id 7 dev dum0 scope link proto zebra 
id 12 via 192.168.0.2 dev dum0 scope link proto zebra 

# vtysh -c 'show nexthop rib 7'
ID: 7 (zebra)
     RefCnt: 2
     Uptime: 00:00:47
     VRF: default
     Valid, Installed
     Interface Index: 9
           is directly connected, dum0 (vrf default)

# vtysh -c 'show nexthop rib 12'
ID: 12 (zebra)
     RefCnt: 1
     Uptime: 00:00:47
     VRF: default
     Valid, Installed
     Interface Index: 9
           via 192.168.0.2, dum0 (vrf default), weight 1

# ip ro show dev dum0
192.168.0.0/30 proto kernel scope link src 192.168.0.1 
192.168.55.55 nhid 12 via 192.168.0.2 proto static metric 20 

# vtysh -c 'show ip route'
<skip>
C>* 192.168.0.0/30 is directly connected, dum0, 00:00:09
S>* 192.168.55.55/32 [1/0] via 192.168.0.2, dum0, weight 1, 00:00:08

After running the script:

# ip next show dev dum0
<empty output>

# vtysh -c 'show nexthop rib'
ID: 16 (zebra)
     RefCnt: 1
     Uptime: 00:00:15
     VRF: default
     Valid
     Interface Index: 9
           via 192.168.0.2, dum0 (vrf default), weight 1
<skip>
ID: 14 (zebra)
     RefCnt: 2
     Uptime: 00:00:15
     VRF: default
     Valid
     Interface Index: 9
           is directly connected, dum0 (vrf default)

# ip ro show dev dum0
192.168.0.0/30 proto kernel scope link src 192.168.0.1

# vtysh -c 'show ip route' 
<skip>
C>* 192.168.0.0/30 is directly connected, dum0, 00:00:10
S>r 192.168.55.55/32 [1/0] via 192.168.0.2, dum0, weight 1, 00:00:10

Note the absence of Installed in the nexthop entry and the "r" (ROUTE_ENTRY_FAILED) flag for static route.

Relevant log messages:

Nov 22 20:06:59 cl1 zebra[160819]: [YXCJP-0WZWV] netlink_nexthop_msg_encode: ID (14): directly connected, dum0(9) vrf default(0)
Nov 22 20:06:59 cl1 zebra[160819]: [R43C6-KYHWT] netlink_nexthop_msg_encode: RTM_NEWNEXTHOP, id=14
Nov 22 20:06:59 cl1 zebra[160819]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=40, msg cnt=1
Nov 22 20:06:59 cl1 zebra[160819]: [TJ327-ET8HE] netlink_send_msg: >> netlink message dump [sent]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=40 type=(104) NEWNEXTHOP flags=(0x0501) {REQUEST,DUMP,(ROOT|REPLACE|CAPPED),(ATOMIC|CREATE)} seq=14 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [WCX94-SW894]   nhm [family=(2) AF_INET scope=(0) UNIVERSE protocol=(11) ZEBRA flags=0x00000000 {}]
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(1) ID]
Nov 22 20:06:59 cl1 zebra[160819]: [Z4E9C-GD9EP]       14
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(5) OIF]
Nov 22 20:06:59 cl1 zebra[160819]: [JR4EA-BKPTA]       9
Nov 22 20:06:59 cl1 zebra[160819]: [V8KNF-8EXH8] netlink_recv_msg: << netlink message dump [recv]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=76 type=(2) ERROR flags=(0x0300) {DUMP,(ROOT|REPLACE|CAPPED),(MATCH|EXCLUDE|ACK_TLVS)} seq=14 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [KWP1C-6CSXF]   nlmsgerr [error=(-100) Network is down]
Nov 22 20:06:59 cl1 zebra[160819]: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
Nov 22 20:06:59 cl1 zebra[160819]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=14, pid=3231857309
Nov 22 20:06:59 cl1 zebra[160819]: [QTT8V-3ZQ34] nl_batch_read_resp: netlink error message seq=14
Nov 22 20:06:59 cl1 zebra[160819]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (14[if 9 vrfid 0]) into the kernel

Nov 22 20:06:59 cl1 zebra[160819]: [YXCJP-0WZWV] netlink_nexthop_msg_encode: ID (16): 192.168.0.2, via dum0(9) vrf default(0)
Nov 22 20:06:59 cl1 zebra[160819]: [R43C6-KYHWT] netlink_nexthop_msg_encode: RTM_NEWNEXTHOP, id=16
Nov 22 20:06:59 cl1 zebra[160819]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=48, msg cnt=1
Nov 22 20:06:59 cl1 zebra[160819]: [TJ327-ET8HE] netlink_send_msg: >> netlink message dump [sent]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=48 type=(104) NEWNEXTHOP flags=(0x0501) {REQUEST,DUMP,(ROOT|REPLACE|CAPPED),(ATOMIC|CREATE)} seq=16 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [WCX94-SW894]   nhm [family=(2) AF_INET scope=(0) UNIVERSE protocol=(11) ZEBRA flags=0x00000000 {}]
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(1) ID]
Nov 22 20:06:59 cl1 zebra[160819]: [Z4E9C-GD9EP]       16
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(6) GATEWAY]
Nov 22 20:06:59 cl1 zebra[160819]: [M8QV4-KY9C0]       192.168.0.2
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(5) OIF]
Nov 22 20:06:59 cl1 zebra[160819]: [JR4EA-BKPTA]       9
Nov 22 20:06:59 cl1 zebra[160819]: [V8KNF-8EXH8] netlink_recv_msg: << netlink message dump [recv]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=76 type=(2) ERROR flags=(0x0300) {DUMP,(ROOT|REPLACE|CAPPED),(MATCH|EXCLUDE|ACK_TLVS)} seq=16 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [KWP1C-6CSXF]   nlmsgerr [error=(-100) Network is down]
Nov 22 20:06:59 cl1 zebra[160819]: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
Nov 22 20:06:59 cl1 zebra[160819]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=16, pid=3231857309
Nov 22 20:06:59 cl1 zebra[160819]: [QTT8V-3ZQ34] nl_batch_read_resp: netlink error message seq=16
Nov 22 20:06:59 cl1 zebra[160819]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_NEWROUTE 192.168.55.55/32 vrf 0(254)
Nov 22 20:06:59 cl1 zebra[160819]: [J87BH-XW5PP] netlink_route_multipath_msg_encode: 192.168.55.55/32 nhg_id is 16
Nov 22 20:06:59 cl1 zebra[160819]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=52, msg cnt=1
Nov 22 20:06:59 cl1 zebra[160819]: [TJ327-ET8HE] netlink_send_msg: >> netlink message dump [sent]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=52 type=(24) NEWROUTE flags=(0x0501) {REQUEST,DUMP,(ROOT|REPLACE|CAPPED),(ATOMIC|CREATE)} seq=17 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [GCEGC-W8YBF]   rtmsg [family=(2) AF_INET dstlen=32 srclen=0 tos=0 table=254 protocol=(196) UNKNOWN scope=(0) UNIVERSE type=(1) UNICAST flags=0x0000 {}]
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(1) DST]
Nov 22 20:06:59 cl1 zebra[160819]: [M8QV4-KY9C0]       192.168.55.55
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(6) PRIORITY]
Nov 22 20:06:59 cl1 zebra[160819]: [Z4E9C-GD9EP]       20
Nov 22 20:06:59 cl1 zebra[160819]: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(30) NH_ID]
Nov 22 20:06:59 cl1 zebra[160819]: [Z4E9C-GD9EP]       16
Nov 22 20:06:59 cl1 zebra[160819]: [V8KNF-8EXH8] netlink_recv_msg: << netlink message dump [recv]
Nov 22 20:06:59 cl1 zebra[160819]: [JAS4D-NCWGP] nlmsghdr [len=68 type=(2) ERROR flags=(0x0300) {DUMP,(ROOT|REPLACE|CAPPED),(MATCH|EXCLUDE|ACK_TLVS)} seq=17 pid=3231857309]
Nov 22 20:06:59 cl1 zebra[160819]: [KWP1C-6CSXF]   nlmsgerr [error=(-22) Invalid argument]
Nov 22 20:06:59 cl1 zebra[160819]: [HSYZM-HV7HF] Extended Error: Nexthop id does not exist
Nov 22 20:06:59 cl1 zebra[160819]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Invalid argument, type=RTM_NEWROUTE(24), seq=17, pid=3231857309
Nov 22 20:06:59 cl1 zebra[160819]: [QTT8V-3ZQ34] nl_batch_read_resp: netlink error message seq=17
Nov 22 20:06:59 cl1 zebra[160819]: [MFYWV-KH3MC] process_subq_early_route_add: (0:?):192.168.55.55/32: Inserting route rn 0x5628b3007fc0, re 0x5628b30088f0 (static) existing 0x0, same_count 0
Nov 22 20:06:59 cl1 zebra[160819]: [Q4T2G-E2SQF] process_subq_early_route_add: dumping RE entry 0x5628b30088f0 for 192.168.55.55/32 vrf default(0)
Nov 22 20:06:59 cl1 zebra[160819]: [M5M58-9PD2R] 192.168.55.55/32: uptime == 10393009, type == 3, instance == 0, table == 254
Nov 22 20:06:59 cl1 zebra[160819]: [RVZMM-N7DME] 192.168.55.55/32: metric == 0, mtu == 0, distance == 1, flags == Recursion RR Distance status == None
Nov 22 20:06:59 cl1 zebra[160819]: [KC40Q-0DH3A] 192.168.55.55/32: tag == 0, nexthop_num == 1, nexthop_active_num == 0
Nov 22 20:06:59 cl1 zebra[160819]: [ZSB1Z-XM2V3] 192.168.55.55/32: NH 192.168.0.2[0] vrf default(0) wgt 1, with flags
Nov 22 20:06:59 cl1 zebra[160819]: [SCETK-GQ9E4] 192.168.55.55/32: dump complete
Nov 22 20:06:59 cl1 zebra[160819]: [GCGMT-SQR82] rib_link: (0:?):192.168.55.55/32: rn 0x5628b3007fc0 adding dest
Nov 22 20:06:59 cl1 zebra[160819]: [JF0K0-DVHWH] rib_meta_queue_add: (0:254):192.168.55.55/32: queued rn 0x5628b3007fc0 into sub-queue Static Routes
Nov 22 20:06:59 cl1 zebra[160819]: [NZNZ4-7P54Y] default(0:254):192.168.55.55/32: Processing rn 0x5628b3007fc0
Nov 22 20:06:59 cl1 zebra[160819]: [ZJVZ4-XEGPF] default(0:254):192.168.55.55/32: Examine re 0x5628b30088f0 (static) status: Changed flags: Recursion RR Distance dist 1 metric 0
Nov 22 20:06:59 cl1 zebra[160819]: [JPJF4-TGCY5] default(0:254):192.168.55.55/32: After processing: old_selected 0x0 new_selected 0x5628b30088f0 old_fib 0x0 new_fib 0x5628b30088f0
Nov 22 20:06:59 cl1 zebra[160819]: [ZFRN1-EZNJZ] default(0:254):192.168.55.55/32: Adding route rn 0x5628b3007fc0, re 0x5628b30088f0 (static)
Nov 22 20:06:59 cl1 zebra[160819]: [HH6N2-PDCJS] default(0:254):192.168.55.55/32 rn 0x5628b3007fc0 dequeued from sub-queue Static Routes
Nov 22 20:06:59 cl1 zebra[160819]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (16[192.168.0.2 if 9 vrfid 0]) into the kernel
Nov 22 20:06:59 cl1 staticd[160826]: [S4MGP-4WQTA] route_notify_owner: Route 192.168.55.55/32 failed to install for table: 254
Nov 22 20:06:59 cl1 zebra[160819]: [HZ1TW-92EY6] Notifying Owner: static about prefix 192.168.55.55/32(254) 0 vrf: 0
Nov 22 20:06:59 cl1 zebra[160819]: [VYKYC-709DP] default(0:254):192.168.55.55/32: Route install failed

The problem is that the link carrier goes down while zebra is installing nexthops into the kernel. Running zebra with no zebra nexthop kernel enable seems to be a workaround. The problem doesn't reproduce then.

zstas commented 6 months ago

I have the similar issue, when interface flaps sometimes I get such errors and no routes installed. any plans to solve it? I can give it a try if smb will tell me the direction.

sysoleg commented 6 months ago

@zstas Have you seen this PR: https://github.com/FRRouting/frr/pull/15082?

zstas commented 6 months ago

well, this is good news. I hope it will be merged soon, so I can backport it to my branch. thank you!

a-willem commented 6 months ago

Hi, I've possibly ran in the same condition, interface was considered as invalid, YET the BGP peerings are up over it, routes imported but marked as rejected. As this happens in a production env, it's hard to reproduce. I've added some logging and alerting to catch more precise logs the next time it happens.

FRRouting / frr

Zebra won't automatically recover after a failed RTM_NEWNEXTHOP (bug analysis) #14481

12239, #14515, #14102 ?? These all seem to be in the same area?