FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

On Docker, zebra crashed when interface state moved from up->down->up #13523

Closed skaliassk closed 10 months ago

skaliassk commented 1 year ago

While running FRR on Docker, zebra crashed when interface state moved from up->down->up

Topology:

Peer Device -------------17.0.0.0/8------------------- FRR on Docker (with macvlan mode)

Crash log

2023/05/11 14:19:52 BGP: [WNKP5-SN018] Found existing bnc 17.222.0.254/32(0)(VRF default) flags 0xa ifindex 0 #paths 0 peer 0x7f212afd7010
2023/05/11 14:20:09 BGP: [N25MR-FXT2C] Rx Intf down VRF 0 IF eth0
2023/05/11 14:20:09 BGP: [N25MR-FXT2C] Rx Intf down VRF 0 IF eth0
2023/05/11 14:20:09 BGP: [KGTKH-FVHEW] Rx Router Id update VRF 0 Id 0.0.0.0/32
2023/05/11 14:20:09 BGP: [WMCA1-27995] RID change : vrf VRF default(0), RTR ID 0.0.0.0
2023/05/11 14:20:09 BGP: [ZN4WJ-AVQKV] Rx Intf address del VRF 0 IF eth0 addr 17.0.0.2/8
2023/05/11 14:20:09 ZEBRA: [HSYZM-HV7HF] Extended Error: Carrier for nexthop device is down
2023/05/11 14:20:09 ZEBRA: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=5, pid=4194334435
2023/05/11 14:20:09 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (4[17.0.0.1 if 8203]) into the kernel
2023/05/11 14:20:42 BGP: [ZXFVW-H54SV] Rx Intf up VRF 0 IF eth0
2023/05/11 14:20:42 BGP: [ZXFVW-H54SV] Rx Intf up VRF 0 IF eth0
ZEBRA: Received signal 11 at 1683795042 (si_addr 0xc8, PC 0x563dd6992a04); aborting...
2023/05/11 14:20:42 BGP: [KGTKH-FVHEW] Rx Router Id update VRF 0 Id 17.0.0.2/32
2023/05/11 14:20:42 BGP: [WMCA1-27995] RID change : vrf VRF default(0), RTR ID 17.0.0.2
2023/05/11 14:20:42 BGP: [GYPW0-GVZQ8] Rx Intf address add VRF 0 IF eth0 addr 17.0.0.2/8
ZEBRA: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x6d) [0x7fc8e41adccd]
ZEBRA: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0xf3) [0x7fc8e41aded3]
ZEBRA: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0xce631) [0x7fc8e41da631]
ZEBRA: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fc8e40d9730]
ZEBRA: /usr/lib/frr/zebra(zebra_vxlan_macvlan_up+0x24) [0x563dd6992a04]
ZEBRA: /usr/lib/frr/zebra(if_up+0x248) [0x563dd6914238]
ZEBRA: /usr/lib/frr/zebra(netlink_link_change+0xc6b) [0x563dd690e9ab]
ZEBRA: /usr/lib/frr/zebra(netlink_parse_info+0x14b) [0x563dd691a25b]
ZEBRA: /usr/lib/frr/zebra(+0x954ea) [0x563dd691a4ea]
ZEBRA: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x7d) [0x7fc8e41ec4ed]
ZEBRA: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xe8) [0x7fc8e41a6178]
ZEBRA: /usr/lib/frr/zebra(main+0x3a3) [0x563dd6907333]
ZEBRA: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fc8e3f2b09b]
ZEBRA: /usr/lib/frr/zebra(_start+0x2a) [0x563dd6907f6a]
ZEBRA: in thread kernel_read scheduled from ../zebra/kernel_netlink.c:505 kernel_read()
2023/05/11 14:20:47 STATIC: [MRN6F-AYZC4] Terminating on signal
2023/05/11 14:20:48 BGP: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2023/05/11 14:20:52 BGP: [TNK7N-FJF7K] Registering VRF 0
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF eth0
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF lo
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF eth0
2023/05/11 14:20:52 BGP: [GYPW0-GVZQ8] Rx Intf address add VRF 0 IF eth0 addr 17.0.0.2/8
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF lo
2023/05/11 14:20:52 BGP: [KGTKH-FVHEW] Rx Router Id update VRF 0 Id 17.0.0.2/32
2023/05/11 14:20:52 BGP: [WMCA1-27995] RID change : vrf VRF default(0), RTR ID 17.0.0.2
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF eth0
2023/05/11 14:20:52 BGP: [GYPW0-GVZQ8] Rx Intf address add VRF 0 IF eth0 addr 17.0.0.2/8
2023/05/11 14:20:52 BGP: [HKBB3-YX6A9] Rx Intf add VRF 0 IF lo
2023/05/11 14:20:52 BGP: [MTH7E-8CG2C] Label Chunk assign: 16 - 143 (0)
2023/05/11 14:21:52 BGP: [WNKP5-SN018] Found existing bnc 17.222.0.254/32(0)(VRF default) flags 0xa ifindex 0 #paths 0 peer 0x7f212afd7010
2023/05/11 14:23:52 BGP: [WNKP5-SN018] Found existing bnc 17.222.0.254/32(0)(VRF default) flags 0xa ifindex 0 #paths 0 peer 0x7f212afd7010

Configuration:

# show  running-config
Building configuration...

Current configuration:
!
frr version 8.5.1
frr defaults traditional
hostname 81ce0b40637c
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 100
 no bgp suppress-duplicates
 no bgp hard-administrative-reset
 no bgp graceful-restart notification
 no bgp network import-check
 neighbor 17.222.0.254 remote-as 100
 !
 address-family ipv4 unicast
  network 200.0.1.1/32
  network 200.0.1.2/32
  network 200.0.1.3/32
  network 200.0.1.4/32
  network 200.0.1.5/32
  neighbor 17.222.0.254 route-map DENY_ALL in
 exit-address-family
exit
!
route-map DENY_ALL deny 1
exit
!
end

Describe the bug

To Reproduce

Once the BGP neighborship is established, bring down the peer device's interface by issuing "shutdown", then bring it back up with "no shutdown".

Expected behavior

BGP should re-establish the neighborship and zebra should not crash.

Versions

tlsalmin commented 1 year ago

Same here. link_ifp is NULL

(gdb) p link_ifp
$1 = (struct interface *) 0x0
(gdb) p *zif
$3 = {ifp = 0x55555599caa0, flags = 0, shutdown = 2 '\002', multicast = 0 '\000', mpls = false, linkdown = false, linkdownv6 = false, v4mcast_on = false, v6mcast_on = false, rtadv_enable = 0 '\000', ipv4_subnets = 0x55555599cf00, nhg_dependents = {rr = {rbt_root = 0x555555af9c10, count = 2}}, up_count = 1, up_last = "2023/06/07 14:13:02.15", '\000' <repeats 17 times>, down_count = 0, down_last = '\000' <repeats 39 times>, rtadv = {AdvSendAdvertisements = 0, MaxRtrAdvInterval = 600000, MinRtrAdvInterval = 198000, AdvIntervalTimer = 0, AdvManagedFlag = 0, lastadvmanagedflag = {tv_sec = 0, tv_usec = 0}, AdvOtherConfigFlag = 0, lastadvotherconfigflag = {tv_sec = 0, tv_usec = 0}, AdvLinkMTU = 0, AdvReachableTime = 0, lastadvreachabletime = {tv_sec = 0, tv_usec = 0}, AdvRetransTimer = 0, lastadvretranstimer = {tv_sec = 0, tv_usec = 0}, AdvCurHopLimit = 64, lastadvcurhoplimit = {tv_sec = 0, tv_usec = 0}, AdvDefaultLifetime = -1, prefixes = {{rr = {rbt_root = 0x55555599f140, count = 1}}}, AdvHomeAgentFlag = 0, HomeAgentPreference = 0, HomeAgentLifetime = -1, AdvIntervalOption = 0, DefaultPreference = 0, AdvRDNSSList = 0x55555599bf40, AdvDNSSLList = 0x55555599ced0, UseFastRexmit = true, inFastRexmit = 0 '\000', ra_configured = 0 '\000', NumFastReXmitsRemain = 0}, ra_sent = 0, ra_rcvd = 0, irdp = 0x0, ptm_enable = 0 '\000', zif_type = ZEBRA_IF_MACVLAN, zif_slave_type = ZEBRA_IF_SLAVE_NONE, l2info = {br = {vlan_aware = 0 '\000'}, vl = {vid = 0}, vxl = {vni = 0, vtep_ip = {s_addr = 0}, access_vlan = 0, mcast_grp = {s_addr = 0}, ifindex_link = 0, link_nsid = 0}, gre = {vtep_ip = {s_addr = 0}, vtep_ip_remote = {s_addr = 0}, ikey = 0, okey = 0, ifindex_link = 0, link_nsid = 0}}, brslave_info = {bridge_ifindex = 0, br_if = 0x0, ns_id = 0}, bondslave_info = {bond_ifindex = 0, bond_if = 0x0}, bond_info = {mbr_zifs = 0x0}, es_info = {sysmac = {octet = "\000\000\000\000\000"}, lid = 0, esi = {val = "\000\000\000\000\000\000\000\000\000"}, df_pref = 0, flags = 0 '\000', es = 0x0}, vlan_bitmap = {data = 0x0, n = 0, m = 0}, protodown_rc = 0, mac_list = 0x0, link_ifindex = 32, link = 0x0, speed_update_count = 0 '\000', speed_update = 0x55555599cf40, v6_2_v4_ll_neigh_entry = false, neigh_mac = "\000\000\000\000\000", v6_2_v4_ll_addr6 = {in6_u = {__u6_addr8 = '\000' <repeats 15 times>, u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, desc = 0x0}

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

frrbot[bot] commented 11 months ago

This issue will be automatically closed in the specified period unless there is further activity.

tlsalmin commented 10 months ago

The problem seems to be triggered by a link down/up change on a macvlan interface whose parent interface isn't visible in the network namespace where the zebra process is running. In that case the zebra_if_update_link call leaves zif->link as NULL. The crash then happens in if_up -> zebra_vxlan_macvlan_up, which assumes the link pointer is non-NULL.

tlsalmin commented 10 months ago

Fixing in https://github.com/FRRouting/frr/pull/15010