FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

IPv6 multipath routes are not installed after another interface flaps #14160

Open alisenkov opened 1 year ago

alisenkov commented 1 year ago

Hello everyone, we have a problem with ECMP routes on Ubuntu 22.04. We tried FRR versions 7.1, 8.5, and 9.1; the problem reproduces on all of them.

config FRR:

Current configuration:
!
frr version 9.1-dev999
frr defaults traditional
hostname front3
log file /var/log/frr/frr-debug.log
log syslog warnings
no ipv6 forwarding
service integrated-vtysh-config
!
debug zebra kernel
debug zebra nexthop
!
router bgp 65119
 bgp router-id 100.64.1.1
 bgp bestpath as-path multipath-relax
 timers bgp 10 30
 neighbor pgv6-0 peer-group
 neighbor pgv6-0 remote-as 65110
 neighbor enp59s0 interface peer-group pgv6-0
 neighbor enp59s0d1 interface peer-group pgv6-0
 !
 address-family ipv4 unicast
  network 100.64.1.1/32
  neighbor pgv6-0 soft-reconfiguration inbound
  neighbor pgv6-0 route-map allow-all in
  neighbor pgv6-0 route-map Lo out
 exit-address-family
exit
!
ip prefix-list Lo0 seq 5 permit 100.64.1.1/32
!
route-map Lo permit 10
 match ip address prefix-list Lo0
exit
!
route-map Lo deny 100

route-map allow-all permit 10
exit
!

The server has two 10GE interfaces (enp59s0, enp59s0d1) that carry production traffic and one (eno1) for OOB management. We use IPv6 ND on the links between the server and the L3 switch, and establish eBGP peering over those links.

In the normal state, FRR and the kernel both have an ECMP default route with two nexthops toward the eBGP neighbors:

front3# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 0.0.0.0/0 [20/0] via fe80::c2d6:82ff:fef4:3cc4, enp59s0, weight 1, 00:23:01
                     via fe80::c2d6:82ff:fef4:428c, enp59s0d1, weight 1, 00:23:01

And this multipath default route is correctly installed in the kernel:

root@front3:/home/admin# ip ro
default nhid 22 proto bgp metric 20
    nexthop via inet6 fe80::c2d6:82ff:fef4:3cc4 dev enp59s0 weight 1
    nexthop via inet6 fe80::c2d6:82ff:fef4:428c dev enp59s0d1 weight 1

root@front3:/home/admin# ip nexthop ls
id 28 dev lo scope link proto zebra
id 29 dev br-f59be21aad6b scope link proto zebra
id 33 dev enp59s0 scope link proto zebra
id 34 group 35/36 proto zebra
id 35 via fe80::c2d6:82ff:fef4:3cc4 dev enp59s0 scope link proto zebra
id 36 via fe80::c2d6:82ff:fef4:428c dev enp59s0d1 scope link proto zebra
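
To make the healthy state above easier to compare against a broken one, here is a minimal Python sketch that parses the text output of "ip nexthop ls" and lists which devices a nexthop group spans. The function names (parse_nexthops, group_devices) are made up for illustration, and the parser only assumes the line format shown in this issue; it is not part of FRR or iproute2.

```python
import re

def parse_nexthops(text):
    """Parse `ip nexthop ls` text into {id: {"group": [member ids], "dev": str or None}}."""
    table = {}
    for line in text.splitlines():
        m = re.match(r"\s*id (\d+)\s+(.*)", line)
        if not m:
            continue
        nh_id, rest = int(m.group(1)), m.group(2)
        group = re.search(r"\bgroup (\S+)", rest)
        dev = re.search(r"\bdev (\S+)", rest)
        members = []
        if group:
            # Members are separated by "/"; an optional ",weight" suffix is dropped.
            members = [int(part.split(",")[0]) for part in group.group(1).split("/")]
        table[nh_id] = {"group": members, "dev": dev.group(1) if dev else None}
    return table

def group_devices(table, group_id):
    """Return the devices of a group's members (empty list if the group is absent)."""
    entry = table.get(group_id)
    if entry is None:
        return []
    return [table[m]["dev"] for m in entry["group"] if m in table]

# Sample copied from the healthy output in this issue.
SAMPLE = """\
id 28 dev lo scope link proto zebra
id 29 dev br-f59be21aad6b scope link proto zebra
id 33 dev enp59s0 scope link proto zebra
id 34 group 35/36 proto zebra
id 35 via fe80::c2d6:82ff:fef4:3cc4 dev enp59s0 scope link proto zebra
id 36 via fe80::c2d6:82ff:fef4:428c dev enp59s0d1 scope link proto zebra
"""

if __name__ == "__main__":
    table = parse_nexthops(SAMPLE)
    print(group_devices(table, 34))  # ['enp59s0', 'enp59s0d1']
```

If the group backing the default route lists only one device, or the group ID is missing entirely, the kernel state no longer matches what FRR shows.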

BUT when we shut down or recable interface eno1, it seems that FRR cannot push the same (default) route to the kernel again: the kernel loses the default route and the server becomes unreachable for production traffic.

Important point: if we do not use multipath (only one interface), the route is correctly reinstalled in the kernel after the flap. THE PROBLEM OCCURS ONLY WHEN MULTIPATH IS USED.

"service frr restart" also helps: after a restart, FRR correctly installs the ECMP default route in the kernel:


root@front3:/home/alisenkov# ip ro
10.177.1.0/24 dev eno1 proto kernel scope link src 10.177.1.14 metric 100
10.177.1.254 dev eno1 proto dhcp scope link src 10.177.1.14 metric 100
root@front3:/home/alisenkov# service frr restart
root@front3:/home/alisenkov# ip ro
default nhid 113 proto bgp metric 20
    nexthop via inet6 fe80::c2d6:82ff:fef4:3cc4 dev enp59s0 weight 1
    nexthop via inet6 fe80::c2d6:82ff:fef4:428c dev enp59s0d1 weight 1
10.177.1.0/24 dev eno1 proto kernel scope link src 10.177.1.14 metric 100
10.177.1.254 dev eno1 proto dhcp scope link src 10.177.1.14 metric 100
root@front3:/home/alisenkov#
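
The manual check in the transcript above (is the ECMP default route present in "ip ro"?) could be automated along these lines. This is only a sketch: the function name default_route_nexthops is hypothetical, the parser assumes the text layout of "ip ro" exactly as shown in this issue, and if several default routes exist it simply merges their nexthops.

```python
import re

# Matches "via <gw> dev <ifname>", with the optional "inet6" family keyword.
VIA_DEV = re.compile(r"via (?:inet6 )?(\S+) dev (\S+)")

def default_route_nexthops(ip_route_text):
    """Return [(gateway, dev), ...] for the default route in `ip route` text output."""
    hops = []
    in_default = False
    for line in ip_route_text.splitlines():
        if line.startswith("default"):
            in_default = True
            # Single-path form: gateway and device sit on the "default" line itself.
            m = VIA_DEV.search(line)
            if m:
                hops.append((m.group(1), m.group(2)))
        elif in_default and line.strip().startswith("nexthop"):
            # Multipath form: one indented "nexthop" line per path.
            m = VIA_DEV.search(line)
            if m:
                hops.append((m.group(1), m.group(2)))
        else:
            in_default = False
    return hops

# Samples copied from the outputs in this issue.
HEALTHY = """\
default nhid 22 proto bgp metric 20
    nexthop via inet6 fe80::c2d6:82ff:fef4:3cc4 dev enp59s0 weight 1
    nexthop via inet6 fe80::c2d6:82ff:fef4:428c dev enp59s0d1 weight 1
"""

BROKEN = """\
10.177.1.0/24 dev eno1 proto kernel scope link src 10.177.1.14 metric 100
10.177.1.254 dev eno1 proto dhcp scope link src 10.177.1.14 metric 100
"""

if __name__ == "__main__":
    print(len(default_route_nexthops(HEALTHY)))  # 2
    print(len(default_route_nexthops(BROKEN)))   # 0
```

A monitoring job could feed the real "ip ro" output into this function and alert when fewer than two nexthops come back.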

logs:

zebra logs:
2023/08/08 13:07:07 ZEBRA: [Z9NFT-4F8FQ] Intf eno1(2) has gone DOWN
2023/08/08 13:07:07 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7f0448027130, op INTF_ADDR_DEL, ifindex (2), result QUEUED
2023/08/08 13:07:07 ZEBRA: [MZPZA-W042K] zebra_if_addr_update_ctx: INTF_ADDR_DEL: ifindex eno1(2), addr fe80::4ed9:8fff:fe41:cc34/64
2023/08/08 13:07:07 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEXTHOP(105), len=60, seq=15117091, pid=1
2023/08/08 13:07:07 ZEBRA: [KBCV5-6W9G6] RTM_DELNEXTHOP ID (16)  NS 0
2023/08/08 13:07:07 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (16[17/18]) that we are still using for a route, sending it back down
2023/08/08 13:07:07 ZEBRA: [VNMVB-91G3G] _netlink_nexthop_build_group: ID (16): group 17/18
2023/08/08 13:07:07 ZEBRA: [R43C6-KYHWT] netlink_nexthop_msg_encode: RTM_NEWNEXTHOP, id=16

journalctl:
Aug 08 12:34:39 front3 frrinit.sh[5715]: [5715|zebra] sending configuration
Aug 08 12:34:39 front3 frrinit.sh[5715]: [5715|zebra] done
Aug 08 12:34:39 front3 watchfrr[5682]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Aug 08 13:00:14 front3 zebra[5697]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
Aug 08 13:00:14 front3 zebra[5697]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=16, pid=2335995380
Aug 08 13:00:14 front3 zebra[5697]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (21[if 2 vrfid 0]) into the kernel
Aug 08 13:00:14 front3 zebra[5697]: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (22[23/24]) that we are still using for a route, sending it back down
Aug 08 13:00:27 front3 zebra[5697]: [N5M5Y-J5BPG][EC 4043309121] Client 'static' (session id 0) encountered an error and is shutting down.
Aug 08 13:00:27 front3 zebra[5697]: [KQB7H-NPVW9] zebra/zebra_ptm.c:1333 failed to find process pid registration
Aug 08 13:00:27 front3 zebra[5697]: [KQB7H-NPVW9] zebra/zebra_ptm.c:1333 failed to find process pid registration

Aug 08 13:00:14 front3 zebra[5697]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (21[if 2 vrfid 0]) into the kernel

2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (25[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (25[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (33[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (38[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (37[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (25[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (33[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (38[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:12:45 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (37[if 2 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 12:15:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (36[37/38]) that we are still using for a route, sending it back down
2023/08/08 13:00:14 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (22[23/24]) that we are still using for a route, sending it back down
2023/08/08 13:07:07 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (16[17/18]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (51[fe80::c2d6:82ff:fef4:428c if 7 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (51[fe80::c2d6:82ff:fef4:428c if 7 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (50[fe80::c2d6:82ff:fef4:3cc4 if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (48[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (50[fe80::c2d6:82ff:fef4:3cc4 if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (48[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (43[if 1 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (51[fe80::c2d6:82ff:fef4:428c if 7 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:03 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (50[fe80::c2d6:82ff:fef4:3cc4 if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:03 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:03 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (48[if 6 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:03 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
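
Since the same message repeats many times, a short Python sketch like the one below can condense such a log into counts per action and nexthop ID. The regular expression is derived only from the "Kernel deleted/updated a nexthop group with ID (...)" lines quoted in this issue, and the function name summarize is an illustrative choice, not an FRR tool.

```python
import re
from collections import Counter

# Matches the zebra message quoted in this issue; \s* tolerates a line break
# between "ID" and the opening parenthesis.
PATTERN = re.compile(r"Kernel (deleted|updated) a nexthop group with ID\s*\((\d+)\[")

def summarize(log_text):
    """Count (action, nexthop id) pairs found in zebra log text."""
    return Counter((m.group(1), int(m.group(2))) for m in PATTERN.finditer(log_text))

# A few lines copied from the log in this issue.
SAMPLE_LOG = """\
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (51[fe80::c2d6:82ff:fef4:428c if 7 vrfid 0]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel updated a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
2023/08/08 13:50:02 ZEBRA: [RG2NH-FTSDH][EC 4043309102] Kernel deleted a nexthop group with ID (49[50/51]) that we are still using for a route, sending it back down
"""

if __name__ == "__main__":
    for (action, nh_id), count in summarize(SAMPLE_LOG).most_common():
        print(f"{action} id={nh_id}: {count}")
```

Running it over the full log file configured above (/var/log/frr/frr-debug.log) would show which nexthop IDs the kernel keeps removing.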

For some reason FRR is unable to install the route into the kernel. Please help me understand why.

ton31337 commented 1 year ago

> BUT when we shut down or recable interface eno1, it seems that FRR cannot push the same (default) route to the kernel again: the kernel loses the default route and the server becomes unreachable for production traffic.

Do you mean that when the OOB link is disabled, the default routes disappear from the kernel and are not reinstalled until you restart FRR?

alisenkov commented 1 year ago

> Do you mean that when the OOB link is disabled, the default routes disappear from the kernel and are not reinstalled until you restart FRR?

Yes, the default ECMP route learned from BGP with IPv6 nexthops is reinstalled only when FRR is restarted.

alisenkov commented 1 year ago

We found an interesting report: https://www.spinics.net/lists/netdev/msg863121.html

> > > > > when an IPv4 route gets removed because its nexthop was deleted, the
> > > > > kernel does not send a RTM_DELROUTE netlink notifications anymore in
> > > > > 6.1. A bisect lead me to 61b91eb33a69 ("ipv4: Handle attempt to delete
> > > > > multipath route when fib_info contains an nh reference"), and
> > > > > reverting it makes it work again.

ton31337 commented 1 year ago

Can you show the full "ip route" output, and also "show ip route" and "show ip bgp"? I don't see any interfaces or routes related to the eno1 interface.

alisenkov commented 1 year ago

> Can you show the full "ip route" output, and also "show ip route" and "show ip bgp"? I don't see any interfaces or routes related to the eno1 interface.


s1# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 0.0.0.0/0 [20/0] via fe80::c2d6:82ff:fef4:3cc4, ens1f1np1, weight 1, 01w5d10h

Displayed 4 routes and 5 total paths
s1# exit
root@s1:/home/alisenkov# ip ro
default nhid 68 via inet6 fe80::c2d6:82ff:fef4:3cc4 dev ens1f1np1 proto bgp metric 20    <<<<< it is also strange: FRR has 2 ECMP routes to default, but the kernel has only one
10.4.0.0/16 via 10.177.1.254 dev eno1 proto dhcp src 10.177.1.31 metric 100
10.177.1.0/24 dev eno1 proto kernel scope link src 10.177.1.31
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
198.18.0.0/15 via 10.177.1.254 dev eno1 proto dhcp src 10.177.1.31 metric 100
root@s1:/home/alisenkov#

root@s1:/home/alisenkov# ip nexthop
id 7 dev lo scope host proto zebra
id 10 dev ens1f0np0 scope link proto zebra
id 27 dev eno1 scope host proto zebra
id 28 via 10.177.1.254 dev eno1 scope link proto zebra
id 35 via fe80::c2d6:82ff:fef4:3cc4 dev ens1f1np1 scope link proto zebra
id 51 via fe80::c2d6:82ff:fef4:428c dev ens1f0np0 scope link proto zebra
id 68 group 35 proto zebra
root@s1:/home/alisenkov#

osfrickler commented 11 months ago

I'm seeing the same issue. To me it looks like zebra is reinstalling the nexthop groups, but the routes that were using the nhgs are not getting restored.

Some more possibly interesting notes:

akunszt commented 11 months ago

It is a systemd-networkd issue in the first place. It removes the nexthop groups unconditionally. I don't know which version introduced this bug, but it is present in v252 and (very likely) newer versions. Please check: https://github.com/systemd/systemd/issues/29034

huangfeilong1 commented 9 months ago

May have fixed it: https://github.com/FRRouting/frr/pull/14080

sysoleg commented 9 months ago

@alisenkov The problem seems to be related to systemd-networkd. What is the normal output of your networkctl? Please, check how your case relates to this scenario: https://github.com/systemd/systemd/issues/29034#issuecomment-1834155593

sysoleg commented 9 months ago

@huangfeilong1 Seems completely unrelated.

modzilla99 commented 9 months ago

> May have fixed it: #14080

Thanks, this fixed it for us (FRR 8.5 on Ubuntu 20.04)!

Interestingly, we don't face that problem with the system packages of systemd and frr on 22.04.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.