FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

Point-to-Multipoint RFC5549 ipv6 unnumbered peering oscillation with overwritten 169.254.0.1 APIPA ARP causes traffic disruption #8668

jpsenior opened this issue 3 years ago (status: Open)

jpsenior commented 3 years ago

Running Cumulus Linux 4.2.1, with FRR 7.4+cl4.2.1u1 on kernel version 4.19.0-cl-1-amd64.

Also observed on FRR 7.5.1, on Ubuntu 16.04.4 LTS w/ Kernel version 4.4.0-116-generic

In this setup, IPv6 BGP unnumbered is present between Leaf1 and a server (3) in the diagram.

VLAN 300 is carried across an MLAG peer link, providing L2 adjacency.

(topology diagram omitted from this transcript)

FRR configuration (I've trimmed irrelevant pieces such as the EVPN mappings and other BGP peers and VRFs; the issue is specific to unnumbered peering):

vrf blue
 vni 20002
 exit-vrf
!
interface bridge.300 vrf blue
 ipv6 nd ra-interval 30
 no ipv6 nd suppress-ra
!
router bgp 64514 vrf blue
 bgp router-id 10.0.0.5
 neighbor l3rtr peer-group
 neighbor l3rtr advertisement-interval 0
 neighbor l3rtr timers connect 5
 neighbor bridge.300 interface peer-group l3rtr
 neighbor bridge.300 remote-as 64517
 neighbor bridge.300 description facing_switch1-server1
 neighbor bridge.300 bfd
 !
 address-family ipv4 unicast
  neighbor l3rtr next-hop-self
  ! note: the peer is implicitly activated in this address-family because `bgp default ipv4-unicast` is enabled
  no neighbor l3rtr send-community extended
 exit-address-family
 !
 address-family l2vpn evpn
  advertise ipv4 unicast
  advertise ipv6 unicast
  rd 10.0.0.2:201
  route-target import 20002:1
  route-target export 20002:1
 exit-address-family
!

When FRR receives an IPv6 RA and BGP comes up, zebra installs a static ARP entry for the APIPA address 169.254.0.1 in the kernel once there are routes to forward. This entry carries the next-hop information for IPv4 routes across the IPv6-only link, which is expected (RFC 5549).

However, when multiple devices with unique IPv6 link-local addresses come online on the same segment, zebra immediately overwrites this single 169.254.0.1 static ARP entry.

The expected behavior would be to install a unique address from 169.254.0.0/24 for each unique IPv6 link-local next-hop, rather than reusing 169.254.0.1.
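Concretely, the kernel state zebra programs for the RFC 5549 next-hop is equivalent to a permanent neighbor entry like the following (illustrative only, using the MAC and interface from this report; requires root):

```shell
# The well-known IPv4 link-local next-hop is mapped to the peer's MAC
# (learned from its IPv6 link-local address / RA) as a permanent entry.
ip neigh replace 169.254.0.1 lladdr 52:54:00:6d:c0:b8 dev bridge.300 nud permanent
```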

We can observe this behavior with the debug commands `debug zebra kernel` and `debug zebra updates`, which exhibit log entries oscillating back and forth as follows:

2021/05/13 19:19:54.324661 ZEBRA: MESSAGE: ZEBRA_INTERFACE_NBR_ADDRESS_ADD fe80::5054:ff:fe6d:c0b8/128 on bridge.300
2021/05/13 19:19:54.325078 ZEBRA: netlink_talk: netlink-cmd (NS 0) type RTM_DELNEIGH(29), len=56 seq=98552 flags 0x401
2021/05/13 19:19:54.325399 ZEBRA: netlink_talk: netlink-cmd (NS 0) type RTM_NEWNEIGH(28), len=56 seq=98553 flags 0x401
2021/05/13 19:19:54.325479 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_NEWNEIGH(28), len=88, seq=0, pid=0
2021/05/13 19:19:54.325493 ZEBRA:       Neighbor Entry received is not on a VLAN or a BRIDGE, ignoring
2021/05/13 19:19:54.325503 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=72, seq=0, pid=0
2021/05/13 19:19:54.325513 ZEBRA: Rx RTM_DELNEIGH family ipv4 IF bridge.300(96) vrf blue(12) IP 169.254.0.1

2021/05/13 19:19:54.336829 ZEBRA: MESSAGE: ZEBRA_INTERFACE_NBR_ADDRESS_ADD fe80::5054:ff:fe99:4674/128 on bridge.300
2021/05/13 19:19:54.336939 ZEBRA: netlink_talk: netlink-cmd (NS 0) type RTM_DELNEIGH(29), len=56 seq=98558 flags 0x401
2021/05/13 19:19:54.336984 ZEBRA: netlink_talk: netlink-cmd (NS 0) type RTM_NEWNEIGH(28), len=56 seq=98559 flags 0x401
2021/05/13 19:19:54.337014 ZEBRA: netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=72, seq=0, pid=0
2021/05/13 19:19:54.337025 ZEBRA: Rx RTM_DELNEIGH family ipv4 IF bridge.300(96) vrf blue(12) IP 169.254.0.1

We can also watch the ARP entries oscillate back and forth in the kernel with `ip monitor neigh dev bridge.300`:

root@leaf1:mgmt:~# ip monitor neigh dev bridge.300
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:6d:c0:b8 PERMANENT
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:99:46:74 PERMANENT
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:6d:c0:b8 PERMANENT
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:99:46:74 PERMANENT
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:6d:c0:b8 PERMANENT
169.254.0.1  FAILED proto zebra
Deleted 169.254.0.1  FAILED proto zebra
169.254.0.1 lladdr 52:54:00:99:46:74 PERMANENT

This causes traffic disruption for these routes:

admin@leaf3:mgmt:~$ ping -I blue 1.1.1.1
vrf-wrapper.sh: switching to vrf "default"; use '--no-vrf-switch' to disable
PING 1.1.1.1 (1.1.1.1) from 10.0.0.7 blue: 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=63 time=1.22 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=63 time=0.559 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=63 time=0.625 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=63 time=0.614 ms
64 bytes from 1.1.1.1: icmp_seq=5 ttl=63 time=0.763 ms
64 bytes from 1.1.1.1: icmp_seq=6 ttl=63 time=0.583 ms
64 bytes from 1.1.1.1: icmp_seq=7 ttl=63 time=0.585 ms
From 10.0.0.5 icmp_seq=8 Time to live exceeded
64 bytes from 1.1.1.1: icmp_seq=15 ttl=63 time=0.603 ms
64 bytes from 1.1.1.1: icmp_seq=16 ttl=63 time=0.542 ms
64 bytes from 1.1.1.1: icmp_seq=17 ttl=63 time=0.577 ms
64 bytes from 1.1.1.1: icmp_seq=18 ttl=63 time=0.592 ms
64 bytes from 1.1.1.1: icmp_seq=19 ttl=63 time=0.567 ms
64 bytes from 1.1.1.1: icmp_seq=20 ttl=63 time=8.50 ms
64 bytes from 1.1.1.1: icmp_seq=21 ttl=63 time=0.554 ms
64 bytes from 1.1.1.1: icmp_seq=22 ttl=63 time=0.577 ms
From 10.0.0.5 icmp_seq=23 Time to live exceeded
From 10.0.0.5 icmp_seq=24 Time to live exceeded
From 10.0.0.5 icmp_seq=25 Time to live exceeded
From 10.0.0.5 icmp_seq=26 Time to live exceeded
From 10.0.0.5 icmp_seq=27 Time to live exceeded
64 bytes from 1.1.1.1: icmp_seq=29 ttl=63 time=0.576 ms
64 bytes from 1.1.1.1: icmp_seq=30 ttl=63 time=0.634 ms

Shutting down bridge.300 on leaf2 here prevents the duplicate IPv6 RAs, and the network immediately stabilizes, since the next-hop ARP entry is no longer constantly being overwritten. My understanding of RFC 5549 is that the behavior should just be 'pick a next-hop'.

I think a more correct behavior here would be to allocate from 169.254.0.0/24 and install a separate APIPA next-hop ARP entry for each unique IPv6 LL next-hop, e.g. 169.254.0.1 and 169.254.0.2, instead of immediately overwriting what is there.

jpsenior commented 3 years ago

I could imagine a fix in rt_netlink.c around netlink_handle_5549: keep an index for each v6_2_v4_ll_neigh_entry in use, and pass an offset through zebra/interface.c's if_nbr_ipv6ll_to_ipv4ll_neigh_update into if_nbr_mac_to_ipv4ll_neigh_update. Currently the IP address is hard-coded as char buf[16] = "169.254.0.1";. Starting at 169.254.0.1 and applying an offset integer would allow multiple, non-overlapping APIPA addresses.

The bug basically boils down to exactly here, where the address is hard-coded: https://github.com/FRRouting/frr/blob/3d4b999fab50a4f08d2c4257ec059218a90ed29f/zebra/interface.c#L889

taspelund commented 3 years ago

The current implementation has been focused around p2p links, with the assumption that there will only ever be RAs received from one other router, so I can't say I'm surprised that this is the behavior you're encountering.

To date, I don't believe there has been any code changes or plans made to accommodate unnumbered peering with multiple devices on the same network segment (maybe @donaldsharp knows of something I don't?).

I imagine there would be a little more work required than just using sequential link-local v4 addresses. The main thing that sticks out to me would be having proper handling for multiple dynamic peers created from a single neighbor statement (similar to dynamic listen range tied to a peer-group). Not sure how difficult that would be, but it would need the proper consideration to avoid causing issues in situations that assume an interface-based peer is singular.

jpsenior commented 3 years ago

@taspelund Yes, I can understand this is almost a differentiation between a feature request and a bug :) I agree that there could be more to adding support for this feature than just deterministic next-hop apipa. We're experimenting with other NOS vendors on this behavior as well (Cisco, Juniper, Arista).

donaldsharp commented 3 years ago

@ddutt and I talked about this a few years back. I think we worked out a way to approach the issue, but we never got around to implementing it. I would need to wrap my head around this problem space again before I could comment, though.

network8472 commented 3 years ago

I just opened issue #9465 and then saw this one. @taspelund has the right idea about bgp listen IFNAME and neighbor IFNAME... being equivalent in this case. FWIW, let me add my vote here for this "feature". :)

jpsenior commented 3 years ago

For reference, some other vendor behavior:

- Juniper Junos 21.1R2 (new feature using peer-auto-discovery): picks the first IPv6 ND neighbor (does not consult the RA bit) for the BGP session; ignores all subsequent RAs.
- Arista EOS (tested as of 4.24M): picks the first IPv6 RA neighbor; the BGP FSM rejects subsequent neighbors.
- Cisco NX-OS (as of 9.3.7, which adds support for interface-based BGP peers): picks the first IPv6 RA neighbor; the BGP FSM rejects subsequent neighbors.

FRR, in comparison, immediately reprograms subsequent IPv6 RA neighbors.

Multiple IPv6 unnumbered neighbors do not seem to be supported on the other major NOS vendors either.

network8472 commented 3 years ago

> For reference, some other vendor behavior: [...] In comparison to FRR, which will immediately reprogram subsequent IPv6 RA Neighbors.

What do you mean by FRR "immediately reprogramming subsequent IPv6 RA neighbors"? I thought FRR also only considers the first RA to do the BGP peering on a network/interface and ignores subsequent RAs.

jpsenior commented 3 years ago

@network8472 this behavior (immediately reprogramming subsequent neighbors) is the entirety of this bug report; observe the APIPA next-hop and ARP rewrite behavior above.

network8472 commented 3 years ago

> @network8472 this behavior (immediately reprogramming subsequent neighbors) is the entirety of this bug report; observe the APIPA next-hop and ARP rewrite behavior above.

Understood. I misinterpreted your statement as applying to BGP neighbors instead of the ARP rewrite behavior. However, I'm not sure if this rewrite behavior is problematic from the perspective of multiple BGP neighbors as it's the interface associated with the MAC address that ultimately is used to send the packet, right?

jpsenior commented 3 years ago

If FRR is configured with interface-based link-local BGP peering (no unicast /30 or /32 IPv4 addresses), FRR uses the IPv6 neighbor cache to do dynamic peer discovery. The first IPv6 RA that FRR receives immediately establishes a BGP session. If a second IPv6 RA is learned, the first session is torn down, and the sessions then flap up and down on each subsequent receipt of IPv6 router advertisements from the neighbors on that broadcast domain.

The interface-based BGP peering feature is really only meant for point-to-point BGP sessions right now, where you can guarantee that only one device will send you an IPv6 router advertisement for peer discovery. The presence of more than one neighbor causes traffic and BGP disruption.

That is to say, you can't do 'interface bridge.300' or 'interface eth4' when there is more than one IPv6 RA speaker on those links. Dynamic peer discovery doesn't work for multiple peers on the same interface due to this bug.
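In other words, the shape this feature supports today is one interface-based neighbor per point-to-point link, along the lines of the following sketch (interface names here are hypothetical):

```
router bgp 64514 vrf blue
 neighbor l3rtr peer-group
 neighbor swp1 interface peer-group l3rtr
 neighbor swp1 remote-as 64517
 neighbor swp2 interface peer-group l3rtr
 neighbor swp2 remote-as 64518
```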

network8472 commented 3 years ago

> If FRR is configured with interface-based link local BGP peering [...] Dynamic peer discovery doesn't work for multiple peers on the same interface due to this bug.

Thanks for the explanation. My experience in the lab is slightly different: subsequent RAs and BGP Open messages are simply ignored and don't disrupt the first established session. So going forward, I guess what I'd like to know is whether there's a good reason for not permitting multiple BGP peerings on the same interface. My use case, as I explained in issue #9465, is basically the "bgp listen" use case, except with LLAs.
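For comparison, the existing dynamic-peer mechanism referenced above is FRR's `bgp listen range`, which for numbered addresses looks roughly like this (the prefix here is a placeholder; whether a link-local equivalent could reuse this machinery is exactly the open question):

```
router bgp 64514
 neighbor dynpeers peer-group
 neighbor dynpeers remote-as 64517
 bgp listen range 2001:db8:1::/64 peer-group dynpeers
```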

JMF-Networks commented 1 year ago

Is this still the case? Is there a workaround?