Open jpsenior opened 3 years ago
I could imagine a fix in rt_netlink.c, in netlink_handle_5549: keep an index for each v6_2_v4_ll_neigh_entry in use and pass it as an offset to if_nbr_ipv6ll_to_ipv4ll_neigh_update in zebra/interface.c, which passes it on to if_nbr_mac_to_ipv4ll_neigh_update.
- Currently the IP address is hard-coded there: char buf[16] = "169.254.0.1";
- With an offset, allocation would still start at 169.254.0.1, and each further offset value would yield an additional, non-overlapping APIPA address.
The bug basically boils down to this line, where the address is hard-coded: https://github.com/FRRouting/frr/blob/3d4b999fab50a4f08d2c4257ec059218a90ed29f/zebra/interface.c#L889
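To make the offset idea concrete, here is a minimal sketch of deriving the address from an integer offset instead of a string literal. The function name `apipa_from_offset` is hypothetical and is not part of FRR. Staying within 169.254.0.0/24, with offset 0 mapping to the current 169.254.0.1, is an assumption taken from the proposal in this thread.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of FRR: format the IPv4 link-local
 * address for a given neighbor offset. Offset 0 yields 169.254.0.1,
 * which matches the value that is hard-coded today. */
int apipa_from_offset(unsigned int offset, char *buf, size_t len)
{
	/* Host octets 1..254 are usable, so valid offsets are 0..253. */
	if (offset > 253)
		return -1;

	int n = snprintf(buf, len, "169.254.0.%u", offset + 1);

	/* Treat an encoding error or truncation as failure. */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

The existing `char buf[16]` is large enough for any address this produces, so a caller could keep that buffer and only replace the literal with a call such as `apipa_from_offset(offset, buf, sizeof(buf))`.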
The current implementation has been focused around p2p links, with the assumption that there will only ever be RAs received from one other router, so I can't say I'm surprised that this is the behavior you're encountering.
To date, I don't believe there have been any code changes or plans made to accommodate unnumbered peering with multiple devices on the same network segment (maybe @donaldsharp knows of something I don't?).
I imagine there would be a little more work required than just using sequential link-local v4 addresses. The main thing that sticks out to me would be having proper handling for multiple dynamic peers created from a single neighbor statement (similar to dynamic listen range tied to a peer-group). Not sure how difficult that would be, but it would need the proper consideration to avoid causing issues in situations that assume an interface-based peer is singular.
@taspelund Yes, I can understand that this sits somewhere between a feature request and a bug :) I agree that there could be more to adding support for this feature than just deterministic next-hop APIPA. We're experimenting with other NOS vendors on this behavior as well (Cisco, Juniper, Arista).
@ddutt and I talked about this a few years back. I think we worked out a way to approach the issue, but we have never gotten around to implementing it. I would need to wrap my head around this problem space again before I could comment, though.
I just opened issue #9465 and then saw this one. @taspelund has the right idea about "bgp listen IFNAME" and "neighbor IFNAME..." being equivalent in this case. FWIW, let me add my vote here for this "feature". :)
For reference, some other vendor behavior:
- Juniper Junos 21.1R2 (new feature using peer-auto-discovery): Picks first IPv6 ND neighbor (does not consult ra-bit) for BGP session. Ignores all subsequent RAs.
- Arista EOS (tested as of 4.24M): Picks first IPv6 RA neighbor, BGP FSM state rejects subsequent neighbors.
- Cisco NXOS (as of NXOS 9.3.7, which adds new support for interface-based BGP peers): Picks first IPv6 RA neighbor, BGP FSM state rejects subsequent neighbors.
By comparison, FRR immediately reprograms subsequent IPv6 RA neighbors.
Multiple IPv6 unnumbered neighbors do not seem to be supported by the other major NOS vendors either.
What do you mean by FRR "immediately reprogramming subsequent IPv6 RA Neighbors"? I thought FRR also only considers the first RA to do the BGP peering on a network/interface and ignores subsequent RAs.
@network8472 This behavior (immediately reprogramming subsequent neighbors) is the entirety of this bug report; observe the APIPA next-hop and ARP rewrite behavior.
Understood. I misinterpreted your statement as applying to BGP neighbors instead of the ARP rewrite behavior. However, I'm not sure if this rewrite behavior is problematic from the perspective of multiple BGP neighbors as it's the interface associated with the MAC address that ultimately is used to send the packet, right?
If FRR is configured with interface-based link-local BGP peering (no unicast /30 or /32 IPv4 addresses), it uses the IPv6 neighbor cache for dynamic peer discovery. The first IPv6 RA that FRR receives immediately establishes a BGP session. When an RA from a second neighbor is learned, the first session is torn down; it then comes back up and goes down again on each subsequent IPv6 router advertisement from each neighbor on that broadcast domain.
The interface-based BGP peering feature is really only meant for point-to-point BGP sessions right now, where you can guarantee that only one device sends you an IPv6 router advertisement for peer discovery. The presence of more than one neighbor causes traffic and BGP disruption.
That is to say, you can't do 'interface bridge.300' or 'interface eth4' when there is more than one IPv6-RA speaker on those links. Dynamic peer discovery doesn't work for multiple peers on the same interface due to this bug.
Thanks for the explanation. My experience in the lab is slightly different: subsequent RAs and BGP Open messages are simply ignored and don't disrupt the first established session. So going forward, I guess what I'd like to know is whether there's a good reason for not permitting multiple BGP peerings on the same interface. My use case, as I explained in issue #9465, is basically the "bgp listen" use case, except with LLA.
Is this still the case? Is there a workaround?
Running Cumulus Linux 4.2.1, with FRR 7.4+cl4.2.1u1 on kernel version 4.19.0-cl-1-amd64.
Also observed on FRR 7.5.1, on Ubuntu 16.04.4 LTS w/ Kernel version 4.4.0-116-generic
In this setup, IPv6 BGP unnumbered is present between Leaf1 and a server (3) in the diagram.
VLAN 300 is carried across an MLAG peer-link providing L2 adjacency.
FRR configuration - I've removed extraneous, irrelevant parts such as EVPN mappings and other BGP peers or VRFs; the issue is specific to unnumbered peering.
When FRR receives multiple IPv6 RA and BGP comes up, Zebra installs an APIPA 169.254.0.1 static ARP entry in the kernel when there are routes to forward, which is used to carry next-hop information for the ipv4 routes across the IPv6 link - This is expected.
However, when multiple devices with unique IPv6 link local addresses come online, Zebra overwrites this 169.254.0.1 static ARP entry immediately.
Expected behavior would be to install a unique 169.254.0.0/24 address for each unique IPv6 link local next-hop, not re-use the same 169.254.0.1.
We can observe this behavior with the following debugs:
debug zebra kernel
debug zebra updates
Which exhibits log entries oscillating back and forth as follows:
We can watch the ARP responses also oscillate back and forth in the kernel with
ip monitor neigh dev bridge.300
This causes traffic disruption for these routes:
Shutting down bridge.300 on leaf2 here prevents the duplicate IPv6 RA, and the network immediately stabilizes since the ARP entry for the next-hop is no longer being constantly overwritten. In RFC 5549, my understanding is that the behavior should just be 'pick a next-hop'.
I think a more correct behavior here would be to start at 169.254.0.0/24 and dispatch a separate APIPA next-hop ARP entry for each unique IPv6 LL next-hop, e.g. 169.254.0.1 and 169.254.0.2, instead of immediately overwriting what is there.
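One way to picture the proposed behavior is a small per-interface table that hands each distinct IPv6 link-local neighbor its own host octet in 169.254.0.0/24 and returns the same octet when that neighbor is seen again. The sketch below is only an illustration: `struct ll_slot_table`, `ll_slot_get` and `ll_slot_release` are hypothetical names, not FRR structures or functions, and the linear scan over 254 slots is an assumption made for simplicity.

```c
#include <stdint.h>
#include <string.h>

#define LL_SLOTS 254 /* host octets 1..254 in 169.254.0.0/24 */

/* Hypothetical per-interface table, not an FRR structure.
 * Zero-initialize it before first use. */
struct ll_slot_table {
	uint8_t addr[LL_SLOTS][16]; /* IPv6 link-local address per slot */
	uint8_t used[LL_SLOTS];     /* 1 if the slot is allocated */
};

/* Return the host octet (1..254) for this neighbor, allocating the
 * first free slot the first time the neighbor is seen.
 * Returns 0 when the table is full. */
unsigned int ll_slot_get(struct ll_slot_table *t, const uint8_t v6[16])
{
	int free_idx = -1;

	for (int i = 0; i < LL_SLOTS; i++) {
		if (t->used[i]) {
			if (memcmp(t->addr[i], v6, 16) == 0)
				return (unsigned int)i + 1; /* known neighbor */
		} else if (free_idx < 0) {
			free_idx = i; /* remember the first free slot */
		}
	}
	if (free_idx < 0)
		return 0;

	memcpy(t->addr[free_idx], v6, 16);
	t->used[free_idx] = 1;
	return (unsigned int)free_idx + 1;
}

/* Free the slot when the neighbor goes away so it can be reused. */
void ll_slot_release(struct ll_slot_table *t, const uint8_t v6[16])
{
	for (int i = 0; i < LL_SLOTS; i++) {
		if (t->used[i] && memcmp(t->addr[i], v6, 16) == 0) {
			t->used[i] = 0;
			return;
		}
	}
}
```

With a table like this, the first neighbor on bridge.300 would keep 169.254.0.1 and a second neighbor would get 169.254.0.2, so a repeated RA from either one would map to the ARP entry it already owns rather than replacing the other's. As noted elsewhere in the thread, BGP would still need to handle multiple peers on one interface for this to be sufficient.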