FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

MH EBGP session between directly connected eBGP peers using loopback addresses comes up without enabling "disable-connected-check" CLI #15784

Open samanvithab opened 4 months ago

samanvithab commented 4 months ago

Description

Issue: The eBGP session gets established even when neither ‘ebgp-multihop’ nor ‘disable-connected-check’ is configured on the DUT. We first observed this issue while evaluating the ‘ebgp-multihop’ and ‘disable-connected-check’ CLI knobs with dynamic peers.

Version

frr(config-router-af)# do show version 
FRRouting 10.1-dev-3049-g692f916b8 (frr) on Linux(4.19.86-041986-lowlatency).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
This is a git build of frr-9.0-dev-1332-g692f916b8
Associated branch(es):
        local:master
        github/samanvithab/frr/master

configured with:
    '--prefix=/usr' '--enable-exampledir=/usr/share/doc/frr/examples/' '--localstatedir=/var/run/frr' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc/frr' '--enable-pimd' '--enable-watchfrr' '--enable-ospfclient=yes' '--enable-ospfapi=yes' '--enable-multipath=64' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' '--enable-rtadv' '--enable-fpm' '--enable-systemd=yes' '--enable-dev-build' '--with-pkg-git-version' '--with-pkg-extra-version=-3049'
frr(config-router-af)#

How to reproduce

Topology details:

——— R1 (DUT) ———— R2 ———

The eBGP session is established over loopback IP addresses between R1 and R2. Static routes are configured on both R1 and R2 to reach the peer's loopback IP via the connected interface address.

  1. Dynamic Peer:

DUT R1:
router bgp 200
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 neighbor PGNAME2 peer-group
 neighbor PGNAME2 remote-as 400
 neighbor PGNAME2 update-source 77.0.0.7
 bgp listen range 66.0.0.0/24 peer-group PGNAME2

R2:
router bgp 400
 neighbor 77.0.0.7 remote-as 200
 neighbor 77.0.0.7 disable-connected-check
 neighbor 77.0.0.7 update-source 66.0.0.6

frr(config-router-af)# do show bgp summary

IPv4 Unicast Summary:
BGP router identifier 77.0.0.7, local AS number 200 VRF default vrf-id 0
BGP table version 26
RIB entries 11, using 1408 bytes of memory
Peers 1, using 20 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
*66.0.0.6       4        400         5         4       26    0    0 00:00:36            4        2 N/A

Total number of neighbors 1

frr(config-router-af)# do show bgp neighbors
BGP neighbor is 66.0.0.6, remote AS 400, local AS 200, external link
  Local Role: undefined
  Remote Role: undefined
  Hostname: frr
  Member of peer-group PGNAME2 for session parameters
  Belongs to the subnet range group: 66.0.0.0/24
  BGP version 4, remote router ID 4.4.4.4, local router ID 77.0.0.7
  BGP state = Established, up for 00:00:10
  Last read 00:00:09, Last write 00:00:09
  Hold time is 180 seconds, keepalive interval is 60 seconds
  Configured hold time is 180 seconds, keepalive interval is 60 seconds
  Configured tcp-mss is 0, synced tcp-mss is 1448
  Configured conditional advertisements interval is 60 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    Extended Message: advertised
    AddPath:
      IPv4 Unicast: RX advertised and received
    Paths-Limit:
      IPv4 Unicast: advertised (0)
    Long-lived Graceful Restart: advertised
    Route refresh: advertised and received
    Enhanced Route Refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Hostname Capability: advertised (name: frr,domain name: n/a) received (name: frr,domain name: n/a)
    Version Capability: not advertised not received
    Graceful Restart Capability: advertised and received
      Remote Restart timer is 120 seconds
      Address families by peer:
        none
  Graceful restart information:
    End-of-RIB send: IPv4 Unicast
    End-of-RIB received: IPv4 Unicast
    Local GR Mode: Helper
    Remote GR Mode: Helper
    R bit: False
    N bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 120
      Configured LLGR Stale Path Time(sec): 0
    IPv4 Unicast:
      F bit: False
      End-of-RIB sent: Yes
      End-of-RIB sent after update: Yes
      End-of-RIB received: Yes
      Timers:
        Configured Stale Path Time(sec): 360
        LLGR Stale Path Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                           Sent       Rcvd
    Opens:                    1          1
    Notifications:            0          0
    Updates:                  2          3
    Keepalives:               1          1
    Route Refresh:            0          0
    Capability:               0          0
    Total:                    4          5
  Minimum time between advertisement runs is 0 seconds
  Update source is 77.0.0.7

 For address family: IPv4 Unicast
  PGNAME2 peer-group member
  Update group 6, subgroup 6
  Packet Queue length 0
  Community attribute sent to this neighbor(all)
  4 accepted prefixes

  Connections established 1; dropped 0
  Last reset 00:00:10, No path to specified Neighbor (n/a)
  External BGP neighbor may be up to 1 hops away.
Local host: 77.0.0.7, Local port: 179
Foreign host: 66.0.0.6, Foreign port: 34471
Nexthop: 77.0.0.7
Nexthop global: ::
Nexthop local: ::
BGP connection: non shared network
BGP Connect Retry Timer in Seconds: 120
Estimated round trip time: 0 ms
Read thread: on  Write thread: on  FD used: 30

===============================================================================
Same behaviour is seen with a static peer.

  2. Static peer:

DUT R1:
router bgp 200
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 neighbor 66.0.0.6 remote-as 400
 neighbor 66.0.0.6 update-source 77.0.0.7

R2:
router bgp 400
 neighbor 77.0.0.7 remote-as 200
 neighbor 77.0.0.7 disable-connected-check
 neighbor 77.0.0.7 update-source 66.0.0.6

frr(config-router)# do show bgp neighbors
BGP neighbor is 66.0.0.6, remote AS 400, local AS 200, external link
  Local Role: undefined
  Remote Role: undefined
  Hostname: frr
  BGP version 4, remote router ID 4.4.4.4, local router ID 77.0.0.7
  BGP state = Established, up for 00:04:59
  Last read 00:00:59, Last write 00:00:59
  Hold time is 180 seconds, keepalive interval is 60 seconds
  Configured hold time is 180 seconds, keepalive interval is 60 seconds
  Configured tcp-mss is 0, synced tcp-mss is 1448
  Configured conditional advertisements interval is 60 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    Extended Message: advertised
    AddPath:
      IPv4 Unicast: RX advertised and received
    Paths-Limit:
      IPv4 Unicast: advertised (0)
    Long-lived Graceful Restart: advertised
    Route refresh: advertised and received
    Enhanced Route Refresh: advertised
    Address Family IPv4 Unicast: advertised and received
    Hostname Capability: advertised (name: frr,domain name: n/a) received (name: frr,domain name: n/a)
    Version Capability: not advertised not received
    Graceful Restart Capability: advertised and received
      Remote Restart timer is 120 seconds
      Address families by peer:
        none
  Graceful restart information:
    End-of-RIB send: IPv4 Unicast
    End-of-RIB received: IPv4 Unicast
    Local GR Mode: Helper*
    Remote GR Mode: Helper
    R bit: False
    N bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 120
      Configured LLGR Stale Path Time(sec): 0
    IPv4 Unicast:
      F bit: False
      End-of-RIB sent: Yes
      End-of-RIB sent after update: Yes
      End-of-RIB received: Yes
      Timers:
        Configured Stale Path Time(sec): 360
        LLGR Stale Path Time(sec): 0
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                           Sent       Rcvd
    Opens:                    5          5
    Notifications:            2          4
    Updates:                 13         15
    Keepalives:            8146       8146
    Route Refresh:            0          0
    Capability:               0          0
    Total:                 8166       8170
  Minimum time between advertisement runs is 0 seconds
  Update source is 77.0.0.7

 For address family: IPv4 Unicast
  Update group 5, subgroup 5
  Packet Queue length 0
  Community attribute sent to this neighbor(all)
  4 accepted prefixes

  Connections established 5; dropped 4
  Last reset 00:05:01, No AFI/SAFI activated for peer (n/a)
  External BGP neighbor may be up to 1 hops away.
Local host: 77.0.0.7, Local port: 179
Foreign host: 66.0.0.6, Foreign port: 36745
Nexthop: 77.0.0.7
Nexthop global: ::
Nexthop local: ::
BGP connection: non shared network
BGP Connect Retry Timer in Seconds: 120
Estimated round trip time: 3 ms
Read thread: on  Write thread: on  FD used: 30

Expected behavior

The eBGP session should not be established when the peer address is not directly connected and neither 'disable-connected-check' nor 'ebgp-multihop' is configured.

Actual behavior

The multihop (MH) eBGP session gets established.

Additional context

Root cause: there doesn't seem to be a multihop/connected check when accepting a connection in bgp_accept. When the DUT initiates a connection via bgp_start, the check is performed.

When DUT initiates:
2024/04/17 21:52:29 BGP: [ZWCSR-M7FG9] 66.0.0.6 [FSM] BGP_Start (Idle->Connect), fd -1
2024/04/17 21:52:29 BGP: [S7AHN-X0695] 66.0.0.6 [FSM] Waiting for NHT, no path to neighbor present

When DUT accepts the connection:
2024/04/17 21:58:14 BGP: [T04AP-5W1P3] [Event] connection from 66.0.0.6 fd 30, active peer status 3 fd -1
2024/04/17 21:58:14 BGP: [HKWM3-ZC5QP] 66.0.0.6 fd 30 went from Idle to Active
2024/04/17 21:58:14 BGP: [ZWCSR-M7FG9] 66.0.0.6 [FSM] TCP_connection_open (Active->OpenSent), fd 30
2024/04/17 21:58:14 BGP: [WECS1-Q4P17] 66.0.0.6 passive open
2024/04/17 21:58:14 BGP: [XKJ09-9VTZ7] 66.0.0.6 Sending hostname cap with hn = frr, dn = (null)
2024/04/17 21:58:14 BGP: [N7XW0-DHZ4E] [BGP_GR] Sending helper Capability for Peer :66.0.0.6 :
2024/04/17 21:58:14 BGP: [NCVS7-K8XXB] [BGP_GR] Sending N-Bit for peer: 66.0.0.6
2024/04/17 21:58:14 BGP: [JFFAN-DEGED] 66.0.0.6 sending OPEN, version 4, my as 200, holdtime 180, id 77.0.0.7
…
2024/04/17 21:58:14 BGP: [HKWM3-ZC5QP] 66.0.0.6 fd 30 went from OpenConfirm to Established
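
For illustration, a minimal sketch of the symmetric check being asked for: a single predicate that both the active-open path (bgp_start) and the passive-open path (bgp_accept) could call, so that the two directions agree. This is not FRR's code. The types and function names (peer_cfg, connected_prefix, ebgp_connected_check_ok, ipv4) are hypothetical, addresses are host-order IPv4 integers, and "single-hop" is assumed to mean a configured TTL of 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified peer configuration (not FRR's struct peer). */
struct peer_cfg {
	bool ebgp;                    /* remote AS differs from local AS */
	uint8_t ttl;                  /* 1 unless ebgp-multihop raises it */
	bool disable_connected_check; /* "neighbor X disable-connected-check" */
};

/* Hypothetical connected subnet on a local interface. */
struct connected_prefix {
	uint32_t addr; /* host byte order */
	uint8_t plen;  /* 0..32 */
};

/* Build a host-order IPv4 address from dotted-quad octets. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

static uint32_t prefix_mask(uint8_t plen)
{
	/* Shifting a 32-bit value by 32 is undefined, so /0 is special-cased. */
	return plen == 0 ? 0 : UINT32_MAX << (32 - plen);
}

/* True if addr lies inside any locally connected subnet. */
static bool is_connected(uint32_t addr, const struct connected_prefix *conn,
			 size_t n)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t m = prefix_mask(conn[i].plen);

		if ((addr & m) == (conn[i].addr & m))
			return true;
	}
	return false;
}

/*
 * One predicate for both directions: evaluate it before opening an outgoing
 * connection and again when an incoming connection is accepted.
 */
static bool ebgp_connected_check_ok(const struct peer_cfg *p, uint32_t peer_addr,
				    const struct connected_prefix *conn, size_t n)
{
	if (!p->ebgp)
		return true; /* iBGP sessions are not subject to this check */
	if (p->ttl > 1)
		return true; /* ebgp-multihop is configured */
	if (p->disable_connected_check)
		return true; /* loopback peering explicitly allowed */
	return is_connected(peer_addr, conn, n);
}
```

Assuming R1's only connected routes are the R1-R2 link subnet and its own 77.0.0.7/32 loopback, a peer at 66.0.0.6 fails this check unless ebgp-multihop or disable-connected-check is configured, whichever side opens the TCP connection.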

=====================================================================
Kindly advise if this behaviour is expected or a bug.


ton31337 commented 4 months ago

That's expected, because the loopback sits on the same device. If you try to establish a session between r1 and r3 in an r1 -- r2 -- r3 topology, you won't be able to without ebgp-multihop.
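
To make the TTL argument concrete: a router decrements TTL only when it forwards a packet, not when it delivers one to its own address. A single-hop eBGP packet sent with TTL 1 therefore reaches a loopback on the adjacent router, but is dropped by r2 on its way from r1 to r3. The helper below (ttl_reaches is a hypothetical name) walks such a path hop by hop, assuming every intermediate router decrements TTL by exactly one and that GTSM (ttl-security) is not in use.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does a packet sent with the given TTL reach a
 * destination that sits behind `forwarding_routers` intermediate routers?
 * Each forwarding router drops the packet when TTL <= 1 and otherwise
 * decrements it by one; local delivery at the destination does not
 * decrement TTL.
 */
static bool ttl_reaches(uint8_t ttl, unsigned int forwarding_routers)
{
	if (ttl == 0)
		return false;
	for (unsigned int i = 0; i < forwarding_routers; i++) {
		if (ttl <= 1)
			return false; /* dropped; ICMP Time Exceeded is typically sent */
		ttl--;
	}
	return true;
}
```

ttl_reaches(1, 0) models R1 reaching R2's loopback and returns true; ttl_reaches(1, 1) models r1 reaching r3 through r2 and returns false. TTL alone therefore cannot stop the loopback session in this issue; that is the job of the connected check discussed in this thread.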

samanvithab commented 4 months ago

I understand regarding ebgp-multihop. But shouldn't we at least need 'disable-connected-check' configured to establish the connection in this case?

As per our FRR user guide: neighbor PEER disable-connected-check

Allow peerings between directly connected eBGP peers using loopback addresses.

So if the above is not configured, shouldn't the implicit behavior be that the connection is not established?

Also, as I mentioned, the behavior differs between bgp_start and bgp_accept: when initiating the connection we perform a connected check and throw an error, but when accepting a connection we do not.

l0crian1 commented 2 months ago

I ran into this issue as well while helping someone with a problem they were having. It is not simply that the peering comes up when it shouldn't; the routing also behaves differently with and without ebgp-multihop.

The user was attempting to configure an MPLS L3VPN. Without ebgp-multihop, the prefix in their VRF was inactive and looked like this:

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF T1:
B   150.0.0.1/32 [20/0] via 6.1.2.217 (vrf default) inactive, label 17, weight 1, 00:12:38

But with ebgp-multihop (or disable-connected-check), the prefix was valid and had a transport/vpn label as expected.

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF T1:
B>  150.0.0.1/32 [20/0] via 6.1.2.217 (vrf default) (recursive), label 17, weight 1, 00:00:02
  *                       via 169.254.66.90, tun5002 (vrf default), label implicit-null/17, weight 1, 00:00:02
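
One plausible explanation for the inactive route, stated as an assumption rather than confirmed from FRR's source: for a single-hop eBGP peer (no ebgp-multihop, no disable-connected-check), the route is installed without permission to resolve its nexthop recursively, so zebra accepts the nexthop only if it is covered by a connected route; for multihop peers, recursive resolution (as via tun5002 above) is allowed. The sketch models that rule with a hypothetical RIB and longest-prefix match; all type and function names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified RIB model; not zebra's data structures. */
enum route_type { ROUTE_CONNECTED, ROUTE_OTHER };

struct rib_entry {
	uint32_t prefix; /* host byte order */
	uint8_t plen;    /* 0..32 */
	enum route_type type;
};

static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

static uint32_t prefix_mask(uint8_t plen)
{
	return plen == 0 ? 0 : UINT32_MAX << (32 - plen);
}

/* Longest-prefix match; NULL if no entry covers addr. */
static const struct rib_entry *lpm(uint32_t addr, const struct rib_entry *rib,
				   size_t n)
{
	const struct rib_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t m = prefix_mask(rib[i].plen);

		if ((addr & m) == (rib[i].prefix & m) &&
		    (best == NULL || rib[i].plen > best->plen))
			best = &rib[i];
	}
	return best;
}

/* Assumed rule: only single-hop eBGP peers without disable-connected-check
 * are restricted to connected nexthops. */
static bool allow_recursion(bool ebgp, uint8_t ttl, bool disable_connected_check)
{
	return !ebgp || ttl > 1 || disable_connected_check;
}

/* Active if the best match is connected, or any match when recursion is ok. */
static bool nexthop_active(bool recursion_ok, uint32_t nexthop,
			   const struct rib_entry *rib, size_t n)
{
	const struct rib_entry *r = lpm(nexthop, rib, n);

	if (r == NULL)
		return false;
	return r->type == ROUTE_CONNECTED || recursion_ok;
}
```

Under this model, a nexthop such as 6.1.2.217 that is covered only by a non-connected route is inactive for a single-hop peer and becomes active once recursion is allowed, which matches the difference between the two outputs above.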

nandini660 commented 1 month ago

I went through this issue. We can establish an eBGP session with a static peer without the ebgp-multihop/disable-connected-check commands. Since TTL is not affected in this scenario, why should ebgp-multihop or disable-connected-check be necessary when there is no intermediate router?

R1:
ip route 67.0.0.6/32 192.18.0.3
interface lo
 ip address 76.0.0.7/32
router bgp 100
 neighbor 67.0.0.6 remote-as 200
 neighbor 67.0.0.6 update-source 76.0.0.7

R2:
ip route 76.0.0.7/32 192.18.0.2
interface lo
 ip address 67.0.0.6/32
router bgp 200
 neighbor 76.0.0.7 remote-as 100
 neighbor 76.0.0.7 update-source 67.0.0.6

msysfeet commented 1 month ago

@ton31337, @l0crian1

Could you please explain why this is a buggy scenario?

ton31337 commented 1 month ago

Where do you see a bug here?

l0crian1 commented 1 month ago

> @ton31337, @l0crian1
>
> Could you please explain why this is a buggy scenario?

I don't know if I'd fully classify it as a bug, but the behavior is inconsistent both with other products and with FRR's documentation (https://docs.frrouting.org/en/latest/bgp.html#clicmd-neighbor-PEER-ebgp-multihop).

Typically, without ebgp-multihop configured, an eBGP session to a neighbor that is not directly connected will not come up. The current documentation states:

Specifying ebgp-multihop allows sessions with eBGP neighbors to establish when they are multiple hops away. When the neighbor is not directly connected and this knob is not enabled, the session will not establish.

Furthermore, the documentation states the following for disable-connected-check, yet the session clearly comes up without it:

Allow peerings between directly connected eBGP peers using loopback addresses.

The behavior is inconsistent with the rest of the industry, and even though the session comes up without the expected additional commands, the routing is not valid without them. Because the session is up, the reason for the inactive routes may not be immediately apparent, leading to unnecessary troubleshooting.

Just my opinion, but this behavior should mirror that of most of the industry: the session shouldn't even come up without ebgp-multihop or disable-connected-check.

nandini660 commented 1 month ago

According to my observations of dynamic peer behavior, an eBGP session won't come up if neither ebgp-multihop nor disable-connected-check is used. Can you please share the details for reproducing this bug?