FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

FRR 7.3.1 cannot establish BGP session to Cisco ASR1002 w/ IOS 16.03 #6915

Closed notsethw closed 4 years ago

notsethw commented 4 years ago

Describe the bug

I am trying to establish a very basic BGP session between FRR and a Cisco ASR over an IPsec/VTI tunnel. I can ping over the tunnel just fine, and static routing over the tunnel works. The provider has given us an internal AS (64597) to use, and we are peering with their external AS.

When I try to bring up BGP to the Cisco, the session never establishes. A tcpdump shows the Cisco side replying with: OPEN Message Error (2), subcode Unknown (0)

Here is the FRR/BGP configuration. (XXX is the remote-as, which has been removed; I do know that the AS numbers match on both sides.) Our VTI IP = 172.29.4.171; provider VTI IP = 172.29.4.170

!
router bgp 64597
 bgp router-id 10.254.1.100
 bgp log-neighbor-changes
 neighbor 172.29.4.170 remote-as XXX
 neighbor 172.29.4.170 description [redacted]
 neighbor 172.29.4.170 update-source 172.29.4.171
 neighbor 172.29.4.170 password [redacted]
 neighbor 172.29.4.170 ebgp-multihop 255
!
 address-family ipv4 unicast
    network X.X.X.X/32
    redistribute kernel
    redistribute connected
    redistribute static
    neighbor 172.29.4.170 activate
    no neighbor 172.29.4.170 send-community
    neighbor 172.29.4.170 prefix-list [redacted] in
    neighbor 172.29.4.170 prefix-list [redacted] out
  exit-address-family
!

tcpdump of session trying to establish:

01:06:15.131468 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 155)
    172.29.4.171.179 > 172.29.4.170.46544: Flags [P.], cksum 0x2b92 (correct), seq 1:96, ack 58, win 65535, options [nop,nop,md5 valid], length 95: BGP
    Open Message (1), length: 95
      Version 4, my AS 64597, Holdtime 180s, ID 10.254.6.177
      Optional parameters, length: 66
        Option Capabilities Advertisement (2), length: 6
          Multiprotocol Extensions (1), length: 4
        AFI IPv4 (1), SAFI Unicast (1)
        Option Capabilities Advertisement (2), length: 2
          Route Refresh (Cisco) (128), length: 0
        Option Capabilities Advertisement (2), length: 2
          Route Refresh (2), length: 0
        Option Capabilities Advertisement (2), length: 6
          32-Bit AS Number (65), length: 4
         4 Byte AS 64597
        Option Capabilities Advertisement (2), length: 6
          Multiple Paths (69), length: 4
        AFI IPv4 (1), SAFI Unicast (1), Send/Receive: Receive
        Option Capabilities Advertisement (2), length: 26
          Unknown (73), length: 24
        no decoder for Capability 73
        0x0000:  1670 6653 656e 7365 3031 2e61 7474 2e69
        0x0010:  6e74 6572 6e61 6c00
        Option Capabilities Advertisement (2), length: 4
          Graceful Restart (64), length: 2
        Restart Flags: [none], Restart Time 120s
01:06:15.148456 IP (tos 0xc0, ttl 1, id 18558, offset 0, flags [DF], proto TCP (6), length 81)
    172.29.4.170.46544 > 172.29.4.171.179: Flags [P.], cksum 0xa5df (correct), seq 58:79, ack 96, win 16289, options [md5 valid,eol], length 21: BGP
    Notification Message (3), length: 21, OPEN Message Error (2), subcode Unknown (0)
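
For readers who want to see how the OPEN message's optional parameters in the capture above are laid out, here is a minimal Python sketch that walks the type/length/value structure and lists each capability code. The names `iter_capabilities` and `capability_param` are made up for illustration, the capability names follow the IANA registry, and the sample bytes are built by hand (including a hypothetical hostname `edge01`), not copied from the capture.

```python
import struct

# Capability codes as listed in the IANA "Capability Codes" registry.
CAPABILITY_NAMES = {
    1: "Multiprotocol Extensions",
    2: "Route Refresh",
    64: "Graceful Restart",
    65: "4-octet AS Number",
    69: "ADD-PATH",
    70: "Enhanced Route Refresh",
    73: "FQDN",
    128: "Route Refresh (pre-standard, Cisco)",
}


def iter_capabilities(opt_params: bytes):
    """Yield (code, value) for each capability in OPEN optional parameters."""
    pos = 0
    while pos < len(opt_params):
        param_type, param_len = opt_params[pos], opt_params[pos + 1]
        body = opt_params[pos + 2 : pos + 2 + param_len]
        pos += 2 + param_len
        if param_type != 2:  # only parameter type 2 (Capabilities) is decoded
            continue
        cpos = 0
        while cpos < len(body):
            code, cap_len = body[cpos], body[cpos + 1]
            yield code, body[cpos + 2 : cpos + 2 + cap_len]
            cpos += 2 + cap_len


def capability_param(code: int, value: bytes = b"") -> bytes:
    """Hypothetical helper: wrap one capability in its own optional parameter."""
    cap = bytes([code, len(value)]) + value
    return bytes([2, len(cap)]) + cap


# Hand-built sample that mirrors part of the capture's structure.
sample = (
    capability_param(1, struct.pack("!HBB", 1, 0, 1))  # AFI 1, reserved, SAFI 1
    + capability_param(128)
    + capability_param(2)
    + capability_param(65, struct.pack("!I", 64597))   # 4-octet AS number
    + capability_param(73, b"\x06edge01\x00")          # hypothetical hostname
)

for code, value in iter_capabilities(sample):
    print(code, CAPABILITY_NAMES.get(code, "unknown"), len(value))
```

As a sanity check on the layout, the seven options in the capture add up to 8 + 4 + 4 + 8 + 8 + 28 + 6 = 66 bytes, which matches the reported optional parameter length.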

If I add dont-capability-negotiate to the neighbor configuration, the session actually seems to partially establish. At that point, FRR/bgpd responds to the remote side with:

172.29.1.171.31470 > 172.29.1.170.179: Flags [P.], cksum 0x0b58 (correct), seq 49:70, ack 245, win 65535, options [nop,nop,md5 valid], length 21: BGP
       Notification Message (3), length: 21, UPDATE Message Error (3), subcode Malformed AS_PATH (11)

While using dont-capability-negotiate, the session actually goes into the "Active" state and I can see (tcpdump) that their router is trying to send us routes.
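
The two NOTIFICATION messages in the captures above carry only an error code and a subcode after the 19-byte BGP header. The following is a small Python sketch that maps those two bytes to names, using the tables from RFC 4271 (with OPEN subcode 7 from RFC 5492). The function names are hypothetical, and the messages are built by hand rather than taken from the capture.

```python
import struct

# Error codes and subcodes from RFC 4271, section 4.5.
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}
SUBCODES = {
    2: {
        1: "Unsupported Version Number",
        2: "Bad Peer AS",
        3: "Bad BGP Identifier",
        4: "Unsupported Optional Parameter",
        6: "Unacceptable Hold Time",
        7: "Unsupported Capability",  # added by RFC 5492
    },
    3: {
        1: "Malformed Attribute List",
        2: "Unrecognized Well-known Attribute",
        3: "Missing Well-known Attribute",
        4: "Attribute Flags Error",
        5: "Attribute Length Error",
        6: "Invalid ORIGIN Attribute",
        8: "Invalid NEXT_HOP Attribute",
        9: "Optional Attribute Error",
        10: "Invalid Network Field",
        11: "Malformed AS_PATH",
    },
}


def build_notification(code: int, subcode: int, data: bytes = b"") -> bytes:
    """Hypothetical helper: marker (16 x 0xff), length, type 3, code, subcode."""
    length = 19 + 2 + len(data)
    return b"\xff" * 16 + struct.pack("!HB", length, 3) + bytes([code, subcode]) + data


def describe_notification(message: bytes) -> str:
    """Return a readable description of a complete BGP NOTIFICATION message."""
    _length, msg_type = struct.unpack("!HB", message[16:19])
    if msg_type != 3:
        raise ValueError("not a NOTIFICATION message")
    code, subcode = message[19], message[20]
    code_name = ERROR_CODES.get(code, "Unknown")
    if subcode == 0:
        sub_name = "Unspecific"  # RFC 4271: zero means no specific subcode
    else:
        sub_name = SUBCODES.get(code, {}).get(subcode, "Unknown")
    return f"{code_name} ({code}), subcode {sub_name} ({subcode})"


print(describe_notification(build_notification(2, 0)))
print(describe_notification(build_notification(3, 11)))
```

Note that tcpdump prints subcode 0 as "Unknown", while RFC 4271 calls it "Unspecific"; both refer to the same value.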

show ip bgp sum:

[redacted]> show bgp summary 

IPv4 Unicast Summary:
BGP router identifier 10.254.6.177, local AS number 64597 vrf-id 0
BGP table version 32
RIB entries 58, using 10672 bytes of memory
Peers 2, using 27 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
172.16.99.2     4      65551     321     280        0    0    0 04:36:35           10
172.29.4.170    4        XXX     274     411        0    0    0    never       Active

Total number of neighbors 2

The goal here is to be able to either:

1) establish a session without using "dont-capability-negotiate" (preferred), OR
2) establish it with dont-capability-negotiate, if the Malformed AS_PATH issue can be resolved.

Versions

Additional context

This is on a pfSense device, for which I have support. Support has been unable to resolve the issue thus far, so I am doing the best I can to resolve it on my own. I do not have access to the Cisco device on the other end, and the provider is being particularly unhelpful in troubleshooting. Their configuration for us is relatively straightforward, though, and does not seem to specify anything odd or unnecessary.

Provider configuration:

!
router bgp XXX
neighbor [PEERGROUPNAME] peer-group
neighbor [PEERGROUPNAME] remote-as 64597
neighbor [PEERGROUPNAME] description [REDACTED]
neighbor [PEERGROUPNAME] password xxxxxxxx
neighbor 172.29.4.171 peer-group [PEERGROUPNAME]
!
address-family ipv4
neighbor [PEERGROUPNAME] remove-private-as
neighbor [PEERGROUPNAME] route-map [REDACTED] in
neighbor [PEERGROUPNAME] route-map [REDACTED] out
!
neighbor 172.29.4.171 activate
end

I've tried to be as detailed (but brief) as possible in explaining this issue. I apologize if I've missed any necessary details. Thanks!

ton31337 commented 4 years ago

Could you show the output of show ip route 172.29.4.170 from the FRR node?

notsethw commented 4 years ago

Copyright 1996-2005 Kunihiro Ishiguro, et al.

[redacted]# show ip route 172.29.4.170
Routing entry for 172.29.4.168/30
  Known via "connected", distance 0, metric 1, best
  Last update 19:34:24 ago
  * directly connected, ipsec4000

As an aside: I set up a Cisco CSR1000V with IOS XE 17.02.01r on our side. I am able to talk to it from frr/bgpd without any issues, so it does appear there is something in our peer's (ASR1002 w/ IOS 16.03) configuration that is unable to handle what frr/bgpd is trying to negotiate with capabilities?

Thanks!

ton31337 commented 4 years ago

Could you turn on debug bgp updates, debug zebra updates, debug bgp neighbor-changes and paste the information here?

notsethw commented 4 years ago

hi @ton31337 ,

It looks like debug zebra updates and debug bgp neighbor-changes do not exist in my version of FRR, but debug bgp neighbor-events does. (debug bgp updates also does work; debug bgp zebra also exists)

[redacted]# show debug
Zebra debugging status:

BGP debugging status:
  BGP neighbor-events debugging is on
  BGP updates debugging is on (inbound)
  BGP updates debugging is on (outbound)
  BGP zebra debugging is on
Aug 17 17:59:33 pfSense02 bgpd[56417]: [Event] BGP connection from host 172.29.1.170 fd 21
Aug 17 17:59:33 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Active established_peers 1
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 went from Idle to Active
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] TCP_connection_open (Active->OpenSent), fd 21
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 passive open
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 Sending hostname cap with hn = pfSense02.[redacted], dn = (null)
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 sending OPEN, version 4, my as 64597, holdtime 180, id 10.254.1.100
Aug 17 17:59:33 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 1
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 went from Active to OpenSent
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcv OPEN, version 4, remote-as (in open) [redacted, remote AS], holdtime 180, id [redacted, remote router IP]
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcv OPEN w/ OPTION parameter len: 28
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has MultiProtocol Extensions capability (1), length 4
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has MP_EXT CAP for afi/safi: IPv4/unicast
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has Route Refresh (Old) capability (128), length 0
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has Route Refresh capability (2), length 0
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has (no message found) capability (70), length 0
Aug 17 17:59:33 pfSense02 bgpd[56417]: [EC 33554503] 172.29.1.170 unrecognized capability code: 70 - ignored
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has 4-octet AS number capability (65), length 4
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Receive_OPEN_message (OpenSent->OpenConfirm), fd 21
Aug 17 17:59:33 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: OpenConfirm established_peers 1
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 went from OpenSent to OpenConfirm
Aug 17 17:59:33 pfSense02 bgpd[56417]: %NOTIFICATION: received from neighbor 172.29.1.170 2/0 (OPEN Message Error/Unspecific) 0 bytes
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Receive_NOTIFICATION_message (OpenConfirm->Idle), fd 21
Aug 17 17:59:33 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 1
Aug 17 17:59:33 pfSense02 bgpd[56417]: 172.29.1.170 went from OpenConfirm to Deleted
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Timer (connect timer expire)
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] ConnectRetry_timer_expired (Active->Connect), fd -1
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [Event] Connect start to 172.29.1.170 fd 21
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Non blocking connect waiting result, fd 21
Aug 17 17:59:49 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 1
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 went from Active to Connect
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] TCP_connection_open (Connect->OpenSent), fd 21
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 open active, local address 172.29.1.171
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 Sending hostname cap with hn = pfSense02.[redacted], dn = (null)
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 sending OPEN, version 4, my as 64597, holdtime 180, id 10.254.1.100
Aug 17 17:59:49 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 1
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 went from Connect to OpenSent
Aug 17 17:59:49 pfSense02 bgpd[56417]: %NOTIFICATION: received from neighbor 172.29.1.170 2/0 (OPEN Message Error/Unspecific) 0 bytes
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Receive_NOTIFICATION_message (OpenSent->Idle), fd 21
Aug 17 17:59:49 pfSense02 bgpd[56417]: [EC 33554465] 172.29.1.170 [FSM] unexpected packet received in state OpenSent
Aug 17 17:59:49 pfSense02 bgpd[56417]: %NOTIFICATION: sent to neighbor 172.29.1.170 5/0 (Neighbor Events Error) 0 bytes
Aug 17 17:59:49 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Idle established_peers 1
Aug 17 17:59:49 pfSense02 bgpd[56417]: 172.29.1.170 went from OpenSent to Idle
Aug 17 17:59:50 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Timer (start timer expire).
Aug 17 17:59:50 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] BGP_Start (Idle->Connect), fd -1
Aug 17 17:59:50 pfSense02 bgpd[56417]: 172.29.1.170 [Event] Connect start to 172.29.1.170 fd 21
Aug 17 17:59:50 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Non blocking connect waiting result, fd 21

You might notice that these are from a different subnet than I mentioned above; we have two pfSense routers set to peer with this provider. Unfortunately, the router I was debugging from before has arbitrarily stopped writing the bgpd messages to disk, so I can't provide output from there. The output I'm providing is from the other router, which has the identical configuration. (BGP setkey is enabled right now.)

ton31337 commented 4 years ago

Sorry, debug zebra events and debug zebra nht.

notsethw commented 4 years ago

Got it. Enabled those.

pfSense02[redacted]l# show debug
Zebra debugging status:
  Zebra event debugging is on
  Zebra next-hop tracking debugging is on

BGP debugging status:
  BGP neighbor-events debugging is on
  BGP updates debugging is on (inbound)
  BGP updates debugging is on (outbound)
  BGP zebra debugging is on

I am still not seeing any zebra-specific log messages, so hopefully this is what you are looking for:

Aug 17 19:49:54 pfSense02 bgpd[56417]: [Event] BGP connection from host 172.29.1.170 fd 25
Aug 17 19:49:54 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Active established_peers 1
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 went from Idle to Active
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] TCP_connection_open (Active->OpenSent), fd 25
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 passive open
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 Sending hostname cap with hn = pfSense02.[redacted], dn = (null)
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 sending OPEN, version 4, my as 64597, holdtime 180, id 10.254.1.100
Aug 17 19:49:54 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 1
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 went from Active to OpenSent
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcv OPEN, version 4, remote-as (in open) 797, holdtime 180, id [remote peer redacted]
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcv OPEN w/ OPTION parameter len: 28
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has MultiProtocol Extensions capability (1), length 4
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has MP_EXT CAP for afi/safi: IPv4/unicast
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has Route Refresh (Old) capability (128), length 0
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has Route Refresh capability (2), length 0
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has (no message found) capability (70), length 0
Aug 17 19:49:54 pfSense02 bgpd[56417]: [EC 33554503] 172.29.1.170 unrecognized capability code: 70 - ignored
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 OPEN has 4-octet AS number capability (65), length 4
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Receive_OPEN_message (OpenSent->OpenConfirm), fd 25
Aug 17 19:49:54 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: OpenConfirm established_peers 1
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 went from OpenSent to OpenConfirm
Aug 17 19:49:54 pfSense02 bgpd[56417]: %NOTIFICATION: received from neighbor 172.29.1.170 2/0 (OPEN Message Error/Unspecific) 0 bytes
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 [FSM] Receive_NOTIFICATION_message (OpenConfirm->Idle), fd 25
Aug 17 19:49:54 pfSense02 bgpd[56417]: bgp_fsm_change_status : vrf default(0), Status: Deleted established_peers 1
Aug 17 19:49:54 pfSense02 bgpd[56417]: 172.29.1.170 went from OpenConfirm to Deleted

notsethw commented 4 years ago

Finally got some logs from the Cisco side. It looks like we are advertising capabilities the Cisco cannot handle:

Aug 18 15:31:55 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) act Reset (Active open failed).
Aug 18 15:31:55 UTC: BGP: 172.29.4.171 active went from Active to Idle
Aug 18 15:31:55 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:31:55 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:32:02 UTC: BGP: 172.29.4.171 active went from Idle to Active
Aug 18 15:32:02 UTC: BGP: 172.29.4.171 open active, local address 172.29.4.170
Aug 18 15:32:16 UTC: BGP: topo global:IPv4 Unicast:base Scanning routing tables
Aug 18 15:32:16 UTC: BGP: topo global:IPv4 Multicast:base Scanning routing tables
Aug 18 15:32:16 UTC: BGP: topo global:L2VPN E-VPN:base Scanning routing tables
Aug 18 15:32:16 UTC: BGP: topo global:MVPNv4 Unicast:base Scanning routing tables
Aug 18 15:32:32 UTC: BGP: 172.29.4.171 open failed: Connection timed out; remote host not responding
Aug 18 15:32:32 UTC: BGP: 172.29.4.171 Active open failed - tcb is not available, open active delayed 12288ms (35000ms max, 60% jitter)
Aug 18 15:32:32 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) act Reset (Active open failed).
Aug 18 15:32:32 UTC: BGP: 172.29.4.171 active went from Active to Idle
Aug 18 15:32:32 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:32:32 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:32:44 UTC: BGP: 172.29.4.171 active went from Idle to Active
Aug 18 15:32:44 UTC: BGP: 172.29.4.171 open active, local address 172.29.4.170
Aug 18 15:33:14 UTC: BGP: 172.29.4.171 open failed: Connection timed out; remote host not responding
Aug 18 15:33:14 UTC: BGP: 172.29.4.171 Active open failed - tcb is not available, open active delayed 9216ms (35000ms max, 60% jitter)
Aug 18 15:33:14 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) act Reset (Active open failed).
Aug 18 15:33:14 UTC: BGP: 172.29.4.171 active went from Active to Idle
Aug 18 15:33:14 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:33:14 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:33:16 UTC: BGP: topo global:IPv4 Unicast:base Scanning routing tables
Aug 18 15:33:16 UTC: BGP: topo global:IPv4 Multicast:base Scanning routing tables
Aug 18 15:33:16 UTC: BGP: topo global:L2VPN E-VPN:base Scanning routing tables
Aug 18 15:33:16 UTC: BGP: topo global:MVPNv4 Unicast:base Scanning routing tables
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive open to 172.29.4.170
Aug 18 15:33:19 UTC: BGP: Fetched peer 172.29.4.171 from tcb
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive went from Idle to Connect
Aug 18 15:33:19 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) pas Setting open delay timer to 60 seconds.
Aug 18 15:33:19 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) pas read request no-op
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcv message type 1, length (excl. header) 76
Aug 18 15:33:19 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) pas Receive OPEN
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcv OPEN, version 4, holdtime 180 seconds
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcv OPEN w/ OPTION parameter len: 66
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 1, length 4
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has MP_EXT CAP for afi/safi: 1/1
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 128, length 0
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has ROUTE-REFRESH capability(old) for all address-families
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 2
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 2, length 0
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has ROUTE-REFRESH capability(new) for all address-families
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 65, length 4
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has 4-byte ASN CAP for: 64597
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 6
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 69, length 4
Aug 18 15:33:19 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) pas Add Path not supported for EBGP nbr 172.29.4.171.
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 26
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 73, length 24
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive unrecognized capability code: 73
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive malformed/un-supported OPEN capability
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive went from Connect to Closing
Aug 18 15:33:19 UTC: BGP: ses global 172.29.4.171 (0x7FA8288AB720:0) pas Send NOTIFICATION 2/0 (open: unspecific subcode) 0 bytes
Aug 18 15:33:23 UTC: BGP: 172.29.4.171 passive local error close after sending NOTIFICATION
Aug 18 15:33:23 UTC: BGP: 172.29.4.171 active went from Idle to Active
Aug 18 15:33:23 UTC: BGP: 172.29.4.171 open active, local address 172.29.4.170
Aug 18 15:33:23 UTC: BGP: 172.29.4.171 passive closing
Aug 18 15:33:23 UTC: BGP: 172.29.4.171 passive went from Closing to Idle
Aug 18 15:33:53 UTC: BGP: 172.29.4.171 open failed: Connection timed out; remote host not responding
Aug 18 15:33:53 UTC: BGP: 172.29.4.171 Active open failed - tcb is not available, open active delayed 14336ms (35000ms max, 60% jitter)
Aug 18 15:33:53 UTC: BGP: ses global 172.29.4.171 (0x7FA8288B3CA8:0) act Reset (Active open failed).
Aug 18 15:33:53 UTC: BGP: 172.29.4.171 active went from Active to Idle
Aug 18 15:33:53 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:33:53 UTC: BGP: nbr global 172.29.4.171 Active open failed - open timer running
Aug 18 15:34:07 UTC: BGP: 172.29.4.171 active went from Idle to Active
Aug 18 15:34:07 UTC: BGP: 172.29.4.171 open active, local address 172.29.4.170

The important bit seems to be:

Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive rcvd OPEN w/ optional parameter type 2 (Capability) len 26
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive OPEN has CAPABILITY code: 73, length 24
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive unrecognized capability code: 73
Aug 18 15:33:19 UTC: BGP: 172.29.4.171 passive malformed/un-supported OPEN capability

https://www.iana.org/assignments/capability-codes/capability-codes.xhtml Code 73 – “FQDN capability”
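
To make the contents of capability 73 concrete, here is a minimal Python sketch of a decoder. It assumes the layout from the BGP hostname capability draft as I understand it: a one-byte hostname length, the hostname, a one-byte domain name length, then the domain name. The function name and the sample values (`edge01`, `example.net`) are hypothetical.

```python
def decode_fqdn_capability(value: bytes) -> tuple[str, str]:
    """Decode a capability 73 value into (hostname, domain name).

    Assumed layout: hostname length (1 byte), hostname,
    domain name length (1 byte), domain name.
    """
    if len(value) < 2:
        raise ValueError("capability value too short")
    host_len = value[0]
    if len(value) < 2 + host_len:
        raise ValueError("hostname length exceeds capability value")
    hostname = value[1 : 1 + host_len]
    domain_len = value[1 + host_len]
    domain = value[2 + host_len : 2 + host_len + domain_len]
    if 2 + host_len + domain_len != len(value):
        raise ValueError("lengths do not add up to the capability length")
    return hostname.decode("ascii"), domain.decode("ascii")


# Hypothetical payloads, not taken from the capture.
print(decode_fqdn_capability(b"\x06edge01\x0bexample.net"))
print(decode_fqdn_capability(b"\x06edge01\x00"))
```

The 24-byte value in the tcpdump output is consistent with this layout: a length byte of 0x16 (22), 22 hostname bytes, and a trailing zero domain length, which also lines up with the "dn = (null)" in the bgpd log.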

I had mentioned in my original comment:

2) establish it with dont-capability-negotiate, if the Malformed AS_PATH issue can be resolved.

Is there any possibility for us to resolve/understand why FRR thinks the AS_PATH is malformed, when we configure "dont-capability-negotiate"?

Thanks so much

ton31337 commented 4 years ago

Are you able to test with the latest master branch? We already have a helper there that dumps all the attributes in such situations. It would show something more useful.

notsethw commented 4 years ago

hi @ton31337 ,

I am not able to test with the latest master branch as it's not available on pfsense and I don't have the time to build a machine from scratch and test it against this BGP peer.

I was finally able to get this working with "dont-capability-negotiate" enabled on both sides. Initially the vendor stated that this option was not available in their version of IOS, but it turns out that it was.

It does look like there is still an issue with FRR/bgpd causing some versions of Cisco IOS to drop the BGP connection abruptly, which is what we were experiencing:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCva92216

BGP session will not come up with peer if an unrecognized or unsupported capability is received from the peer in the BGP OPEN. Message similar to the following might be observed:

Aug  4 16:20:22.627: %BGP-3-NOTIFICATION: sent to neighbor 10.100.100.100 active 2/0 (open: unspecific subcode) 0 bytes
Aug  4 16:20:22.627: %BGP-4-MSGDUMP: unsupported or mal-formatted message received from 10.100.100.100:

That being said, I do not feel this is an FRR specific issue, but it does hinder the compatibility with Cisco IOS devices < 16.3.6 (when this bug was resolved).
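
A quick way to triage peers against the version threshold mentioned above is to compare version tuples. This is only a sketch: the 16.3.6 threshold comes from this comment and has not been checked against Cisco's bug record, the function names are made up, and a version below the threshold is not proof that a given device is affected.

```python
import re

# Threshold taken from the comment above; treat it as an assumption.
FIXED_IN = (16, 3, 6)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '16.03.06' or '17.02.01r' into a tuple of integers."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def predates_fix(version: str) -> bool:
    """True if the version sorts below FIXED_IN (missing fields count as 0)."""
    parsed = parse_version(version)
    padded = parsed + (0,) * (len(FIXED_IN) - len(parsed))
    return padded < FIXED_IN


for candidate in ("16.03", "16.03.06", "17.02.01r"):
    print(candidate, predates_fix(candidate))
```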

If you'd like to close this case, I understand. Thank you for your assistance.

ton31337 commented 4 years ago

@notsethw yeah, it seems pretty clear that Cisco does not handle that correctly. An unrecognized capability MUST be ignored and MUST NOT cause the session to be torn down.

ton31337 commented 4 years ago

@polychaeta autoclose in 1s.