FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.33k stars 1.25k forks source link

FRR8.0: bgp view no longer works #9519

Closed pfsinoz closed 3 years ago

pfsinoz commented 3 years ago

Description: bgp view no longer works as from FRR 8.0

Platform: Ubuntu 18.04 and Ubuntu 20.04, both running 5.4.0-81-generic kernel.

Comments

I've used the bgp view configuration for many months on FRR 7.5.1, no issues at all. The recent upgrade to FRR 8.0.0 broke all the peerings - BGP fails to come up, with the sessions sitting in "Connect" and nothing more. There are no log messages I can find that indicate anything is amiss. And there is nothing on the other side either - in this case the Cisco router just cycles through trying to establish the BGP session.

I've tried FRR 8.0.1 released on the 29th July, no improvement over 8.0.

And I've also tried frr_8.1-dev-master-g26bf593-20210828.132044-1~ubuntu18.04.1_amd64.deb without any luck either.

Sample Configuration

Configuration taken from my testbed, as an example:

router bgp 64512 view C1100
 bgp router-id 192.168.1.230
 bgp log-neighbor-changes
 no bgp default ipv4-unicast
 bgp deterministic-med
 timers bgp 15 45
 neighbor 192.168.1.254 remote-as 64512
 neighbor 192.168.1.254 description Cisco C1100 Border
 neighbor 2001:db8:1:1::ffff remote-as 64512
 neighbor 2001:db8:1:1::ffff description Cisco C1100 Border
 !
 address-family ipv4 unicast
  distance bgp 200 200 200
  neighbor 192.168.1.254 activate
 exit-address-family
 !
 address-family ipv6 unicast
  distance bgp 200 200 200
  neighbor 2001:db8:1:1::ffff activate
 exit-address-family
! 

Using FRR 7.5.1, the peerings with 192.168.1.254 and 2001:db8:1:1::ffff establish without any issue - for example:

frr# sh bgp view C1100 sum

IPv4 Unicast Summary:
BGP router identifier 192.168.1.230, local AS number 64512 vrf-id 1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.1.254   4      64512      1378      1282        0    0    0 05:20:08            0        0

Total number of neighbors 1

IPv6 Unicast Summary:
BGP router identifier 192.168.1.230, local AS number 64512 vrf-id 1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor                 V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
2001:db8:1:1::ffff       4      64512      1374      1282        0    0    0 05:20:08            0        0

Total number of neighbors 1
mwinter-osr commented 3 years ago

Can you try with adding no bgp ebgp-requires-policy under router BGP ? Not sure when this was added, but a newer RFC which we support requires a policy as a condition to bring up the peer.

pfsinoz commented 3 years ago

7.5.1 required the ebgp policy by default (yay!) so my live running configs have policy in them. The test bed does too, but I was a bit too aggressive at redacting bits - so no redacting now!

So I've tried no bgp ebgp-requires-policy with 8.1-dev-master-g26bf593-20210828.132044-1 - no difference:

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.1.254   4     131076         0         0        0    0    0    never      Connect        0 Cisco

Neighbor                 V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
2001:44b8:2134:800::ffff 4     131076         0         0        0    0    0    never      Connect        0 Cisco

Here is the complete config for reference:

router bgp 131076 view C1100
 bgp router-id 192.168.1.230
 bgp log-neighbor-changes
 no bgp ebgp-requires-policy
 no bgp suppress-duplicates
 no bgp default ipv4-unicast
 bgp deterministic-med
 timers bgp 15 45
 neighbor 192.168.1.254 remote-as 131076
 neighbor 192.168.1.254 description Cisco C1100 Border
 neighbor 2001:44b8:2134:800::ffff remote-as 131076
 neighbor 2001:44b8:2134:800::ffff description Cisco C1100 Border
 !
 address-family ipv4 unicast
  distance bgp 200 200 200
  neighbor 192.168.1.254 activate
  neighbor 192.168.1.254 route-map IMPORT in
  neighbor 192.168.1.254 route-map EXPORT out
 exit-address-family
 !
 address-family ipv6 unicast
  distance bgp 200 200 200
  neighbor 2001:44b8:2134:800::ffff activate
  neighbor 2001:44b8:2134:800::ffff route-map IMPORT in
  neighbor 2001:44b8:2134:800::ffff route-map EXPORT out
 exit-address-family
exit
!
route-map IMPORT permit 10
 description Allow everything
exit
!
route-map EXPORT deny 10
 description Deny everything
exit

The only difference between this config and the original posting is the no bgp ebgp-requires-policy has been added.

If I stop and restart FRR8.1-dev (systemctl restart frr), the logs (/var/log/frr/frr.log) say this:

Aug 31 09:24:55 frr watchfrr[21791]: [T83RR-8SM5G] watchfrr 8.1-dev starting: vty@0
Aug 31 09:24:55 frr watchfrr[21791]: [ZCJ3S-SPH5S] zebra state -> down : initial connection attempt failed
Aug 31 09:24:55 frr watchfrr[21791]: [ZCJ3S-SPH5S] bgpd state -> down : initial connection attempt failed
Aug 31 09:24:55 frr watchfrr[21791]: [ZCJ3S-SPH5S] staticd state -> down : initial connection attempt failed
Aug 31 09:24:55 frr watchfrr[21791]: [YFT0P-5Q5YX] Forked background command [pid 21792]: /usr/lib/frr/watchfrr.sh restart all
Aug 31 09:24:55 frr bgpd[21812]: [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information
Aug 31 09:24:55 frr bgpd[21812]: message repeated 11 times: [ [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information]
Aug 31 09:24:55 frr zebra[21807]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 31 09:24:55 frr bgpd[21812]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 31 09:24:55 frr watchfrr[21791]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Aug 31 09:24:55 frr watchfrr[21791]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Aug 31 09:24:55 frr watchfrr[21791]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Aug 31 09:24:55 frr watchfrr[21791]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify

The No zebra instance to talk to is a bit worrying...

I get the same with the released FRR 8.0.1:

Aug 27 04:42:11 frr watchfrr[11902]: [T83RR-8SM5G] watchfrr 8.0 starting: vty@0
Aug 27 04:42:11 frr watchfrr[11902]: [ZCJ3S-SPH5S] zebra state -> down : initial connection attempt failed
Aug 27 04:42:11 frr watchfrr[11902]: [ZCJ3S-SPH5S] bgpd state -> down : initial connection attempt failed
Aug 27 04:42:11 frr watchfrr[11902]: [ZCJ3S-SPH5S] staticd state -> down : initial connection attempt failed
Aug 27 04:42:11 frr watchfrr[11902]: [YFT0P-5Q5YX] Forked background command [pid 11903]: /usr/lib/frr/watchfrr.sh restart all
Aug 27 04:42:12 frr bgpd[11923]: [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information
Aug 27 04:42:12 frr bgpd[11923]: message repeated 11 times: [ [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information]
Aug 27 04:42:12 frr zebra[11918]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 27 04:42:12 frr bgpd[11923]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 27 04:42:12 frr watchfrr[11902]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Aug 27 04:42:12 frr watchfrr[11902]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Aug 27 04:42:12 frr watchfrr[11902]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Aug 27 04:42:12 frr watchfrr[11902]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify

Using FRR 7.5.1, the logs say this:

Aug 30 17:00:44 frr watchfrr[19697]: watchfrr 7.5.1 starting: vty@0
Aug 30 17:00:44 frr watchfrr[19697]: zebra state -> down : initial connection attempt failed
Aug 30 17:00:44 frr watchfrr[19697]: bgpd state -> down : initial connection attempt failed
Aug 30 17:00:44 frr watchfrr[19697]: staticd state -> down : initial connection attempt failed
Aug 30 17:00:44 frr watchfrr[19697]: Forked background command [pid 19698]: /usr/lib/frr/watchfrr.sh restart all
Aug 30 17:00:44 frr zebra[19713]: client 30 says hello and bids fair to announce only vnc routes vrf=0
Aug 30 17:00:44 frr zebra[19713]: client 16 says hello and bids fair to announce only bgp routes vrf=0
Aug 30 17:00:44 frr zebra[19713]: client 37 says hello and bids fair to announce only static routes vrf=0
Aug 30 17:00:44 frr watchfrr[19697]: restart all process 19698 exited with non-zero status 2
Aug 30 17:00:45 frr bgpd[19720]: [EC 33554503] 192.168.1.254 unrecognized capability code: 70 - ignored
Aug 30 17:00:45 frr bgpd[19720]: %ADJCHANGE: neighbor 192.168.1.254(Unknown) in vrf default Up
Aug 30 17:00:45 frr bgpd[19720]: [EC 33554503] 2001:44b8:2134:800::ffff unrecognized capability code: 70 - ignored
Aug 30 17:00:45 frr bgpd[19720]: %ADJCHANGE: neighbor 2001:44b8:2134:800::ffff(Unknown) in vrf default Up
Aug 30 17:00:45 frr bgpd[19720]: bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 192.168.1.254 in vrf default
Aug 30 17:00:45 frr bgpd[19720]: bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2001:44b8:2134:800::ffff in vrf default
Aug 30 17:00:49 frr watchfrr[19697]: staticd state -> up : connect succeeded
Aug 30 17:00:49 frr watchfrr[19697]: bgpd state -> up : connect succeeded
Aug 30 17:00:49 frr watchfrr[19697]: zebra state -> up : connect succeeded
Aug 30 17:00:49 frr watchfrr[19697]: all daemons up, doing startup-complete notify

and the bgp view peering comes up successfully.

pfsinoz commented 3 years ago

Ooops, in my previous comment I mistakenly cut-paste the FRR 8.0 logs from a few days ago rather than the FRR 8.0.1 logs from yesterday. Here is what FRR 8.0.1 does. No change in the situation though.

Aug 30 16:32:41 frr watchfrr[17227]: [T83RR-8SM5G] watchfrr 8.0.1 starting: vty@0
Aug 30 16:32:41 frr watchfrr[17227]: [ZCJ3S-SPH5S] zebra state -> down : initial connection attempt failed
Aug 30 16:32:41 frr watchfrr[17227]: [ZCJ3S-SPH5S] bgpd state -> down : initial connection attempt failed
Aug 30 16:32:41 frr watchfrr[17227]: [ZCJ3S-SPH5S] staticd state -> down : initial connection attempt failed
Aug 30 16:32:41 frr watchfrr[17227]: [YFT0P-5Q5YX] Forked background command [pid 17228]: /usr/lib/frr/watchfrr.sh restart all
Aug 30 16:32:42 frr bgpd[17248]: [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information
Aug 30 16:32:42 frr bgpd[17248]: message repeated 11 times: [ [GQF43-N30BN] bgp_install_info_to_zebra: No zebra instance to talk to, not installing information]
Aug 30 16:32:42 frr zebra[17243]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 30 16:32:42 frr bgpd[17248]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Aug 30 16:32:42 frr watchfrr[17227]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Aug 30 16:32:42 frr watchfrr[17227]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Aug 30 16:32:42 frr watchfrr[17227]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Aug 30 16:32:42 frr watchfrr[17227]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
qlyoung commented 3 years ago

Can you enable BGP neighbor event debugs and post the logs?

pfsinoz commented 3 years ago

With debug bgp neighbor-events turned on (using FRRouting 8.1-dev), I reset the two neighbours in the view. The logs say this:

Sep  1 08:07:38 frr bgpd[21812]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] BGP_Stop (Connect->Idle), fd -1
Sep  1 08:07:38 frr bgpd[21812]: [T91AW-FGMHW] bgp_fsm_change_status : vrf Unknown(1), Status: Idle established_peers 0
Sep  1 08:07:38 frr bgpd[21812]: [ZQHFG-DQGX1] 192.168.1.254 went from Connect to Idle
Sep  1 08:07:38 frr bgpd[21812]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] BGP_Stop (Connect->Idle), fd -1
Sep  1 08:07:38 frr bgpd[21812]: [T91AW-FGMHW] bgp_fsm_change_status : vrf Unknown(1), Status: Idle established_peers 0
Sep  1 08:07:38 frr bgpd[21812]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Connect to Idle
Sep  1 08:07:39 frr bgpd[21812]: [ZQTB5-H8522] 192.168.1.254 [FSM] Timer (start timer expire).
Sep  1 08:07:39 frr bgpd[21812]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] BGP_Start (Idle->Connect), fd -1
Sep  1 08:07:39 frr bgpd[21812]: [T91AW-FGMHW] bgp_fsm_change_status : vrf Unknown(1), Status: Connect established_peers 0
Sep  1 08:07:39 frr bgpd[21812]: [ZQHFG-DQGX1] 192.168.1.254 went from Idle to Connect
Sep  1 08:07:39 frr bgpd[21812]: [ZQTB5-H8522] 2001:44b8:2134:800::ffff [FSM] Timer (start timer expire).
Sep  1 08:07:39 frr bgpd[21812]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] BGP_Start (Idle->Connect), fd -1
Sep  1 08:07:39 frr bgpd[21812]: [T91AW-FGMHW] bgp_fsm_change_status : vrf Unknown(1), Status: Connect established_peers 0
Sep  1 08:07:39 frr bgpd[21812]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Idle to Connect
idryzhov commented 3 years ago

Your logs from 8.0 look like you're running BGP in --no-zebra mode, while logs from 7.5 look like you're running BGP in regular mode with zebra connection. Is it possible that you somehow reconfigured BGP to run in --no-zebra mode while upgrading?

pfsinoz commented 3 years ago

With 8.1-dev running, this is the ps ax output:

philip@frr:~$ ps ax | grep frr
21791 ?        Ss     0:23 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
21807 ?        Ssl    0:09 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
21812 ?        Ssl    0:09 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
21819 ?        Ss     0:05 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

and systemctl claims this:

philip@frr:~$ systemctl status frr
● frr.service - FRRouting
   Loaded: loaded (/lib/systemd/system/frr.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-08-31 09:24:55 AEST; 1 day 12h ago
     Docs: https://frrouting.readthedocs.io/en/latest/setup.html
  Process: 21756 ExecStop=/usr/lib/frr/frrinit.sh stop (code=exited, status=0/SUCCESS)
  Process: 21781 ExecStart=/usr/lib/frr/frrinit.sh start (code=exited, status=0/SUCCESS)
 Main PID: 21791 (watchfrr)
   Status: "FRR Operational"
    Tasks: 14 (limit: 4915)
   CGroup: /system.slice/frr.service
           ├─21791 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
           ├─21807 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
           ├─21812 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
           └─21819 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

If I go back to 7.5.1 without changing anything else, the BGP sessions in the views come up. If I remove the view in 8.0 or 8.1-dev, using this config:

router bgp 131076
 bgp router-id 192.168.1.230
 bgp log-neighbor-changes
 no bgp ebgp-requires-policy
 no bgp suppress-duplicates
 no bgp default ipv4-unicast
 bgp deterministic-med
 timers bgp 15 45
 neighbor 192.168.1.254 remote-as 131076
 neighbor 192.168.1.254 description Cisco C1100 Border
 neighbor 2001:44b8:2134:800::ffff remote-as 131076
 neighbor 2001:44b8:2134:800::ffff description Cisco C1100 Border
<snip>

the BGP sessions come up straight away:

frr.pfs.name# sh bgp sum

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.1.230, local AS number 131076 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.1.254   4     131076         7         4        0    0    0 00:00:18            0        0 Cisco

Total number of neighbors 1

IPv6 Unicast Summary (VRF default):
BGP router identifier 192.168.1.230, local AS number 131076 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor                 V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
2001:44b8:2134:800::ffff 4     131076         6         4        0    0    0 00:00:18            0        0 Cisco

Total number of neighbors 1

and the debug event logs are rather long, showing the successful establishment of the BGP sessions:

Sep  1 21:48:59 frr watchfrr[24400]: [YFT0P-5Q5YX] Forked background command [pid 24401] : /usr/lib/frr/watchfrr.sh restart all
Sep  1 21:48:59 frr zebra[24416]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Sep  1 21:48:59 frr bgpd[24421]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Sep  1 21:48:59 frr watchfrr[24400]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Sep  1 21:48:59 frr watchfrr[24400]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Sep  1 21:48:59 frr watchfrr[24400]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Sep  1 21:48:59 frr watchfrr[24400]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
Sep  1 21:49:00 frr bgpd[24421]: [ZQTB5-H8522] 192.168.1.254 [FSM] Timer (start timer expire).
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] BGP_Start (Idle->Connect), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [T72VK-55DVG] 192.168.1.254 [FSM] Waiting for NHT
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 192.168.1.254 went from Idle to Connect
Sep  1 21:49:00 frr bgpd[24421]: [ZQTB5-H8522] 2001:44b8:2134:800::ffff [FSM] Timer (start timer expire).
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] BGP_Start (Idle->Connect), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [T72VK-55DVG] 2001:44b8:2134:800::ffff [FSM] Waiting for NHT
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Idle to Connect
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] TCP_connection_open_failed (Connect->Active), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 192.168.1.254 went from Connect to Active
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] TCP_connection_open_failed (Connect->Active), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Connect to Active
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] ConnectRetry_timer_expired (Active->Connect), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [Z195V-FNKRK] 192.168.1.254 [Event] Connect start to 192.168.1.254 fd 28
Sep  1 21:49:00 frr bgpd[24421]: [G0837-S7QES] 192.168.1.254 [FSM] Non blocking connect waiting result, fd 28
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 192.168.1.254 went from Active to Connect
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] ConnectRetry_timer_expired (Active->Connect), fd -1
Sep  1 21:49:00 frr bgpd[24421]: [Z195V-FNKRK] 2001:44b8:2134:800::ffff [Event] Connect start to 2001:44b8:2134:800::ffff fd 29
Sep  1 21:49:00 frr bgpd[24421]: [G0837-S7QES] 2001:44b8:2134:800::ffff [FSM] Non blocking connect waiting result, fd 29
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: Connect established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Active to Connect
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 192.168.1.254 [FSM] TCP_connection_open (Connect->OpenSent), fd 28
Sep  1 21:49:00 frr bgpd[24421]: [RWZTG-AA74G] 192.168.1.254 open active, local address 192.168.1.230
Sep  1 21:49:00 frr bgpd[24421]: [XKJ09-9VTZ7] 192.168.1.254 Sending hostname cap with hn = frr, dn = (null)
Sep  1 21:49:00 frr bgpd[24421]: [JFFAN-DEGED] 192.168.1.254 sending OPEN, version 4, my as 131076, holdtime 45, id 192.168.1.230
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 192.168.1.254 went from Connect to OpenSent
Sep  1 21:49:00 frr bgpd[24421]: [ZWCSR-M7FG9] 2001:44b8:2134:800::ffff [FSM] TCP_connection_open (Connect->OpenSent), fd 29
Sep  1 21:49:00 frr bgpd[24421]: [RWZTG-AA74G] 2001:44b8:2134:800::ffff open active, local address 2001:44b8:2134:800::230
Sep  1 21:49:00 frr bgpd[24421]: [XKJ09-9VTZ7] 2001:44b8:2134:800::ffff Sending hostname cap with hn = frr, dn = (null)
Sep  1 21:49:00 frr bgpd[24421]: [JFFAN-DEGED] 2001:44b8:2134:800::ffff sending OPEN, version 4, my as 131076, holdtime 45, id 192.168.1.230
Sep  1 21:49:00 frr bgpd[24421]: [T91AW-FGMHW] bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
Sep  1 21:49:00 frr bgpd[24421]: [ZQHFG-DQGX1] 2001:44b8:2134:800::ffff went from Connect to OpenSent
Sep  1 21:49:00 frr bgpd[24421]: [S2V7Q-ZTN00] 192.168.1.254 rcv OPEN, version 4, remote-as (in open) 23456, holdtime 45, id 59.167.217.120
<snip>

If I move back to using the view, no session established, and logs as I showed previously. Hope this helps!

idryzhov commented 3 years ago

Right, so you're running bgpd with -Z flag, which is no-zebra mode. So a couple of questions:

  1. Are you sure you're using this flag when running 7.5? Could you please check ps ax on 7.5?
  2. Are you using this flag on purpose? If not, I can suggest removing this flag from bgpd in /etc/frr/daemons and check if this helps.
pfsinoz commented 3 years ago

Heh, heh, I was unquestioningly following configuration examples from folks doing exactly the same thing when I set this up many months ago. ;-)

This is the production system running 7.5.1:

philip@frr:~$ ps ax | grep frr
    686 ?        S<s    3:35 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
    720 ?        S<sl   3:33 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
    739 ?        S<sl 825:04 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
    750 ?        S<s    0:41 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

The -Z flag is there, and all my 30 different views work! And -Z is in the /etc/frr/daemons file too.

I've removed the -Z flag from the /etc/frr/daemons now in the 8.1-dev testbed. And the sessions in the views are now up.

frr.pfs.name# sh bgp view C1100 sum

IPv4 Unicast Summary (VIEW C1100):
BGP router identifier 192.168.1.230, local AS number 131076 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.1.254   4     131076         4         2        0    0    0 00:00:08            0        0 Cisco

Total number of neighbors 1

IPv6 Unicast Summary (VIEW C1100):
BGP router identifier 192.168.1.230, local AS number 131076 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor                 V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
2001:44b8:2134:800::ffff 4     131076         4         2        0    0    0 00:00:08            0        0 Cisco

Total number of neighbors 1

Which then makes me wonder:

  1. Why do the views work fine with -Z in 7.5.1
  2. Why do the sessions without a view specified work (probably me not understanding what the no-zebra option is truly for)

I'm speculating a bug in the older versions that was fixed with the 8.0 release? Just curious, would love to know...

idryzhov commented 3 years ago
  1. I quickly looked at the code and I don't see any obvious difference between 7.5 and 8.0 when using this -Z flag. But obviously, something was changed in 8.0 that prevents it from working.
  2. The answer is very simple – the code treats BGP views differently when using -Z option. To be honest, the code looks like -Z was never supposed to work with views, so I don't really know why it worked for you in 7.5 :)

Finalizing the discussion: -Z is needed to allow BGP to work when you're not running zebra daemon. IIUC it's something legacy and not well-tested, and you're running FRR with zebra, so just don't use the flag and you should be fine :) In the meantime, we should find out why it was broken in 8.0, but not sure that it will happen soon because I don't think people use it often.

pfsinoz commented 3 years ago

Thank you so much for the help and clarification! Glad that this is all working again. Is it worth noting in the docs just to not use -Z at all (is there ever a need to not use zebra?), just in case it catches anyone else out in future? (I suspected not many would be using the bgp view feature - it's a fairly niche application.) Thank you again!

pfsinoz commented 3 years ago

Just tested the removal of -Z on the production system. The views fail to establish. Investigating, and will revert in a few hours.

donaldsharp commented 3 years ago

can we get a show bgp view <your view> ipv4 uni summ fail output?

donaldsharp commented 3 years ago

I think I figured it out. Please check out this PR https://github.com/FRRouting/frr/pull/9538

pfsinoz commented 3 years ago

On production system I now have:

frr.nsrc.org# sh ver
FRRouting 8.0.1 (frr.nsrc.org).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

and here is the show output:

frr.nsrc.org# sh bgp view I2 ipv4 sum fail

IPv4 Unicast Summary:
BGP router identifier 128.223.157.40, local AS number 65534 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        EstdCnt DropCnt ResetTime Reason
64.57.28.241          0       0     never No AFI/SAFI activated for peer

Total number of neighbors 1

with IPv6 showing similar:

frr.nsrc.org# sh bgp view I2 ipv6 sum fail

IPv6 Unicast Summary:
BGP router identifier 128.223.157.40, local AS number 65534 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        EstdCnt DropCnt ResetTime Reason
2001:468:2::1         0       0     never No AFI/SAFI activated for peer

Total number of neighbors 1

and this is the configuration used:

router bgp 65534 view I2
 bgp router-id 128.223.157.40
 bgp log-neighbor-changes
 no bgp suppress-duplicates
 no bgp default ipv4-unicast
 timers bgp 15 45
 neighbor 64.57.28.241 remote-as 11537
 neighbor 64.57.28.241 description I2
 neighbor 64.57.28.241 ebgp-multihop 255
 neighbor 2001:468:2::1 remote-as 11537
 neighbor 2001:468:2::1 description I2
 neighbor 2001:468:2::1 ebgp-multihop 255
 neighbor 2001:468:2::1 update-source 2607:8400:2880:4::80df:9d28
 !
 address-family ipv4 unicast
  neighbor 64.57.28.241 activate
  neighbor 64.57.28.241 prefix-list DENY-ALL out
  neighbor 64.57.28.241 route-map IMPORT in
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:468:2::1 activate
  neighbor 2001:468:2::1 prefix-list DENY-ALLv6 out
  neighbor 2001:468:2::1 route-map IMPORT in
 exit-address-family
!

I've got 30 views like this right now.

pfsinoz commented 3 years ago

I've now tried the code built after the commit of PR #9538 on the production system with its 30 views. Still no luck.

root@frr:/home/philip/frr-dev# apt list --installed | grep frr
frr-pythontools/now 8.1-dev-PR9538-g65d9b96-20210902.005749-1~ubuntu20.04.1 all [installed,local]
frr-rpki-rtrlib/now 8.1-dev-PR9538-g65d9b96-20210902.005749-1~ubuntu20.04.1 amd64 [installed,local]
frr/now 8.1-dev-PR9538-g65d9b96-20210902.005749-1~ubuntu20.04.1 amd64 [installed,local]

This is what is running:

 954247 ?        S<s    0:00 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
 954263 ?        S<sl   0:01 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
 954270 ?        R<sl   0:13 /usr/lib/frr/bgpd -d -F traditional -M rpki
 954280 ?        S<s    0:00 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

showing this version:

frr.nsrc.org# sh ver
FRRouting 8.1-dev (frr.nsrc.org).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

with the views still failing, for example:

frr.nsrc.org# sh bgp view I2 sum fail

IPv4 Unicast Summary (VIEW I2):
BGP router identifier 128.223.157.40, local AS number 65534 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        EstdCnt DropCnt ResetTime Reason
64.57.28.241          0       0     never No AFI/SAFI activated for peer

Displayed neighbors 1
Total number of neighbors 1

IPv6 Unicast Summary (VIEW I2):
BGP router identifier 128.223.157.40, local AS number 65534 vrf-id -1
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 723 KiB of memory

Neighbor        EstdCnt DropCnt ResetTime Reason
2001:468:2::1         0       0     never No AFI/SAFI activated for peer

Displayed neighbors 1
Total number of neighbors 1

Reverted to 7.5.1 on production system, and all is good again.

I went back to my testbed to try work out why that seemed to work after removing the -Z option. On there I only had one view set up, and the neighbouring router was on the same subnet as the FRR VM. I set up a second view to a router on a different subnet, and the new view failed to come up with 8.0.1 and the 8.1-dev-PR9538. I reverted to 7.5.1 on the testbed, and both views came up.

idryzhov commented 3 years ago

@pfsinoz PR https://github.com/FRRouting/frr/pull/9538 was updated and views should work correctly now both with and without -Z option. Would be great if you could check it out.

pfsinoz commented 3 years ago

Success! On my testbed system, using:

frr/now 8.1-dev-PR9538-g888e727-20210902.131948-1~ubuntu18.04.1 amd64 [installed,local]
frr-pythontools/now 8.1-dev-PR9538-g888e727-20210902.131948-1~ubuntu18.04.1 all [installed,local]
frr-rpki-rtrlib/now 8.1-dev-PR9538-g888e727-20210902.131948-1~ubuntu18.04.1 amd64 [installed,local]

the views all establish! 👍 So I've now deployed on the production system, and confirmed, the views all establish! 👍 Thank you so much!!!

I tried with and without -Z on the testbed system (success both times) - the production system has had the -Z removed as discussed earlier.

I'll read through the PR to see if I can understand what broke/needed to be fixed.

And I've just noticed that the RPKI validation state is now in the output (displaying like Cisco IOS):

BGP table version is 857454, local router ID is 128.223.157.40, vrf id 0
Default local pref 100, local AS 65534
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
V*> 1.0.0.0/24       103.245.243.1            0             0 17660 13335 i
V*> 1.0.4.0/22       103.245.243.1            0             0 17660 4826 38803 i
V*> 1.0.4.0/24       103.245.243.1            0             0 17660 4826 38803 i
V*> 1.0.5.0/24       103.245.243.1            0             0 17660 4826 38803 i
V*> 1.0.6.0/24       103.245.243.1            0             0 17660 4826 38803 i
V*> 1.0.7.0/24       103.245.243.1            0             0 17660 4826 38803 i
N*> 1.0.64.0/18      103.245.243.1            0             0 17660 2914 2516 7670 18144 i
<snip>

which is also very nice to see as well. 👍

idryzhov commented 3 years ago

Perfect! Thank you very much for the testing.