No routes advertised for bgp-graceful-restart-deferral-time if graceful-restart is enabled

camrossi commented 2 years ago

What happened? I configured kube-router to peer via eBGP with external switches. It takes the full configured bgp-graceful-restart-deferral-time before any routes are advertised to the peering switches. This happens for new installations (where the adjacency is coming up for the first time) and when restarting the kube-router pods during node maintenance.

What did you expect to happen? Routes should be advertised as soon as the BGP session is established. I tested with "pure" gobgp 3.3 and this is working as expected.

How can we reproduce the behavior you experienced?

Configure kube-router with:

- --bgp-graceful-restart=true

Disabling GR on kube-router results in the routes being advertised immediately.

Screenshots / Architecture Diagrams / Network Topologies

                    | -- .201 Router1
kube-router .192 -- |
                    | -- .202 Router2

I checked with a network trace and I can see the following (this is restarting with GR):

The BGP session comes up (here I only show Router1, .201).
The switch sends all of its routes to kube-router.
Packet 56 contains the MP_UNREACH_NLRI attribute with no withdrawn routes, i.e. the End-of-RIB marker as per RFC 4724.
At this point I would expect kube-router to send all of its routes, but it just waits ~300s and only then sends them.

System Information (please complete the following information):

Kube-Router Version (kube-router --version): Running kube-router version v1.5.0-8-g88266bc2, built on 2022-06-20T16:16:31+1000, go1.17.10

Kube-Router Parameters:

    - --run-router=true
    - --run-firewall=true
    - --run-service-proxy=true
    - --bgp-graceful-restart=true
    - --kubeconfig=/var/lib/kube-router/kubeconfig
    - --cluster-asn=65003
    - --advertise-external-ip
    - --advertise-loadbalancer-ip
    - --advertise-pod-cidr=true
    - --enable-ibgp=false
    - --enable-overlay=false
    - --enable-pod-egress=false
    - --override-nexthop=true

Kubernetes Version (kubectl version) : v1.23.4
Cloud Type: on premise
Kubernetes Deployment Type: Kubeadm
Kube-Router Deployment Type: DaemonSet
Cluster Size: 200 nodes currently (we are doing some scale testing), but I saw the same on a 3-node cluster

Additional context: I tested with gobgp 3.3.0 using this config (connecting to the same switches and to the same BGP process), and there the routes are advertised immediately, both for a new gobgp process and during GR:

[global.config]
  as = 65003
  router-id = "192.168.12.222"

[[neighbors]]
  [neighbors.config]
    neighbor-address = "192.168.12.201"
    peer-as = 65002
    auth-password = "123Cisco123"
  [neighbors.graceful-restart.config]
    enabled = true
    restart-time = 120
  [[neighbors.afi-safis]]
    [neighbors.afi-safis.config]
    afi-safi-name = "ipv4-unicast"
    [neighbors.afi-safis.mp-graceful-restart.config]
        enabled = true
[[neighbors]]
  [neighbors.config]
    neighbor-address = "192.168.12.202"
    peer-as = 65002
    auth-password = "123Cisco123"
  [neighbors.graceful-restart.config]
    enabled = true
    restart-time = 120
  [[neighbors.afi-safis]]
    [neighbors.afi-safis.config]
    afi-safi-name = "ipv4-unicast"
    [neighbors.afi-safis.mp-graceful-restart.config]
        enabled = true

aauren commented 2 years ago

Hmm... I can't seem to reproduce this locally, either with Juniper equipment or via FRR. Both the first time, and whenever I restart kube-router, I see the routes come in immediately without waiting for the graceful-restart time.

I'm going to assume that you have node annotations on your nodes describing the peer.ips and peer.asns? Because otherwise, I can't get your config to work at all because kube-router never even tries to establish a peering session. I had to execute the following commands:

kubectl annotate node kube-router-vm2 "kube-router.io/peer.ips=10.241.0.10"
kubectl annotate node kube-router-vm2 "kube-router.io/peer.asns=65004"

Once I did that, I had to execute a rollout restart so that the node annotations took effect. Unfortunately, kube-router doesn't yet watch nodes to catch these changes live.

One other thing I can think of: can you show me your graceful-restart settings from gobgp within the kube-router container? You should be able to do something like:

% kubectl exec -ti -n kube-system kube-router-2prwb -- /bin/bash
...
#gobgp n                                                                                                   
Peer           AS  Up/Down State       |#Received  Accepted                                                                                                                                                                                                          
10.241.0.10 65004 00:00:42 Establ      |        0         0
#gobgp n 10.241.0.10                                                                                       
BGP neighbor is 10.241.0.10, remote AS 65004                                                                                                                                                                                                                         
  BGP version 4, remote router ID 10.241.0.10                                                                                     
  BGP state = ESTABLISHED, up for 00:00:45
...
  Neighbor capabilities:                                                                                                          
    multiprotocol:                              
        ipv4-unicast:   advertised and received                                                                                   
        ipv6-unicast:   advertised and received                                                                                   
    route-refresh:      advertised and received                                                                                   
    extended-nexthop:   advertised                                                                                                                                                                                                                                   
        Local:  nlri: ipv4-unicast, nexthop: ipv6                                                                                 
    graceful-restart:   advertised and received
        Local: restart time 90 sec                                                                                                
            ipv4-unicast                                                                                                          
            ipv6-unicast                                                                                                          
        Remote: restart time 300 sec            
            ipv4-unicast, forward flag set                                                                                        
            ipv6-unicast, forward flag set
    4-octet-as: advertised and received
...

The key point is that, for graceful restart, your output should match these values essentially one for one. If any of them are missing, your remote is likely set up incorrectly in some way.
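One way to apply this check mechanically is to scan the `multiprotocol:` section of the `gobgp n <peer>` output for address families that are not "advertised and received". The following is a minimal Python sketch; the function name `one_sided_families` is hypothetical, and the parsing assumes the output layout shown above rather than any stable GoBGP output format.

```python
import re


def one_sided_families(neighbor_output):
    """Return families under 'multiprotocol:' that are not 'advertised and received'."""
    families = []
    in_multiprotocol = False
    for raw in neighbor_output.splitlines():
        line = raw.strip()
        if line == "multiprotocol:":
            in_multiprotocol = True
            continue
        if not in_multiprotocol:
            continue
        match = re.match(r"^(ipv[46]-\S+?):\s*(.+)$", line)
        if match is None:
            # The first non-family line ends the multiprotocol section.
            break
        name, status = match.group(1), match.group(2).strip()
        if status != "advertised and received":
            families.append(name)
    return families
```

An empty result means both sides agree on every listed family; any name returned is a family that only one side has enabled.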

My configs:

#kube-router -V
Running kube-router version v1.5.0, built on 2022-05-30T17:32:19+0000, go1.17.10

kube-router arguments:
    Args:                         
      --run-router=true          
      --run-firewall=true             
      --run-service-proxy=true     
      --bgp-graceful-restart=true                 
      --kubeconfig=/var/lib/kube-router/kubeconfig
      --runtime-endpoint=unix:///run/containerd/containerd.sock
      --cluster-asn=65003                                                                                                         
      --advertise-external-ip                                                                                                     
      --advertise-loadbalancer-ip                                                                                                 
      --advertise-pod-cidr=true                                                                                                   
      --enable-ibgp=false                                                                                                         
      --enable-overlay=false                                                                                                      
      --enable-pod-egress=false                                                                                                   
      --override-nexthop=true
      --service-external-ip-range=10.243.0.0/24

From FRR host:
# vtysh -c "show bgp detail"            
BGP table version is 9, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 65004
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.242.0.0/24    10.241.0.20             10             0 65003 i
*> 10.242.1.0/24    10.241.0.21             10             0 65003 i
*> 10.243.0.1/32    10.241.0.21             10             0 65003 i

Displayed  3 routes and 3 total paths

# ip route
default via 10.241.0.1 dev ens3 proto dhcp src 10.241.0.10 metric 100 
10.241.0.0/16 dev ens3 proto kernel scope link src 10.241.0.10 
10.241.0.1 dev ens3 proto dhcp scope link src 10.241.0.10 metric 100 
10.242.0.0/24 via 10.241.0.20 dev ens3 proto bgp metric 20 
10.242.1.0/24 via 10.241.0.21 dev ens3 proto bgp metric 20 
10.243.0.1 via 10.241.0.21 dev ens3 proto bgp metric 20

FRR Config:
# cat /etc/frr/frr.conf                 
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log
# in /var/log/frr/frr.log
# In FRR both ! and # are considered comment characters and can be treated the same
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Base Config for FRR as a whole         
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# Reflects defaults adhering mostly to IETF standards or common practices in wide-area internet routing
# (as opposed to datacenter which reflects a single administrative domain and uses aggressive timers)
frr defaults traditional
!
# Logs to syslog at an informational level
# (other values are: emergencies, alerts, critical, errors, warnings, notifications, informational, or debugging)
log syslog informational
!
# Puts all configuration into this single frr.conf file rather than having a separate config per daemon
service integrated-vtysh-config
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Basic BGP config to setup neighbors and peer groups
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
router bgp 65004
  # ID ourselves as our default IPv4 Address
  bgp router-id 10.241.0.10
!
  # Consider paths of equal AS_PATH length candidates for multipath computation (without this, the entire AS_PATH must
  # match for multipath computation)
  bgp bestpath as-path multipath-relax
  # Ensure that when comparing routes where both are equal on most metrics, that the tie is broken based on router ID
  bgp bestpath compare-routerid
!
  # Enable BGP Graceful Restart
  bgp graceful-restart
  bgp graceful-restart preserve-fw-state
  bgp graceful-restart restart-time 300
!
  # Setup peer groups
  neighbor kubepeers peer-group
  neighbor kubepeers remote-as 65003
!
  # Add peers
  neighbor 10.241.0.20   peer-group kubepeers
  neighbor 10.241.0.21   peer-group kubepeers
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Configure IPv4 family
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
address-family ipv4 unicast
  # Activate ipv4 for the kubepeers peer groups
  neighbor kubepeers activate
!
  # Setup this configuration as a route-server, see:
  # https://docs.frrouting.org/en/latest/bgp.html#configuring-frr-as-a-route-server
  neighbor kubepeers route-server-client
!
  # Filter imports & exports via route-map first
  neighbor kubepeers route-map IMPORTv4 in
  neighbor kubepeers route-map UNACCEPTED out
!
  # "import" and "export" are different than the normal "in" and "out" definitions that we normally see in policy
  # This is tied to route-server-client definition above
  neighbor kubepeers route-map IMPORTv4 import
  neighbor kubepeers route-map UNACCEPTED export
!
  # Allows us to generate inbound updates from a neighbor, change and activate BGP policies without clearing the BGP session
  neighbor kubepeers soft-reconfiguration inbound
exit-address-family
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Configure IPv6 family
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
address-family ipv6 unicast
  # Activate ipv6 for the kubepeers peer groups
  neighbor kubepeers activate
!
  # Setup this configuration as a route-server, see:
  # https://docs.frrouting.org/en/latest/bgp.html#configuring-frr-as-a-route-server
  neighbor kubepeers route-server-client
!
  # Filter imports & exports via route-map first
  neighbor kubepeers route-map IMPORTv6 in
  neighbor kubepeers route-map UNACCEPTED out
!
  # "import" and "export" are different than the normal "in" and "out" definitions that we normally see in policy
  # This is tied to route-server-client definition above
  neighbor kubepeers route-map IMPORTv6 import
  neighbor kubepeers route-map UNACCEPTED export
!
  # Allows us to generate inbound updates from a neighbor, change and activate BGP policies without clearing the BGP session
  neighbor kubepeers soft-reconfiguration inbound
exit-address-family
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Setup IP Prefix lists
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# Allow external IP range and allows /32 addresses to be specified
ip prefix-list pl-allowed-adv seq 5 permit 10.243.0.0/24 le 32
# Allow pod IP addresses and allows /24 addresses to be specified (which is the default from kube-controller-manager)
ip prefix-list pl-allowed-adv seq 10 permit 10.242.0.0/16 le 24
# Allow Cluster IP Addresses (from Kubernetes default range) and allows /32 addresses to be specified
# This is disabled for now, but in order for this to work, kube-router would need to be configured with: --advertise-cluster-ip
# ip prefix-list pl-allowed-adv seq 15 permit 10.96.0.0/12 le 32
# Deny all other BGP imports
ip prefix-list pl-allowed-adv seq 50 deny any
!
# Not exactly sure how to configure this just yet, but this is a rough attempt for IPv6 testing
ipv6 prefix-list pl-allowed-v6-adv seq 5 permit 2001:0DB8:0000::/48 le 64
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Setup Route Maps
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# Allows us to filter imports from the prefix-list
route-map IMPORTv4 permit 10
  match ip address prefix-list pl-allowed-adv
  set metric 10
!
route-map IMPORTv6 permit 10
  match ipv6 address prefix-list pl-allowed-v6-adv
  set metric 10
!
# Deny any export paths
route-map UNACCEPTED deny 1

camrossi commented 2 years ago

Thank you for the detailed reply @aauren!

Yes my nodes are annotated correctly:

kube-router.io/peer.asns: 65002,65002
kube-router.io/peer.ips: 192.168.12.203,192.168.12.204
kube-router.io/peer.passwords: MTIzQ2lzY28xMjM=,MTIzQ2lzY28xMjM=
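The peer.passwords values above appear to be the base64 encoding of the password used in the GoBGP config earlier in this thread, one entry per peer, comma-separated. A minimal Python sketch for producing and checking such a value follows; the helper names are hypothetical and not part of kube-router.

```python
import base64


def encode_peer_passwords(passwords):
    """Base64-encode each password and join them with commas, one entry per peer."""
    return ",".join(
        base64.b64encode(p.encode("utf-8")).decode("ascii") for p in passwords
    )


def decode_peer_passwords(annotation_value):
    """Reverse of encode_peer_passwords, useful for checking an existing annotation."""
    return [
        base64.b64decode(item).decode("utf-8")
        for item in annotation_value.split(",")
    ]
```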

I managed to recreate the issue with GoBGP: it's due to IPv6 being enabled on GoBGP but not on my routers. See https://github.com/osrg/gobgp/issues/2524

For example, with this config everything works perfectly fine. Note that under multiprotocol I only have ipv4-unicast:

BGP neighbor is 192.168.12.201, remote AS 65002
  BGP version 4, remote router ID 1.1.1.1
  BGP state = ESTABLISHED, up for 00:00:04
  BGP OutQ = 0, Flops = 0
  Hold time is 3, keepalive interval is 1 seconds
  Configured hold time is 90, keepalive interval is 30 seconds

  Neighbor capabilities:
    multiprotocol:
        ipv4-unicast:   advertised and received
    route-refresh:  advertised and received
    extended-nexthop:   advertised and received
        Local:  nlri: ipv4-unicast, nexthop: ipv6
        Remote: nlri: ipv4-unicast, nexthop: ipv6
    graceful-restart:   advertised and received
        Local: restart time 120 sec
        ipv4-unicast
    4-octet-as: advertised and received
    UnknownCapability(66):  received
    UnknownCapability(67):  received
    fqdn:   advertised
      Local:
         name: nkt-k8s-node, domain:
    cisco-route-refresh:    received
  Message statistics:
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:                1        201
    Keepalives:             5          6
    Route Refresh:          0          0
    Discarded:              0          0
    Total:                  7        208
  Route statistics:
    Advertised:             1
    Received:             200
    Accepted:             200

Restarting GoBGP causes no delay in advertising the routes, but the moment I configure GoBGP to do v6 as well, the issue happens:

BGP neighbor is 192.168.12.201, remote AS 65002
  BGP version 4, remote router ID 1.1.1.1
  BGP state = ESTABLISHED, up for 00:00:07
  BGP OutQ = 0, Flops = 0
  Hold time is 3, keepalive interval is 1 seconds
  Configured hold time is 90, keepalive interval is 30 seconds

  Neighbor capabilities:
    multiprotocol:
        ipv4-unicast:   advertised and received
        ipv6-unicast:   advertised <===== this is the issue

I have not configured v6 on my switches, and my K8s nodes are v4 only as well, so why does kube-router enable v6?

I tested by deleting the AfiSafiConfig for the Family_AFI_IP6 here and here

Now when kube-router comes up there is no more ipv6-unicast in the multiprotocol section and GR works just fine. I do not think this is a misconfiguration on my side, and I don't think that not configuring IPv6 on my routers should be an issue. kube-router should either not wait for the IPv6 MP_UNREACH_NLRI message (but this seems to be a gobgp issue) or just not configure IPv6 in the first place. Perhaps adding an --enable-ipv6 option would be an idea?
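To make the failure mode concrete, here is a minimal Python sketch of the behaviour described in this thread. It is a simplified model, not GoBGP's implementation: the function names are hypothetical, and the 300 s value simply mirrors the delay seen in the trace. The idea is that a restarting speaker defers advertising until it has an End-of-RIB marker for every family it is waiting on, or until the deferral timer expires.

```python
def pending_end_of_rib(waited_families, eor_received):
    """Families for which no End-of-RIB marker has arrived yet."""
    return set(waited_families) - set(eor_received)


def may_advertise(waited_families, eor_received, elapsed_s, deferral_time_s):
    """True once every awaited End-of-RIB arrived or the deferral timer expired."""
    if not pending_end_of_rib(waited_families, eor_received):
        return True
    return elapsed_s >= deferral_time_s


local_families = {"ipv4-unicast", "ipv6-unicast"}  # advertised by kube-router
peer_families = {"ipv4-unicast"}                   # supported by the switch
eor_from_peer = {"ipv4-unicast"}                   # the switch only sends End-of-RIB for IPv4

# Waiting on every locally advertised family: blocked until the timer expires.
ready_all_local = may_advertise(local_families, eor_from_peer, 5, 300)
# Waiting only on families both sides negotiated: ready immediately.
ready_negotiated = may_advertise(local_families & peer_families, eor_from_peer, 5, 300)
```

In this model, dropping ipv6-unicast from the local set (as in the test above) or waiting only on negotiated families both remove the delay.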

aauren commented 2 years ago

@camrossi I think that I agree with you. At least as the Network Routes Controller (NRC) is currently written, it is meant to work with IPv4 or IPv6 exclusively. As such, there shouldn't be any use case where both IPv6 and IPv4 peers are set at the same time. There is already a way of checking this in the NRC code via nrc.isIpv6(), so I created #1327 to address this issue.
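As an illustration of the idea only (this is not the code in #1327, and kube-router itself is written in Go), the selection could look like the following Python sketch. The helper name `bgp_families_for_node` and the example address are hypothetical.

```python
import ipaddress


def bgp_families_for_node(node_ip):
    """Pick a single address family to enable, based on the node's primary IP."""
    version = ipaddress.ip_address(node_ip).version
    return ["ipv6-unicast"] if version == 6 else ["ipv4-unicast"]
```

With an IPv4-only node such as 192.168.12.192 (an assumed example address), only ipv4-unicast would be configured, so no IPv6 capability is advertised to peers that do not support it.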

camrossi commented 2 years ago

Thank you, I will test the fix today!

camrossi commented 2 years ago

Just tested from your fork and it works perfectly!

cloudnativelabs / kube-router

No routes advertised for bgp-graceful-restart-deferral-time if graceful-restart is enabled #1323