v1.5.1 does not advertise externalIP addresses to external bgp peers

rkojedzinszky commented 2 years ago

What happened? Upgraded kube-router from v1.5.0 to v1.5.1

What did you expect to happen? Expected the externalIP services to be advertised as before

How can we reproduce the behavior you experienced? Steps to reproduce the behavior:

Set up services with externalIPs
Run kube-router v1.5.0 with --advertise-external-ip=true
Set up an external peer with node annotations
Observe the advertised address on external router
Upgrade kube-router to v1.5.1
Observe the advertisement is missing

Screenshots / Architecture Diagrams / Network Topologies

Three nodes are annotated to set up kube-router with external bgp peers.

System Information (please complete the following information):

Kube-Router Parameters:

  - args:
    - --run-router=true
    - --run-firewall=true
    - --run-service-proxy=true
    - --bgp-graceful-restart=true
    - --bgp-graceful-restart-deferral-time=30s
    - --advertise-pod-cidr=false
    - --advertise-external-ip=true
    - --kubeconfig=/var/lib/kube-router/kubeconfig
    - --iptables-sync-period=24h
    - --ipvs-sync-period=24h
    - --routes-sync-period=24h
    - --auto-mtu=false

Kubernetes Version (kubectl version) : 1.25.3
Cloud Type: bare metal
Kubernetes Deployment Type: kubeadm
Kube-Router Deployment Type: daemonset
Cluster Size: 8 nodes

Additional context BGP information with v1.5.0 on the upstream router:

#sh ip bgp summary 
BGP router identifier 192.168.1.1, local AS number 64640
BGP table version is 158, main routing table version 158
3 network entries using 408 bytes of memory
5 path entries using 280 bytes of memory
2/2 BGP path/bestpath attribute entries using 256 bytes of memory
1 BGP AS-PATH entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 968 total bytes of memory
BGP activity 41/38 prefixes, 150/145 paths, scan interval 60 secs

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
...
192.168.8.1     4        64512      18       8      158    0    0 00:01:54        1
192.168.8.2     4        64512      18       6      158    0    0 00:01:51        1
192.168.8.4     4        64512       6       8      158    0    0 00:01:54        1
#sh ip bgp         
BGP table version is 158, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
...
*  192.168.8.192/32 192.168.8.2                            0 64512 i
*                   192.168.8.1                            0 64512 i
*>                  192.168.8.4                            0 64512 i

You can see that the advertised address is 192.168.8.192 and being advertised by 3 nodes.

After upgrading kube-router to v1.5.1, bgp is established, but the prefix is not advertised:

#sh ip bgp summary 
BGP router identifier 192.168.1.1, local AS number 64640
BGP table version is 160, main routing table version 160
2 network entries using 272 bytes of memory
2 path entries using 112 bytes of memory
1/1 BGP path/bestpath attribute entries using 128 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 512 total bytes of memory
BGP activity 41/39 prefixes, 150/148 paths, scan interval 60 secs

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
...
192.168.8.1     4        64512       6       4      160    0    0 00:00:49        0
192.168.8.2     4        64512       5       4      160    0    0 00:00:52        0
192.168.8.4     4        64512       3       5      160    0    0 00:00:53        0
#sh ip bgp         
BGP table version is 160, local router ID is 192.168.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
...

You can see that the graceful restart deferral time (30 seconds on command line) has passed already, and 192.168.8.192 still not advertised. With v1.5.0 the advertisement appears exactly after 30 seconds.

rkojedzinszky commented 2 years ago

It seems that thie behavior is only present when graceful-restart feature is turned on. Howewer, I really like this feature, as this makes kube-router upgrades possible without service disruption.

aauren commented 2 years ago

@rkojedzinszky the only thing that changed for BGP between the two releases is https://github.com/cloudnativelabs/kube-router/pull/1327

Previously, we were setting graceful restart on the afi-safi config for both IPv4 & IPv6 no matter whether IPv6 was enabled or not. After #1323 was raised we decided to shift it to only advertise the family that was actually being used. This likely means that your BGP receiver (192.168.1.1) needs to have its BGP configuration updated to only declare graceful-restart on the IPv4 afi-safi configuration.

Another thing to look into would be your graceful-restart configuration on your BGP receiver in general. If graceful-restart is setup correctly, it should not be waiting the full 30 seconds for the stale routes to clear. Essentially, when kube-router comes back up, gobgp is going to peer with your receiver and advertise it's routes again. Once gobgp is sure that it has sent the most current state it will send an end of RIB marker which should cause your receiver to un-mark the previous routes as stale and update it's table. If this is not happening, then it was likely that your afi-safi config already had some problem with it already.

In any case, I'm not able to reproduce this locally with FRR configured correctly.

Here is my node configuration:

$ kubectl get nodes -o json | jq -r '.items[] | .metadata | .name + " - " + .annotations."kube-router.io/peer.ips" + " - " + .annotations."kube-router.io/peer.asns"'
kube-router-vm1 - 10.241.0.10 - 4200000001
kube-router-vm2 - 10.241.0.10 - 4200000001

Here is the relevant portions of my kube-router config:

$ kubectl get daemonset -n kube-system kube-router -o yaml
apiVersion: apps/v1                   
kind: DaemonSet          
metadata:                        
  ...
  name: kube-router
  namespace: kube-system  
spec:
  selector:     
    matchLabels:                
      k8s-app: kube-router  
      tier: node          
  template:             
    metadata:    
      creationTimestamp: null    
      labels:               
        k8s-app: kube-router
        tier: node
    spec:                                      
      containers: 
      - args:                
        - --run-router=true
        - --run-firewall=true
        - --run-service-proxy=true
        - --bgp-graceful-restart=true
        - --kubeconfig=/var/lib/kube-router/kubeconfig
        - --runtime-endpoint=unix:///run/containerd/containerd.sock
        - --cluster-asn=4200000001
        - --service-cluster-ip-range=10.96.0.0/16
        - --service-external-ip-range=10.243.0.0/24
        - --advertise-external-ip=true
        - -v=1  
        env:           
        - name: NODE_NAME  
          valueFrom:                       
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: KUBE_ROUTER_CNI_CONF_FILE
          value: /etc/cni/net.d/10-kuberouter.conflist
        image: docker.io/cloudnativelabs/kube-router:v1.5.0

Then I setup a loop to watch FRR's BGP as I updated from v1.5.0 to v1.5.1:

$ while :; do date; vtysh -c "show bgp detail"; echo; echo; sleep 5; done
Fri Oct 28 13:49:30 UTC 2022                                             
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,                                                                    
               i internal, r RIB-failure, S Stale, R Removed                                                                                       
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self 
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.242.0.0/24    10.241.0.20             10    100      0 i
*>i10.242.1.0/24    10.241.0.21             10    100      0 i                                                                                     
*>i10.243.0.1/32    10.241.0.20             10    100      0 i
*=i                 10.241.0.21             10    100      0 i      

Displayed  3 routes and 4 total paths           

Fri Oct 28 13:49:35 UTC 2022         
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.242.0.0/24    10.241.0.20             10    100      0 i
S>i10.242.1.0/24    10.241.0.21             10    100      0 i
*>i10.243.0.1/32    10.241.0.20             10    100      0 i
S=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

Fri Oct 28 13:49:40 UTC 2022
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
S>i10.242.0.0/24    10.241.0.20             10    100      0 i
S>i10.242.1.0/24    10.241.0.21             10    100      0 i
S>i10.243.0.1/32    10.241.0.20             10    100      0 i
S=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

Fri Oct 28 13:49:45 UTC 2022
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
S>i10.242.0.0/24    10.241.0.20             10    100      0 i
S>i10.242.1.0/24    10.241.0.21             10    100      0 i
S>i10.243.0.1/32    10.241.0.20             10    100      0 i
S=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

Fri Oct 28 13:49:51 UTC 2022
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
S>i10.242.0.0/24    10.241.0.20             10    100      0 i
*>i10.242.1.0/24    10.241.0.21             10    100      0 i
S>i10.243.0.1/32    10.241.0.20             10    100      0 i
*=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

Fri Oct 28 13:49:56 UTC 2022
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
S>i10.242.0.0/24    10.241.0.20             10    100      0 i
*>i10.242.1.0/24    10.241.0.21             10    100      0 i
S>i10.243.0.1/32    10.241.0.20             10    100      0 i
*=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

Fri Oct 28 13:50:01 UTC 2022
BGP table version is 4, local router ID is 10.241.0.10, vrf id 0
Default local pref 100, local AS 4200000001
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.242.0.0/24    10.241.0.20             10    100      0 i
*>i10.242.1.0/24    10.241.0.21             10    100      0 i
*>i10.243.0.1/32    10.241.0.20             10    100      0 i
*=i                 10.241.0.21             10    100      0 i

Displayed  3 routes and 4 total paths

You can see above, that when I change to v1.5.1 and as kube-router is restarted across the 2 nodes, all of the routes go stale, but within ~20 seconds the first node's VIPs come back and after 10 more seconds, all of them are received and marked as current.

Here is the template that I use for my FRR configuration for reference: https://github.com/aauren/kube-router-automation/blob/main/ansible/playbooks/roles/bgp_router/templates/frr.conf.j2

rkojedzinszky commented 2 years ago

@aauren

Thanks for your quick response!

You are right, my bgp receiver is not configured for graceful restart. Howewer, graceful restart is just a capability, for correct operation that is not needed to have both sides the same configuration regarding this setting.

My use case is:

I want to be able to upgrade kube-router without service disruption. The disruption here means inter-pod communication in the k8s cluster. Not with external peers. Thats why I have multiple nodes set up to peer with uplink router. When I intentionally reboot a node, bgp tcp connection will close, and that will immediately cease routes towards that node. If graceful restart was configured on the uplink router too, then nothing would change on the upstream router. So I use graceful shutdown inside the cluster, but dont want to use it outside.

Is there something wrong with this setup?

aauren commented 2 years ago

In general I would say nothing is wrong with the setup. My understanding is that a proper BGP peer should just ignore the graceful of restart option if it isn't configured for it even if the other peer advertises the capability. However I've never tested it myself.

However, given that graceful restart is the only thing that has changed between v1.5.0 and v1.5.1 and after making the configuration more proper on the kube-router side your router no longer appears to be receiving routes, I would say that something about either your bgp implementation or configuration is touchy about graceful restart options. Maybe try configuring it as a test and see if you still get routes? Or maybe experiment with a different speaker like gobgp or frr?

With the tools that I have, I'm not able to reproduce any issues so you're probably going to have to dive in some more.

rkojedzinszky commented 2 years ago

@aauren

It seems that I could reproduce it with frr too.

I'll paste my all relevant configuration here:

frr running-config:

node-1# sh running-config 
Building configuration...

Current configuration:
!
frr version 7.5.1
frr defaults traditional
hostname node-1
log syslog informational
no ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 10
 no bgp ebgp-requires-policy
 bgp graceful-restart-disable
 neighbor 192.168.111.20 remote-as 64512
!
line vty
!
end
node-1# sh ip bgp summary 

IPv4 Unicast Summary:
BGP router identifier 192.168.111.21, local AS number 10 vrf-id 0
BGP table version 6
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.111.20  4      64512        75        82        0    0    0 00:01:32            0        0

Total number of neighbors 1

kube-router v1.5.1 cli args:

      - args:
        - --run-router=true
        - --run-firewall=true
        - --run-service-proxy=false
        - --bgp-graceful-restart=true
        - --bgp-graceful-restart-deferral-time=10s
        - --advertise-pod-cidr=false
        - --advertise-external-ip=true

The only k8s node annotations:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kube-router.io/peer.asns: "10"
    kube-router.io/peer.ips: 192.168.111.21
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/crio/crio.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"

The service:

root@node-0:~# kubectl get service nginx -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-10-30T06:45:43Z"
  labels:
    app: nginx
  name: nginx
  namespace: default
  resourceVersion: "2078"
  uid: 9e82bfd3-8988-4af0-8a94-3afcce3742c7
spec:
  clusterIP: 10.107.22.10
  clusterIPs:
  - 10.107.22.10
  externalIPs:
  - 1.1.1.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: "80"
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

With v1.5.0 frr receives the prefix, while with v1.5.1 it does not.

Please let me know what additional information you require. Perhaps it is a bug in gobgp?

aauren commented 2 years ago

@rkojedzinszky I was able to reproduce what you mentioned in your comment.

I tested the following situations:

kube-router advertises graceful restart, FRR advertises graceful restart - after upgrade to 1.5.1 routes are received
kube-router advertises graceful restart, FRR does not advertise graceful restart - after upgrade to 1.5.1 routes are not received
kube-router advertises graceful restart, FRR does not advertise graceful restart - after upgrade to 1.5.1 and restarting FRR routes are received

As best as I can tell, there is nothing wrong with the sessions that are established between FRR and kube-router in any of the above scenarios, they always come up and establish without any errors.

However, I think that you're being caught in some BGP implementation weirdness. Essentially, in 1.5.0 kube-router was setting graceful restart for the ipv4 and ipv6 families, and on your router you received them, but graceful restart wasn't enabled. I believe that this means that while it expects EOR markers to be communicated for both families, it may not have implemented any of the timers associated with graceful restart.

Then when you updated kube-router to 1.5.1 it stopped advertising the afi-safi for ipv6 entirely, including the graceful restart capabilities and the EOR markers. My best guess is that the implementation in FRR and your router is essentially now in the state that while it is able to establish a BGP session it is endlessly waiting for the EOR or the restart flag to be set on the non-existent afi-safi for IPv6 that will never be sent.

That's just a guess at this point though, I'm definitely not a BGP protocol expert.

To be honest, I think that your scenario is kinda an edge case. Where you want graceful restart within the cluster but purposefully don't enable it from your external peers. While it's unfortunate that this happened, I'm not aware of anything that kube-router or any of the BGP implementaitons could do differently. When reviewing the RFC it doesn't appear to take into account that a peer might announce the graceful restart capability to a non-capable receiver and then no longer advertise a given afi-safi that had previously been graceful-restart enabled. So likely the implementation in various BGP speakers differs on how they handle it.

If your router's BGP implementation is similar to FRR, then restarting the BGP service or the router itself should resolve the issue as that will make it forget the peers previous graceful-restart state for the IPv6 afi-safi. If that does work, that's likely a livable outcome for this as kube-router should never go back to announcing the IPv6 afi-safi unless you specifically enable it in the future. Meaning that, while unfortunate, this is a one time thing.

In any case, I think that this is likely as far as I can go with this issue. I don't know enough about BGP's graceful-restart protocol to authoritatively say whether any of the BGP implementations are broken in how they handle this edge case and I don't see any way that we could have prevented it on the kube-router side.

If you find any more specific details about how kube-router could change it's configuration to avoid this scenario, feel free to add a comment with some implementation specifics or a PR and I'll look into it some more.

rkojedzinszky commented 2 years ago

@aauren Thanks for your tests too.

Unfortunately, while restarting frr on the remote side solves the problem, after restarting kube-router the same happens again. So, once kube-router starts later than frr, I still encounter the problem. While my case seems to be an edge case, I must say that with v1.5.0 this worked, and since 1.5.1 it does not.

Also please note that even in frr and in a cisco router, one can only enable graceful restart globally (not for specific ip families).

rkojedzinszky commented 2 years ago

Also I've tested the same with my uplink router and with frr.

My upstream router has graceful-restart disabled, while frr has it enabled (the default). Once I restart frr (as would be the case when restarting kube-router), bgp session get established, and prefix being advertised/received by upstream router.

So again, I ended up where kube-router behaves differently.

aauren commented 2 years ago

While that's true, that seems more or less a client implementation problem. The RFC clearly allows for the capability of it to be set per address family, it seems that Cisco and FRR have just chosen not to implement it that way or to give those configuration capabilities.

I'm definitely open to suggestions on ways that kube-router might be able to approach this use-case better, but just going back to announcing an AFI-SAFI that we don't actually use (like we did in v1.5.0) doesn't seem like the right thing to do here and it caused issues for other client implementations.

Let me know if you have any suggestions.

rkojedzinszky commented 2 years ago

I have another test case:

When my configuration is in this state (where no routes are advertised to upstream) and I create a new service with an externalIP, that prefix gets advertised well. But the already existing one does not. Does this help anything?

rkojedzinszky commented 2 years ago

While that's true, that seems more or less a client implementation problem. The RFC clearly allows for the capability of it to be set per address family, it seems that Cisco and FRR have just chosen not to implement it that way or to give those configuration capabilities.

My problem here is that even with a fresh frr as upstrem I cannot get the expected operation.

I'm definitely open to suggestions on ways that kube-router might be able to approach this use-case better, but just going back to announcing an AFI-SAFI that we don't actually use (like we did in v1.5.0) doesn't seem like the right thing to do here and it caused issues for other client implementations.

Let me know if you have any suggestions.

I think that the fact, that we get different results based on which end gets restarted means that there is incosistent operation.

rkojedzinszky commented 2 years ago

While that's true, that seems more or less a client implementation problem. The RFC clearly allows for the capability of it to be set per address family, it seems that Cisco and FRR have just chosen not to implement it that way or to give those configuration capabilities.

I'm definitely open to suggestions on ways that kube-router might be able to approach this use-case better, but just going back to announcing an AFI-SAFI that we don't actually use (like we did in v1.5.0) doesn't seem like the right thing to do here and it caused issues for other client implementations.

Let me know if you have any suggestions.

Should we involve gobgp developers here?

aauren commented 2 years ago

Should we involve gobgp developers here?

Sure you could file an issue upstream if you want and see if they have any thoughts.

rkojedzinszky commented 2 years ago

@aauren

I think that this is a bug in gobgp, and the latest change in kube-router (#1327) regarding graceful-restart did a workaround for cases when both ends agreed on graceful restart capability. Howewer, the bug is still there, and now we get a different (worse) symptom.

aauren commented 2 years ago

Interesting... Hopefully something will come out of https://github.com/osrg/gobgp/issues/2596 to maybe help expose what's going on

cloudnativelabs / kube-router

v1.5.1 does not advertise externalIP addresses to external bgp peers #1389