SovereignCloudStack / issues

This repository is used for issues that are cross-repository or not bound to a specific repository.
https://github.com/orgs/SovereignCloudStack/projects/6

SONiC - Integrate FRR Management Framework in SONiC to eliminate split configuration #719

Closed matofeder closed 3 weeks ago

matofeder commented 2 months ago

SONiC currently uses a split configuration: SONiC settings are managed in config_db.json, while routing is configured separately in frr.conf.

The split is usually needed because the default bgpcfgd configuration tool is too limited to express the required routing configuration.

The frr-mgmt-framework offers a more comprehensive solution by supporting BGP, OSPF, STATIC, IGMP, PIM, VRFs, and BGP EVPN within a single integrated configuration mechanism.

To enable this, add "frr_mgmt_framework_config": "true" to DEVICE_METADATA.
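As a minimal sketch, the flag could be merged into an existing Config DB dictionary as shown here. The `localhost` key under DEVICE_METADATA and the helper name `enable_frr_mgmt_framework` are assumptions for illustration and should be checked against the target image.

```python
import json


def enable_frr_mgmt_framework(config: dict) -> dict:
    """Return a copy of a Config DB dict with the FRR management framework flag set."""
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    # Assumption: device-wide settings live under DEVICE_METADATA -> "localhost".
    metadata = updated.setdefault("DEVICE_METADATA", {}).setdefault("localhost", {})
    metadata["frr_mgmt_framework_config"] = "true"
    return updated


if __name__ == "__main__":
    sample = {"DEVICE_METADATA": {"localhost": {"hostname": "st01-sw1g-r01-u42"}}}
    print(json.dumps(enable_frr_mgmt_framework(sample), indent=2))
```

The function leaves the input untouched, so the result can be compared with the original before it is written back to config_db.json.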

Clarify whether the frr-mgmt-framework supports all necessary configuration options required to achieve the desired L3 underlay configuration for SCS.

matofeder commented 1 month ago

The approach to utilizing a unified FRR mgmt interface and eliminating SONiC's split configuration has several main issues:

1. The available Edge-core SONiC images (SONiC.202211, SONiC.202111, SONiC.202012, SONiC.202006, SONiC.201911) do not include bug fix #13109 for frrcfgd.

This fix is available only in the Edgecore SONiC branches: 202305, 202311, 202311.0, 202311.X, master, and pre_202305.

Without this fix, the FRR management framework does not behave as expected. Specifically, frrcfgd fails to interpret the Config DB BGP entries correctly, leading to errors such as:

Sep 11 09:53:08.188862 st01-sw1g-r01-u42 INFO bgp#frrcfgd: value for table BGP_PEER_GROUP prefix default key LEAF changed to {'admin_status': (true, ADD), 'asn': (65501, ADD), 'peer_type': (external, ADD)}
Sep 11 09:53:08.190100 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: execute command vtysh -c 'configure terminal' -c 'router bgp 65000 vrf default' -c 'neighbor LEAF remote-as 65501' for table BGP_PEER_GROUP.
Sep 11 09:53:08.190100 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: VTYSH CMD: configure terminal daemons: ['bgpd']
Sep 11 09:53:08.190100 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: VTYSH CMD: router bgp 65000 vrf default daemons: ['bgpd']
Sep 11 09:53:08.190100 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: VTYSH CMD: neighbor LEAF remote-as 65501 daemons: ['bgpd']
Sep 11 09:53:08.190132 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: [bgpd] command return code: 13
Sep 11 09:53:08.190132 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: % Create the peer-group or interface first
Sep 11 09:53:08.190147 st01-sw1g-r01-u42 DEBUG bgp#frrcfgd: VTYSH CMD: end daemons: ['bgpd']
Sep 11 09:53:08.190174 st01-sw1g-r01-u42 ERR bgp#frrcfgd: command execution failure. Command: "vtysh -c 'configure terminal' -c 'router bgp 65000 vrf default' -c 'neighbor LEAF remote-as 65501'"
Sep 11 09:53:08.190174 st01-sw1g-r01-u42 ERR bgp#frrcfgd: failed running FRR command: neighbor LEAF remote-as 65501

In this case, frrcfgd recognizes the BGP_PEER_GROUP, but it fails to translate it into proper FRR-BGP CLI commands.
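The FRR error in the log ("Create the peer-group or interface first") indicates that `neighbor LEAF remote-as 65501` was sent before the peer group LEAF existed. The sketch below only illustrates the command ordering FRR expects; it is not the code of fix #13109, and the helper names `peer_group_commands` and `to_vtysh` are hypothetical.

```python
from typing import List


def peer_group_commands(local_asn: str, vrf: str, name: str, remote_as: str) -> List[str]:
    """Build FRR CLI lines that create a peer group before configuring it."""
    return [
        "configure terminal",
        f"router bgp {local_asn} vrf {vrf}",
        f"neighbor {name} peer-group",  # declare the peer group first
        f"neighbor {name} remote-as {remote_as}",  # then set its remote AS
    ]


def to_vtysh(commands: List[str]) -> str:
    """Join CLI lines into one vtysh invocation, in the style seen in the log."""
    return "vtysh " + " ".join(f"-c '{cmd}'" for cmd in commands)


if __name__ == "__main__":
    print(to_vtysh(peer_group_commands("65000", "default", "LEAF", "65501")))
```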

After manually applying the fix to the SONiC.202211 build, the FRR management framework functioned as expected, enabling the unified FRR configuration to work properly.

What should we do next with the Edge-core SONiC images?

2. FRR mgmt interface (frrcfgd) does not support all FRR configuration options

matofeder commented 1 month ago

After investigating whether FRR unified management is compatible with the BGP configuration required by the SCS hardware landscape, and whether it is suitable for configuring L3 BGP underlay networking with features such as BGP unnumbered on Edge-Core enterprise SONiC, I can conclude the following:

The known issues described above, 2a and 2b, affect both Edge-Core SONiC and community SONiC. Issue 1 affects only Edge-Core enterprise SONiC.

For enterprise SONiC Edge-Core, issue 1 is blocking, meaning we have to wait until Edge-Core releases a SONiC build with the fix or build it ourselves.

The other two issues, 2a and 2b, affect both distributions. IMO they are not blocking, but they introduce unexpected system behavior, which could significantly reduce our ability to debug further issues and complicate overall network maintenance.

Based on the above, my recommendation is not to use FRR SONiC unified configuration, for now, and instead focus on contributing upstream so that this becomes feasible soon (though it’s uncertain how or whether we can influence the enterprise SONiC distribution).

@scoopex fyi

matofeder commented 1 month ago

Test with Stordis SONiC release 4.4.0 (sonic-broadcom-enterprise-base-4-4-0.bin)

1. TL;DR: frrcfgd works (bug #13109 does not seem to be an issue)

It seems that Stordis SONiC release 4.4.0 does not include bug fix #13109

$ docker exec -it bgp sh -c "cat /usr/local/lib/python3.9/dist-packages/frrcfgd/frrcfgd.py | grep listen_thread | wc -l"
0

But the frrcfgd.py code differs from the community and Edge-core versions. The community and Edge-core frrcfgd.py script contains 3832 LOC, while the Stordis one contains 5553 LOC (so evidently some logic has been added to the Stordis version).

Test with the following FRR unified config:

{
    "BGP_GLOBALS": {
        "default": {
            "local_asn": "65000",
            "router_id": "10.0.1.2"
        }
    },
    "BGP_PEER_GROUP": {
        "default|LEAF": {
            "admin_status": "true",
            "asn": "65501",
            "peer_type": "external"
        }
    }
}

Apply it:

$ config load frr.conf  -y
Running command: /usr/local/bin/db_migrator.py -o check_version -f frr.conf
Running command: /usr/local/bin/sonic-cfggen -j frr.conf --write-to-db

Check the result:

$ show runningconfiguration bgp
Building configuration...

Current configuration:
!
frr version 8.2.2
frr defaults traditional
hostname st01-sw1g-r01-u42
log syslog informational
log facility local4
agentx
service integrated-vtysh-config
!
password zebra
enable password zebra
!
router bgp 65000
 bgp router-id 10.0.1.2
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 neighbor LEAF peer-group
 neighbor LEAF remote-as external
 !
 address-family ipv4 unicast
  maximum-paths 1
  maximum-paths ibgp 1
 exit-address-family
 !
 address-family ipv6 unicast
  maximum-paths 1
  maximum-paths ibgp 1
 exit-address-family
exit
!
end

It works!
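A check like this can be automated. The following is a minimal sketch that scans saved `show runningconfiguration bgp` output for the peer-group declaration; the helper name `peer_group_defined` is hypothetical, and the sample text is an excerpt of the output above.

```python
import re


def peer_group_defined(running_config: str, name: str) -> bool:
    """Return True if the FRR running configuration declares the peer group."""
    pattern = rf"^\s*neighbor {re.escape(name)} peer-group\s*$"
    return re.search(pattern, running_config, flags=re.MULTILINE) is not None


if __name__ == "__main__":
    excerpt = (
        "router bgp 65000\n"
        " bgp router-id 10.0.1.2\n"
        " neighbor LEAF peer-group\n"
        " neighbor LEAF remote-as external\n"
        "exit\n"
    )
    print(peer_group_defined(excerpt, "LEAF"))
```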

2a. It seems that Stordis SONiC release 4.4.0 contains some set src directives:

$ docker exec -it bgp sh -c "cat /usr/local/lib/python3.9/dist-packages/frrcfgd/frrcfgd.py | grep 'set src'"
                    cmds = ["vtysh -c 'configure terminal' -c 'route-map %s permit 10' -c 'set src %s'" % (rm_name, addr_list[0]),

However, route_map_key_map does not include a set src option (at least in the main frrcfgd.py script).

There may be some way to configure the route-map set src option, but it is not clear from the source code how.

@scoopex are you aware of any documentation of frrcfgd for Stordis SONiC?

2b. Testing the show ip interface and show ipv6 interface commands shows that they work as expected with Stordis SONiC

See the following FRR config:

{
    "BGP_GLOBALS": {
        "default": {
            "local_asn": "65000",
            "router_id": "10.0.1.2"
        }
    },
    "BGP_PEER_GROUP": {
        "default|LEAF": {
            "admin_status": "true",
            "asn": "65501",
            "peer_type": "external"
        }
    },
    "BGP_NEIGHBOR": {
        "default|Ethernet32": {
            "peer_group_name": "LEAF"
        }
    }
}

The community and Edge-core SONiC failed with the error described here. The Stordis SONiC, however, works like a charm.
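Before loading such a config, it can help to confirm that every BGP_NEIGHBOR entry points at a peer group that is actually defined. This is a minimal sanity-check sketch, not something frrcfgd is known to do; the helper name `missing_peer_groups` is hypothetical, and it assumes the peer group lives in the same VRF as the neighbor, following the "vrf|name" key form used above.

```python
from typing import List, Tuple


def missing_peer_groups(config: dict) -> List[Tuple[str, str]]:
    """List (neighbor_key, peer_group_key) pairs whose peer group is not defined."""
    groups = config.get("BGP_PEER_GROUP", {})
    missing = []
    for key, attrs in config.get("BGP_NEIGHBOR", {}).items():
        group = attrs.get("peer_group_name")
        if group is None:
            continue  # neighbor does not use a peer group
        vrf = key.split("|", 1)[0]  # assumption: keys follow the "vrf|name" form
        group_key = f"{vrf}|{group}"
        if group_key not in groups:
            missing.append((key, group_key))
    return missing


if __name__ == "__main__":
    config = {
        "BGP_PEER_GROUP": {"default|LEAF": {"asn": "65501", "peer_type": "external"}},
        "BGP_NEIGHBOR": {"default|Ethernet32": {"peer_group_name": "LEAF"}},
    }
    print(missing_peer_groups(config))
```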


It appears that Stordis SONiC is in better shape compared to the Community or Edge-core versions, particularly in terms of supporting the FRR unified configuration. Issues 1 and 2b don't appear to apply to Stordis SONiC. Issue 2a may or may not apply, so having documentation for Stordis SONiC would be helpful.

matofeder commented 3 weeks ago

Following our investigation (refer to https://github.com/SovereignCloudStack/issues/issues/719#issuecomment-2349084594), we aimed to contribute upstream to address the issues preventing us from utilizing integrated FRR configuration.

Our upstream contributions are detailed in https://github.com/SovereignCloudStack/sonic-buildimage/pull/4, where we have ported fixes that should enable a functional integrated FRR configuration and more.

A SONiC image has been built from the above branch (https://github.com/SovereignCloudStack/sonic-buildimage/pull/4) and successfully tested in the SCS LAB environment; see https://github.com/SovereignCloudStack/hardware-landscape/pull/56