bio-routing / bio-rd

bio routing is a project to create a versatile, fast and reliable routing daemon in Golang. bio = BGP + IS-IS + OSPF
Apache License 2.0

Extra RAM being used by BGP process? #465

Open netixx opened 4 months ago

netixx commented 4 months ago

Describe the bug

I am running the BGP server with around 20 peers, each with around 1M routes.

I am seeing high RAM usage. Running a pprof heap dump, I get the following flamegraph: [flamegraph image]

It looks to me like some resources are not released when the routes are processed by the filters?

Steps to Reproduce

Run the router and check RAM allocation.

Expected behavior

Only the RIB component should use a lot of RAM.

Configuration used

b.AddPeer(server.PeerConfig{
            LocalAS: 16276,
            PeerAS: 16276,
            RouterID: addr.ToUint32(),
            PeerAddress: ip.Ptr(),
            LocalAddress: locAddr.Ptr(),
            AdminEnabled: true,
            VRF: defaultVRF,
            Passive: true,
            AdvertiseIPv4MultiProtocol: true,
            IPv4: &server.AddressFamilyConfig{
                AddPathRecv: true,
                ImportFilterChain: filter.NewAcceptAllFilterChain(),
                ExportFilterChain: filter.NewDrainFilterChain(),
            },
            IPv6: &server.AddressFamilyConfig{
                AddPathRecv: true,
                ImportFilterChain: filter.NewAcceptAllFilterChain(),
                ExportFilterChain: filter.NewDrainFilterChain(),
            },
        })

Additional context

We are running add-path with both IPv4 and IPv6 AFIs and unicast SAFI.

taktv6 commented 4 months ago

Hi, thanks for reaching out. I'm very curious now: how many prefixes/routes are you sending over to the process? We're well aware that our BGP memory footprint is anything but efficient at the moment. We had plans to improve that but haven't found the time to fix it yet.

netixx commented 4 months ago

At the time of the heap dump, I had around 29M paths in the BGP table (that is bio_bgp_route_received_count). I was receiving between 500 and 700 updates per second across 21 peers (around 30 updates/second per peer), according to bio_bgp_update_received_count. Each peer accounts for 1.35M to 1.4M routes.

What troubles me is this: on the right side of the flamegraph we can see the RIB holding a lot of small objects, which is expected.

But on the left side, there seems to be even more RAM held by github.com/bio-routing/bio-rd/protocols/bgp/server.(*fsmAddressFamily).updates.

I don't understand why this function holds that much memory, since it should mostly just push routes into the routing table (either the adjRibIn or the locRIB).

My next guess for optimising RAM is to use a copy-on-write scheme for paths between the adjRibIn and the locRIB, storing only the modified values instead of always copying the whole path.
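A minimal sketch of what such a copy-on-write path could look like. All type and field names here are hypothetical, not bio-rd's actual structs: the locRIB view shares the received attribute set with the adjRibIn and allocates only an overlay for fields a filter actually changed.

```go
package main

import "fmt"

// pathAttrs holds the immutable BGP path attributes as received.
// Hypothetical layout for illustration only.
type pathAttrs struct {
	LocalPref uint32
	MED       uint32
	ASPath    []uint32
}

// cowPath is a copy-on-write view of a path: it shares the received
// attributes and stores only the fields that were modified.
type cowPath struct {
	base      *pathAttrs // shared with the adjRibIn, never mutated
	localPref *uint32    // overlay: non-nil only if modified
}

// LocalPref returns the overlay value if set, else the shared base value.
func (p *cowPath) LocalPref() uint32 {
	if p.localPref != nil {
		return *p.localPref
	}
	return p.base.LocalPref
}

// SetLocalPref records a modification without copying the base attributes.
func (p *cowPath) SetLocalPref(v uint32) {
	p.localPref = &v
}

func main() {
	recv := &pathAttrs{LocalPref: 100, MED: 10, ASPath: []uint32{64512, 64513}}

	adjRibInView := &cowPath{base: recv}
	locRIBView := &cowPath{base: recv} // shares the same allocation
	locRIBView.SetLocalPref(200)       // only a single uint32 is allocated

	fmt.Println(adjRibInView.LocalPref(), locRIBView.LocalPref())
}
```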

Let me know if I can be of some help in that regard :)

Side note: I also see a lot of CPU spent in the garbage collector, which could mean there are more allocations going on than we want: [CPU profile image]

For additional reference, here is the "alloc" graph: [alloc flamegraph image]

In another project I am looking at, they use https://github.com/kentik/patricia (in particular https://github.com/kentik/patricia/tree/main/generics_tree) for RIB storage, which seems really efficient (example here: https://github.com/akvorado/akvorado/blob/main/inlet/routing/provider/bmp/rib.go), especially in terms of garbage collection.
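Much of that GC friendliness comes from keeping trie nodes in a flat slice and linking them by integer index instead of pointer, so the collector has almost no node-to-node edges to trace. A toy sketch of the idea (this is not kentik/patricia's API, just an illustration of index-based storage for an IPv4 prefix trie):

```go
package main

import "fmt"

// node is a binary trie node addressed by slice index, not pointer.
// Index-based links mean the GC never traces node-to-node edges.
type node struct {
	left, right uint32 // 0 = no child; indices into the pool
	routeID     uint32 // 0 = no route attached (hypothetical route handle)
}

// pool is a grow-only arena of nodes; slot 0 is a reserved sentinel,
// slot 1 is the trie root.
type pool struct {
	nodes []node
}

func newPool() *pool {
	return &pool{nodes: make([]node, 2)}
}

func (p *pool) alloc() uint32 {
	p.nodes = append(p.nodes, node{})
	return uint32(len(p.nodes) - 1)
}

// insert walks the most-significant `length` bits of addr, creating
// nodes as needed, and attaches routeID at the prefix's node.
func (p *pool) insert(addr uint32, length uint8, routeID uint32) {
	cur := uint32(1)
	for i := uint8(0); i < length; i++ {
		bit := (addr >> (31 - i)) & 1
		next := p.nodes[cur].left
		if bit == 1 {
			next = p.nodes[cur].right
		}
		if next == 0 {
			next = p.alloc()
			if bit == 0 {
				p.nodes[cur].left = next
			} else {
				p.nodes[cur].right = next
			}
		}
		cur = next
	}
	p.nodes[cur].routeID = routeID
}

// lookup returns the route of the longest matching prefix, or 0.
func (p *pool) lookup(addr uint32) uint32 {
	cur, best := uint32(1), uint32(0)
	for i := 0; i < 32 && cur != 0; i++ {
		if p.nodes[cur].routeID != 0 {
			best = p.nodes[cur].routeID
		}
		if (addr>>(31-i))&1 == 0 {
			cur = p.nodes[cur].left
		} else {
			cur = p.nodes[cur].right
		}
	}
	if cur != 0 && p.nodes[cur].routeID != 0 {
		best = p.nodes[cur].routeID
	}
	return best
}

func main() {
	p := newPool()
	p.insert(0x0A000000, 8, 1)  // 10.0.0.0/8  -> route 1
	p.insert(0x0A010000, 16, 2) // 10.1.0.0/16 -> route 2

	fmt.Println(p.lookup(0x0A010203)) // 10.1.2.3   matches the /16
	fmt.Println(p.lookup(0x0A7F0001)) // 10.127.0.1 matches only the /8
}
```

The whole trie lives in one `[]node`, so it costs the GC a single pointer to scan regardless of how many million prefixes it holds.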