bio-routing / bio-rd

bio routing is a project to create a versatile, fast and reliable routing daemon in Golang. bio = BGP + IS-IS + OSPF
Apache License 2.0
282 stars 44 forks source link

bgpPathACache leaks memory #472

Open 4xoc opened 1 month ago

4xoc commented 1 month ago

Describe the bug bgpPathACache causes large memory leaks and is itself not caching anything that can be retrieved. The result is an ever increasing map which consumes all memory until depletion. This is a combination of issues described in detail below.

Issue 1 - BGPPathA duplication

p (type BGPPathA) is copied and added to the map index, resulting in a duplication of the contents. https://github.com/bio-routing/bio-rd/blob/63eef2bb3b7a83a1b91df16207e72586cf871e66/route/bgp_path_cache.go#L34

Issue 2 - impossible cache hit

BGPPathA contains ptrs that always reference new memory (line 131 & 138). The map index therefore becomes unique for each new BGPPathA created. No dedup happens here. https://github.com/bio-routing/bio-rd/blob/63eef2bb3b7a83a1b91df16207e72586cf871e66/route/bgp_path.go#L129-L145

Issue 3 - missing cache eviction

Even if this cache would work, there is no mechanism for cache eviction. For very long running instances and many changes, this would always grow until no memory is available anymore.

Steps to Reproduce run ris-mirror, have route updates, ram go brrr and poof

Expected behavior ris-mirros shouldn't leak memory and can run for long times

Configuration used this is config independent as the code path is always used in ris-mirror

Additional context I've done a lot of profiling on this issue und very confident this is the major source of leaks for ris-mirror. Simply running without cache works very nicely and shows generally better peformance. Happy to provide a PR to fix this, depening on what solution is preferred. Removing cache seems like a better tradeoff than redesigning the entire cache strategy for probably marginal benefits.

taktv6 commented 1 month ago

Hi, thanks for reaching out.

The general idea is to deduplicate parts of BGP path attributes when they are the same. Not freeing unused resources is a known issue and is very welcome to get addresses. Not having any deduplication happening makes bio-rd consume multiple GB of RAM for just one IPv4 full table. That's why that things was introduced in the first place.

Let me get over the issues one by one: On 1: I think that's not 100% correct. The value of what p points to is being used as the index on the map. When two PathA objects with identical content come up, https://github.com/bio-routing/bio-rd/blob/63eef2bb3b7a83a1b91df16207e72586cf871e66/route/bgp_path_cache.go#L31 will return a pointer to the existing object. With the old pointer gone the old object will be garbage collected.

On 2: Yes, that's indeed a bug (that affects ris-mirror, but not ris if I'm not mistaken). The IPs need to be deduplicated themselves first before the PathA is being deduplicated.

On 3: As written above. The missing cache eviction is a known issue and a fix welcome.

4xoc commented 1 month ago

As far as I understand the code, your argument about a full table doesn't hold because the cache, as it is at the moment, doesn't do anything usefull. In combination with the second issue it can never match any entry because the ptrs are unique and thus is the hash of the map index. Unfortunately I don't have the possibility to test with a full table so I can't confirm this with data.

Let me clarify on the first issue, the duplication of data happens in the map itself, by design of maps. Go just doesn't save the hash of the struct but the entire index & the value, effectively copying th full structure in this particular instance. This happens even when the cache would be working correctly. Adding to this, any identical BGPPathA would have to exists at least two times just to be as efficient as not using this cache at all (and that's not even counting the overhead of the map itself although perhaps negligible). It indeed appears to be a problem only in the ris-mirror using the BGPPathA cache, not any other cache instance anywhere else. The cache mechanism itself works "as designed" for other instances because no ptrs exist and therefore could very well help on full-tables (including possible inefficiencies when <= 2 identical entries exist).

Pretty much the same applies to IPs being deduplicated in terms of efficiency.

What would you like to see happen here? I believe a cache for BGPPathA is not giving any real world benefits that justify its use or the time to invest in a proper caching strategy. With ptrs in the mix I think the options are quite limited: 1) not using ptrs at all 2) Using a different cache map index like a hash of the struct and the resolved ptr values. This could lead to collisions though or with larger hash types cause the index to be even larger than the raw data.

taktv6 commented 1 month ago

it can never match any entry because the ptrs are unique

No. That's only true when IPs (Nexthop, Source) are not deduplicated down to the same pointer, or when Aggregator is not nil. When IP address 192.0.2.0 is reliably resolved into the same pointer by the IP deduplication the ptrs are not unique, and thus the cache working. (That the cache in generals work can be seen here: https://go.dev/play/p/A0--_Gyh7vF)

True is, that the Nexthop and Source attributes are not deduplicated in ris-mirror. That's a bug.

We've intentionally split the BGP attributes into BGPPath and BGPPathA. All attributes that are commonly the same have been moved into BGPPathA. In case there are not enough paths that share common values for attributes in BGPPathA, the solution can indeed be less efficient. But that's okay.