emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Reconfig rate-limiting is taking into account cache memory usage #3332

Open javiku opened 3 years ago

javiku commented 3 years ago

Describe the bug

Ambassador Pods will "rate-limit" reconfiguration operations if they detect that memory usage is too high. When determining current memory usage, cache memory is taken into account (the value reported by memory.usage_in_bytes includes both RSS and cache memory), which results in frequently hitting the reconfig rate limit over time in long-running clusters with many reconfigurations.

It would probably be more accurate to look only at non-cache memory usage, since that is the critical value and the one typically shown in Grafana dashboards for Pod memory usage.

To Reproduce

Our setup:

We see this in both v1.11.1 and v1.12.1.

Expected behavior

We would expect reconfiguration rate-limiting to fire only when critical memory usage is high, that is, memory usage excluding cache. When processes require more memory and none is free, cache memory is released automatically. Think, for example, of Grafana dashboards that show Pod memory usage: they display "real" memory usage, excluding cache.

Moreover, given Kubernetes constraints, it is not easy (or even feasible) to configure cache memory usage and disposal.

Versions

Additional context

After seeing delays of up to 10 minutes in Ambassador processing endpoint updates, we started digging into the code and got to this point:

cmd/ambex/ratelimit.go - updaterWithTicker

```go
[...]
usagePercent := getUsage()

var maxStaleReconfigs int

switch {
case usagePercent >= 90:
  // With the default 10 minute drain time this works out to an average of one reconfig
  // every 10 minutes. This will guarantee the minimum possible memory usage due to stale
  // configs.
  maxStaleReconfigs = 1
[...]
```

So it seems that if Ambassador detects memory usage above 90%, it will only perform one reconfig every 10 minutes. During our tests, Grafana showed Ambassador Pods using 300MB of the 800MB limit they had assigned, so we didn't understand why this was happening.
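(For reference, here is a minimal sketch of how a percentage like this could be computed on cgroup v1, comparing memory.usage_in_bytes against memory.limit_in_bytes; we have not verified that getUsage() works exactly this way.)

```go
// Illustrative only: a cgroup-v1 usage percentage computed the way we assumed
// getUsage() works, i.e. usage_in_bytes relative to limit_in_bytes.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readBytes(path string) int64 {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0
	}
	v, _ := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	return v
}

func main() {
	usage := readBytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	limit := readBytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if limit <= 0 {
		fmt.Println("no memory limit set")
		return
	}
	fmt.Printf("usage=%dMiB limit=%dMiB -> %d%%\n",
		usage>>20, limit>>20, 100*usage/limit)
}
```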

But then we found this:

pkg/memory/memory.go - readUsage

```go
[...]
usage, err := readMemory("/sys/fs/cgroup/memory/memory.usage_in_bytes")
[...]
```

The value reported by memory.usage_in_bytes includes both RSS and cache memory. This means the memory usage function relies on a value that includes cache, which we didn't expect. The problem with taking cache into account is that cache can grow very easily (+400MB in our case) and is never released unless some process needs the memory, so it is hard to avoid this problem without restarting the Pods or adding some cache-release mechanism.
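A small sketch like the following (cgroup v1 paths, names illustrative) makes the split visible by comparing memory.usage_in_bytes with the total_rss and total_cache counters from memory.stat; several hundred MB of page cache on top of ~300MB of RSS is enough to push an 800MB-limited Pod toward the 90% threshold:

```go
// Illustrative sketch: print the rss/cache breakdown behind usage_in_bytes.
// Paths assume cgroup v1, as used by readUsage() above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}
	usage, _ := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)

	// memory.stat breaks the total down; total_cache is reclaimable page
	// cache, total_rss is the non-reclaimable "real" usage.
	f, err := os.Open("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	stats := map[string]int64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			v, _ := strconv.ParseInt(fields[1], 10, 64)
			stats[fields[0]] = v
		}
	}

	fmt.Printf("usage_in_bytes: %d\n", usage)
	fmt.Printf("total_rss:      %d\n", stats["total_rss"])
	fmt.Printf("total_cache:    %d\n", stats["total_cache"])
}
```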

roystchiang commented 3 years ago

Is it possible to use a metric similar to container_memory_working_set_bytes, which gives a rough estimate of how much memory cannot be evicted? This is also one of the metrics Kubernetes monitors to determine whether to kill a pod.
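For illustration, a working-set style value can be approximated on cgroup v1 the way cAdvisor derives container_memory_working_set_bytes: total usage minus the inactive file cache. The sketch below uses hypothetical helper names and is not taken from the Emissary codebase:

```go
// Sketch of a working-set style calculation on cgroup v1, following the way
// cAdvisor derives container_memory_working_set_bytes: total usage minus the
// inactive page cache, which the kernel can evict without killing anything.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readInt(path string) (int64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func readStat(path, key string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == key {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("%s not found in %s", key, path)
}

func workingSetBytes() (int64, error) {
	usage, err := readInt("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		return 0, err
	}
	inactive, err := readStat("/sys/fs/cgroup/memory/memory.stat", "total_inactive_file")
	if err != nil {
		return 0, err
	}
	ws := usage - inactive
	if ws < 0 {
		ws = 0
	}
	return ws, nil
}

func main() {
	ws, err := workingSetBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("working set: %d bytes\n", ws)
}
```

Swapping a calculation like this into readUsage() would make the rate limiter track roughly the same number Kubernetes uses for eviction decisions.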

We had to roll back the Ambassador upgrade to a version without the reconfig ratelimiter. I'm also wondering if we should expose a flag to disable the reconfig ratelimiter, so users experiencing operational pain can opt out.

I'm happy to submit a PR for this fix, if we can agree on the next steps.

esmet commented 3 years ago

@roystchiang that does sound reasonable to me and a PR would be very welcome

adrienzieba commented 3 years ago

I experience the same issue, but I think this new release should address it: https://github.com/emissary-ingress/emissary/blob/master/CHANGELOG.md#1137-june-03-2021

alanprot commented 2 years ago

Was this fixed by https://github.com/emissary-ingress/emissary/commit/61b94d0c62a297a0bfcf9ee395f01030ed5443dc? Should we resolve this?