envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Add a Resource Monitor in Overload Manager that Tracks Memory PSI #36681

Open ryansmick opened 1 month ago

ryansmick commented 1 month ago

Title: Add a Resource Monitor in Overload Manager that Tracks Memory PSI

Description: I’m currently using Envoy as part of Istio, and we are attempting to use the Overload Manager to prevent envoy from using too much memory, which could result in the istio-proxy container being OOM killed.

The method we’re currently evaluating is to track memory via the fixed_heap monitor, and take appropriate action to slow memory growth based on the pressure reported by that monitor. As istio-proxy runs with multiple processes inside it, it can be difficult to determine the correct value at which to set the max heap size. If we set it too low, we leave resources on the table and prematurely throttle. Worse, if we set it too high, we risk envoy and the container being OOM killed before our throttling kicks in at all.

Linux has a feature called Pressure Stall Information (or “PSI” for short) that quantifies overcommitted resources (including CPU, memory, and I/O) in terms of how long processes are stalled waiting for those resources. PSI is also available per cgroup in cgroup v2. Having a resource monitor that tracks this memory pressure would allow us to dynamically understand when we’re approaching a memory threshold for the container, rather than setting static thresholds and hoping they’re correct.
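For reference, under cgroup v2 the per-cgroup PSI data is exposed as a small text file (`memory.pressure`). Below is a minimal, non-Envoy sketch of what parsing it could look like; the mount point `/sys/fs/cgroup`, the struct, and the function names are just illustrative assumptions:

```cpp
// Minimal sketch (not Envoy code): read a cgroup v2 memory.pressure file and
// parse the "some" line. Assumes cgroup v2 is mounted at /sys/fs/cgroup.
#include <cstdio>
#include <fstream>
#include <string>

struct MemoryPsi {
  double some_avg10 = 0.0;   // % of recent time at least one task was stalled
  double some_avg60 = 0.0;
  double some_avg300 = 0.0;
  unsigned long long some_total_us = 0;  // cumulative stall time in microseconds
};

bool readMemoryPsi(const std::string& path, MemoryPsi& out) {
  std::ifstream file(path);
  if (!file.is_open()) {
    return false;  // e.g. cgroup v1 host, or PSI disabled in the kernel
  }
  std::string line;
  while (std::getline(file, line)) {
    // Lines look like:
    //   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
    //   full avg10=0.00 avg60=0.00 avg300=0.00 total=45678
    if (line.rfind("some", 0) == 0) {
      return std::sscanf(line.c_str(),
                         "some avg10=%lf avg60=%lf avg300=%lf total=%llu",
                         &out.some_avg10, &out.some_avg60, &out.some_avg300,
                         &out.some_total_us) == 4;
    }
  }
  return false;
}

int main() {
  MemoryPsi psi;
  if (readMemoryPsi("/sys/fs/cgroup/memory.pressure", psi)) {
    std::printf("some avg10=%.2f%% total=%llu us\n", psi.some_avg10, psi.some_total_us);
  }
  return 0;
}
```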

I understand that adding a new resource monitor, especially one that depends on something as specific as cgroup v2, will likely require discussion even before design and implementation, so I wanted to open this issue to start that discussion of whether the community would be open to adding this as a resource monitor in Envoy.

tyxia commented 1 month ago

Adding @nezdolik and @KBaichoo as domain/code owners.

nezdolik commented 1 month ago

This sounds like a useful feature (though it will not be available for macOS users). What would be the reasonable action for the Envoy process/new resource monitor if another process within the container/VM is consuming too much memory? When using this PSI stat, one cannot deduce which exact process is causing the memory stall for other processes, correct? Also, please get familiar with our extension policy: https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md#adding-new-extensions

ryansmick commented 1 month ago

Correct, the PSI stats don't distinguish between processes, so you don't know which process is stalled or which process is causing the stall. But with that said, I think in a lot of cases where resources are shared, which process is causing the stall is open to interpretation. It depends on how much memory you expect each process to be using.

Personally, in our use case envoy is the main driver of memory growth, specifically from upstream and downstream connections. The other process in the cgroup uses a relatively static amount of memory. So we would set our overload actions to limit envoy's memory growth, perhaps using stop_accepting_connections or a similar action. That way, we could continue to accept requests on the connections we do have, while limiting new connections in an attempt to prevent envoy from being OOM killed.

For non-istio users, they likely have a bit more freedom: they could split processes into separate cgroups to get PSI stats for individual processes, they could implement the equivalent of overload manager in their other processes to also take action as processes begin to stall on memory, etc. Since resource monitors are decoupled from the overload actions, users have flexibility in terms of the policies they apply during overload.

KBaichoo commented 4 weeks ago

Can you help me understand exactly when a task would end up stalling on memory from the PSI interface? E.g., does this mean swap kicks in? The OOM killer kicks in? The task is blocked by page faulting?

> it can be difficult to determine the correct value at which to set the max heap size. If we set it too low, we leave resources on the table and prematurely throttle. Worse, if we set it too high, we risk envoy and the container being OOM killed before our throttling kicks in at all.

What if you instead created a cgroup-based heap monitor that would dynamically adjust to the resources provided to the cgroup -- by reading the limit itself and treating some fixed percentage of it as the budget for envoy to react against?
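For illustration, a rough sketch of that cgroup-limit-based idea (not Envoy code; the file paths and the handling of an unlimited "max" value are assumptions about a cgroup v2 host):

```cpp
// Sketch: report current cgroup memory usage as a fraction of the cgroup limit.
#include <fstream>
#include <optional>
#include <string>

// Returns nullopt if the files are missing or the limit is "max" (unlimited).
std::optional<double> cgroupMemoryPressure(const std::string& cgroup_dir) {
  std::ifstream max_file(cgroup_dir + "/memory.max");
  std::ifstream current_file(cgroup_dir + "/memory.current");
  std::string max_str;
  unsigned long long current = 0;
  if (!(max_file >> max_str) || !(current_file >> current) || max_str == "max") {
    return std::nullopt;
  }
  const unsigned long long limit = std::stoull(max_str);
  if (limit == 0) {
    return std::nullopt;
  }
  // A pressure value in [0, 1]; overload actions would trigger at whatever
  // thresholds the user configures against such a monitor.
  return static_cast<double>(current) / static_cast<double>(limit);
}
```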

I'm not an expert in memory PSI, but one downside I see is the following: the granularity seems to be averages over 10, 60, and 300 seconds, which might be too coarse for sharp spikes in resource usage.

100% agree that the fixed heap monitor is lacking for some use cases :)

ryansmick commented 3 weeks ago

Thanks for the discussion @KBaichoo !

> Can you help me understand exactly when a task would end up stalling on memory from the PSI interface?

From a couple of LWN articles on the topic, the PSI interface tracks the amount of time tasks are stalled because the CPU is doing housekeeping of memory rather than productive work on the task. From what I've read in those articles and others, this housekeeping primarily entails page refaults: a page is dropped from physical memory due to memory pressure (and put in swap space, or simply written back to disk in the case of a file-backed page) and then quickly faulted back into physical memory because a task needs it.

> What if you instead created a cgroup-based heap monitor that would dynamically adjust to the resources provided to the cgroup -- by reading the limit itself and treating some fixed percentage of it as the budget for envoy to react against?

I think this would definitely be useful for envoy users, especially those that are not running cgroup v2 and don't even have the option to use PSI. The reason I think PSI may be a superior option for cgroup v2 users is that with PSI you actually get quantitative metrics on the memory pressure the system/cgroup is experiencing. If we just look at the cgroup limit and have envoy react at some percentage of that, it would require load testing to ensure we aren't setting the reaction threshold too low (thus leaving resources unused) or too high (thus seeing significant stalls/latency impact before we react). I think using PSI will help cut down on the up-front load testing requirements, as it actually tracks the stalls; users just have to decide how much stalling they're willing to accept before reacting.

A real-world example of using PSI to react as the system becomes stalled on memory is Facebook's oomd. It's a userspace OOM killer that attempts to kill a memory-hogging process before the kernel's OOM killer does, and some of its configuration uses memory PSI metrics.

If you want to hear more about why/how PSI may be useful, you can check out this talk from Chris Down (the relevant section is 19:23-26:30). He's a Linux kernel developer and SRE at Meta who contributed to the development of PSI. In the section of the video mentioned above, he discusses why PSI is useful and how Meta uses it internally in a similar way to what's being proposed here.

> one downside I see is the following: the granularity seems to be averages over 10, 60, and 300 seconds, which might be too coarse for sharp spikes in resource usage

There is a report that the fourth field, total (which is the cumulative amount of time in microseconds that tasks in the cgroup have been stalled on memory), is updated at a very high frequency. If that is truly the case, one possible implementation would be to poll this file every refresh_interval and subtract the previously recorded total value from the new one.
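As a rough illustration of that delta-based approach (the class and method names are hypothetical, not an existing Envoy API):

```cpp
// Sketch: each poll, convert the cumulative "total" stall time from
// memory.pressure into the fraction of wall-clock time spent stalled
// since the previous poll, clamped to [0, 1] for use as a pressure value.
#include <algorithm>
#include <chrono>
#include <cstdint>

class PsiStallTracker {
public:
  // total_stall_us: the "total=" field (microseconds) read at poll time.
  double onPoll(uint64_t total_stall_us, std::chrono::steady_clock::time_point now) {
    if (!has_previous_) {
      previous_total_us_ = total_stall_us;
      previous_time_ = now;
      has_previous_ = true;
      return 0.0;
    }
    const uint64_t stall_delta_us = total_stall_us - previous_total_us_;
    const auto elapsed_us =
        std::chrono::duration_cast<std::chrono::microseconds>(now - previous_time_).count();
    previous_total_us_ = total_stall_us;
    previous_time_ = now;
    if (elapsed_us <= 0) {
      return 0.0;
    }
    return std::min(1.0, static_cast<double>(stall_delta_us) /
                             static_cast<double>(elapsed_us));
  }

private:
  bool has_previous_ = false;
  uint64_t previous_total_us_ = 0;
  std::chrono::steady_clock::time_point previous_time_;
};
```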

With all of this said, I have done a lot of reading on cgroup v2 and PSI, but I don't have much hands-on experience. It sounds like this may be the case for this group as a whole (and please correct me if I'm wrong :) ), so I'll take some time this week to experiment with it, put a quick test bench together, and attempt to make it clear with data how I think a PSI monitor can help.
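As a rough idea of what such a test bench could look like (purely a sketch, not a committed design): a small program that steadily allocates and touches memory, run inside a cgroup whose memory.max is set below the total allocation, while the cgroup's memory.pressure file is watched from another shell.

```cpp
// Sketch of a memory-pressure generator: allocate 64 MiB per second and write
// to every page so the allocations stay resident, forcing reclaim/refaults
// once the cgroup limit is approached.
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

int main() {
  constexpr std::size_t kChunkBytes = 64 * 1024 * 1024;  // 64 MiB per step
  std::vector<std::unique_ptr<char[]>> chunks;
  for (int i = 0; i < 64; ++i) {
    chunks.emplace_back(new char[kChunkBytes]);
    // Write a non-zero pattern so the pages are actually faulted in.
    std::memset(chunks.back().get(), 0xA5, kChunkBytes);
    std::cout << "allocated " << ((i + 1) * (kChunkBytes / (1024 * 1024)))
              << " MiB total" << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return 0;
}
```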

KBaichoo commented 2 weeks ago

Thank you @ryansmick for all of the resources and pointers (especially the timestamped section!). This is a good idea. I agree that, on the surface, this looks more user friendly than trying to find the “correct” limits, or than having to handle burstiness in memory usage when exceeding a high threshold doesn’t actually cause any system problems.

One aspect that I don’t yet have an intuitive understanding of is “how much time stalling due to memory pressure” is too much.

I look forward to learning more about your progress on this :).