envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Add a Resource Monitor in Overload Manager that Tracks Memory PSI #36681

Open ryansmick opened 1 day ago

ryansmick commented 1 day ago

Title: Add a Resource Monitor in Overload Manager that Tracks Memory PSI

Description: I’m currently using Envoy as part of Istio, and we are attempting to use Overload Manager to prevent Envoy from using too much memory, which could result in the istio-proxy container being OOM killed.

The method we’re currently evaluating is to track memory via the fixed_heap monitor and take appropriate action to slow memory growth based on the pressure reported by that monitor. Because the istio-proxy container runs multiple processes, it can be difficult to determine the correct value for the max heap size. If we set it too low, we leave resources on the table and throttle prematurely. Worse, if we set it too high, we risk Envoy and the container being OOM killed before our throttling kicks in at all.
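To make the trade-off concrete, here is a minimal sketch. It assumes the monitor reports pressure as used heap divided by the configured maximum, clamped to [0, 1]. The function name and all numbers are hypothetical and are not taken from Envoy's source.

```python
MIB = 1024 * 1024


def fixed_heap_pressure(used_heap_bytes: int, max_heap_size_bytes: int) -> float:
    """Fraction of the configured max heap in use, clamped to [0, 1].

    Assumption: this mirrors the idea of a fixed-heap monitor, not its exact code.
    """
    if max_heap_size_bytes <= 0:
        raise ValueError("max_heap_size_bytes must be positive")
    return min(1.0, max(0.0, used_heap_bytes / max_heap_size_bytes))


# Hypothetical scenario: Envoy uses 600 MiB of heap in a container limited to
# 1 GiB, while other processes in the same container use another 300 MiB.
used = 600 * MIB

# Max heap set too low (512 MiB): pressure saturates although the container
# still has headroom, so throttling starts prematurely.
too_low = fixed_heap_pressure(used, 512 * MIB)

# Max heap set too high (2 GiB): pressure looks mild although the container
# as a whole is already at 900 MiB of its 1 GiB limit.
too_high = fixed_heap_pressure(used, 2048 * MIB)

print(f"too_low={too_low:.2f} too_high={too_high:.2f}")
```

In both cases the static denominator, not the container's actual memory situation, determines the reported pressure.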

Linux has a feature called Pressure Stall Information (or “PSI” for short) that quantifies overcommitted resources (including CPU, memory, and I/O) in terms of how long processes are stalled waiting for those resources. PSI is also available per cgroup in cgroup v2. Having a resource monitor that tracks this memory pressure would allow us to detect dynamically when we’re approaching the container’s memory limit, rather than setting static thresholds and hoping they’re correct.
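As a rough illustration of what such a monitor would read, the sketch below parses the PSI text format the kernel exposes system-wide in `/proc/pressure/memory` and per cgroup in a cgroup v2 `memory.pressure` file. The function names are hypothetical, the sample values are made up, and mapping the `some avg10` percentage directly onto a [0, 1] pressure value is only one possible design that would need discussion.

```python
from typing import Dict

# Made-up sample in the kernel's PSI format: "some" is the share of time at
# least one task was stalled on memory, "full" the share when all were stalled.
SAMPLE = """some avg10=12.50 avg60=4.20 avg300=1.00 total=123456
full avg10=3.00 avg60=1.10 avg300=0.25 total=45678
"""


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def memory_pressure(psi: Dict[str, Dict[str, float]],
                    kind: str = "some", window: str = "avg10") -> float:
    """Map a PSI average (a percentage, 0-100) onto a [0, 1] pressure value.

    Assumption: a linear mapping; a real monitor might scale or smooth this.
    """
    return min(1.0, max(0.0, psi[kind][window] / 100.0))


psi = parse_psi(SAMPLE)
pressure = memory_pressure(psi)
print(f"memory pressure: {pressure:.3f}")
```

On a live system the text would come from reading the pressure file instead of `SAMPLE`; the exact path inside a container depends on how the cgroup filesystem is mounted.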

I understand adding a new resource monitor, especially one that has dependencies on something as specific as cgroup v2, will likely require discussion, even before design and implementation, so I wanted to open this issue to begin that discussion of whether the community would be open to adding this as a resource monitor in Envoy.

tyxia commented 19 hours ago

Adding @nezdolik and @KBaichoo as domain/code owners.

nezdolik commented 15 hours ago

This sounds like a useful feature (though it will not be available to macOS users). What would be a reasonable action for the Envoy process or the new resource monitor to take if another process within the container/VM is consuming too much memory? When using this PSI stat, one cannot deduce exactly which process is causing the memory stall for other processes, correct? Also, please familiarize yourself with our extension policy: https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md#adding-new-extensions