Open ryansmick opened 1 day ago
Adding @nezdolik and @KBaichoo as domain/code owners.
This sounds like a useful feature (though it will not be available to macOS users). What would be a reasonable action for the Envoy process or the new resource monitor to take if another process within the container/VM is consuming too much memory? With this PSI stat, one cannot deduce exactly which process is causing the memory stall for the other processes, correct? Also, please get familiar with our extension policy: https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md#adding-new-extensions
Title: Add a Resource Monitor in Overload Manager that Tracks Memory PSI
Description: I’m currently using Envoy as part of Istio, and we are attempting to use Overload Manager to prevent Envoy from using too much memory, which could result in the istio-proxy container being OOM killed.
The method we’re currently evaluating is to track memory via the fixed_heap monitor and take appropriate action to slow memory growth based on the pressure reported by that monitor. Because istio-proxy runs multiple processes inside the container, it can be difficult to determine the correct value for the max heap size. If we set it too low, we leave resources on the table and throttle prematurely. Worse, if we set it too high, we risk Envoy and the container being OOM killed before our throttling kicks in at all.
Linux has a feature called Pressure Stall Information (or “PSI” for short) that quantifies overcommitted resources (including CPU, memory, and I/O) in terms of how long processes are stalled waiting for those resources. PSI is also available per cgroup in cgroup v2. A resource monitor that tracks this memory pressure would let us detect dynamically when we’re approaching the container’s memory limit, rather than setting static thresholds and hoping they’re correct.
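To make the idea concrete, here is a rough sketch of how a monitor could turn PSI output into a pressure value between 0 and 1. It assumes the standard PSI text format that the kernel exposes in `/proc/pressure/memory` and in a cgroup v2 `memory.pressure` file. The function names and the `saturation_pct` scaling are hypothetical, chosen only for illustration; they are not part of Envoy or of any existing monitor.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI file contents into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def memory_pressure(text: str, saturation_pct: float = 10.0) -> float:
    """Map the 'some avg10' stall percentage onto [0, 1].

    saturation_pct is a hypothetical tuning knob: the stall percentage
    that should be treated as full pressure (1.0).
    """
    some_avg10 = parse_psi(text)["some"]["avg10"]
    return min(some_avg10 / saturation_pct, 1.0)


# Sample contents in the PSI format; the numbers are made up.
SAMPLE = (
    "some avg10=2.50 avg60=1.10 avg300=0.40 total=123456\n"
    "full avg10=0.50 avg60=0.20 avg300=0.05 total=23456\n"
)

if __name__ == "__main__":
    print(memory_pressure(SAMPLE))  # 2.50 / 10.0 -> 0.25
```

In a real monitor the text would be read from the container's own `memory.pressure` file on each refresh interval. Which window (`avg10`, `avg60`, `avg300`) and which line (`some` or `full`) to use is an open design question for this discussion.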
I understand that adding a new resource monitor, especially one that depends on something as specific as cgroup v2, will likely require discussion before any design or implementation work, so I wanted to open this issue to start the discussion of whether the community would be open to adding this as a resource monitor in Envoy.