Jianlin-lv opened this issue 2 months ago
Hello, thanks for the detailed report. The fact that you see a memory increase on loading tracing policies is normal behavior. If you are looking at container_memory_working_set_bytes, then depending on what you use for control groups (v1 or v2), it also accounts for the memory of the BPF maps we load from the TracingPolicy; I imagine in your case it might be v2. On the agent side, there are also parts of the code that allocate memory even if you don't load any policy.
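For reference, one way to see how that charge breaks down is to read the cgroup v2 memory files directly. Below is a minimal sketch (not Tetragon code, and the cgroup path is an assumption to adjust for your deployment) that separates anonymous memory, which is mostly the Go heap, from the rest of the charge, which on kernels that account BPF memory to the cgroup also includes the loaded BPF maps:

```go
// Minimal sketch (not from the Tetragon codebase): read a cgroup v2
// memory.current and memory.stat to separate anonymous memory (mostly the Go
// heap) from the rest of the charge, which on kernels that account BPF memory
// to the cgroup also includes the loaded BPF maps.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readStat parses the flat "key value" lines of a cgroup v2 memory.stat file.
func readStat(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, sc.Err()
}

func main() {
	// Assumption: adjust this path to the cgroup of your Tetragon pod/container.
	cgroup := "/sys/fs/cgroup/system.slice/tetragon.service"

	raw, err := os.ReadFile(cgroup + "/memory.current")
	if err != nil {
		panic(err)
	}
	current, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		panic(err)
	}

	stats, err := readStat(cgroup + "/memory.stat")
	if err != nil {
		panic(err)
	}

	anon, file := stats["anon"], stats["file"]
	var other uint64
	if sum := anon + file; current > sum {
		// The remainder is mostly kernel-side memory: slab, percpu, and,
		// where accounted, BPF map memory.
		other = current - sum
	}
	fmt.Printf("current=%d anon=%d file=%d kernel-side (approx)=%d\n", current, anon, file, other)
}
```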
That said, I've been working on tracking memory consumption of Tetragon and trying to avoid unnecessary memory waste.
On your heap dumps, the biggest consumption post is github.com/cilium/tetragon/pkg/process.initProcessInternalExec, which is the process cache. I'm currently working on fixing a potential issue we have there: a cache that grows too much compared to the actual processes running on the host. That might let Tetragon consume less memory overall.
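For reference, heap growth like this is easiest to confirm by comparing two profiles taken some time apart. Here is a minimal sketch of dumping such snapshots from a Go program with runtime/pprof (illustrative only, not Tetragon code; with Tetragon you would simply scrape the pprof endpoint you already enabled):

```go
// Illustrative only: write two heap profiles some time apart so they can be
// compared with `go tool pprof -diff_base heap-0.pb.gz heap-1.pb.gz`.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func dumpHeap(name string) error {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // get up-to-date allocation statistics before dumping
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeap("heap-0.pb.gz"); err != nil {
		panic(err)
	}
	time.Sleep(10 * time.Minute) // let the workload churn in between
	if err := dumpHeap("heap-1.pb.gz"); err != nil {
		panic(err)
	}
	fmt.Println("wrote heap-0.pb.gz and heap-1.pb.gz")
}
```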
Hi @mtardy, that's exciting! We're seeing something similar to @Jianlin-lv as well, where our memory seems to grow unbounded. Our heap dump also has the process cache as the largest consumer, and our workloads are ephemeral by nature (lots of pod churn), so I'm quite curious about your point on having a process cache that grows too much. I have some questions regarding your work:
- Would / should we reduce the process cache size as a way to limit it from growing too much? My guess is probably not because IF you do have that many processes, you would want to have them in cache...
- What are some issues you see with the current process cache?
Let me answer both questions here: theoretically, the process cache size should be in line with the number of processes currently running on the host. A lot of different situations can happen depending on what you do with your host, but in the general case, it should eventually be stable and pretty small.
The issue we see is that in some situations, the process cache fills up pretty quickly even though the number of processes on the host is under a few hundred. We are currently merging work that will let us diagnose, on a running agent, what's happening in the cache: https://github.com/cilium/tetragon/pull/2246.
Eventually, if people have very different needs (for example, running Tetragon at scale on a host with hundreds of thousands of processes), we are open to making the sizing of the cache and the BPF maps tunable.
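As a rough sanity check of that expectation, you can compare the process cache size reported in the metrics with the number of processes actually alive on the host. A small sketch (illustrative only, not something Tetragon ships) that counts live PIDs from /proc:

```go
// Count live processes by listing the numeric entries in /proc. In a healthy
// situation the process cache should stay roughly in line with this number,
// plus recently exited processes still referenced by in-flight events.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func liveProcesses() (int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return 0, err
	}
	count := 0
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		// Directories with purely numeric names are PIDs.
		if _, err := strconv.Atoi(e.Name()); err == nil {
			count++
		}
	}
	return count, nil
}

func main() {
	n, err := liveProcesses()
	if err != nil {
		panic(err)
	}
	fmt.Printf("processes currently alive on this host: %d\n", n)
}
```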
What would lead to high cache memory? Our memory metrics for Tetragon show high cache memory (2GB+) but relatively low RSS (~500MB). We tried forcing a Go GC as well as writing to /proc/sys/vm/drop_caches to try to reclaim memory, but the cache size keeps growing.
Here it depends on what we are talking about exactly. Generally, Tetragon has two big posts of memory consumption if you look at memory.current in the memory cgroup v2:
- The BPF maps memory loaded from the policies (… on the release page): with many policies loaded, this typically occupies about ~20% of the total memory used (this estimate really depends on what you do with Tetragon).
- The Go agent heap (… GOMEMLIMIT can also help reduce that amount while keeping performance). What happens in reality is that if your system is not under pressure, some memory might not have been reclaimed by the OS, and your total consumption, in my example, could be around 500MB.
Now, if you are seeing that the heap impact on the system is 4x the actual heap allocation, I think it would be worth investigating what's happening: looking at kernel vs. anonymous memory, then the memory segments, then the Go memstats, then a memory profile (I have some notes on understanding memory here).
We merged this patch, which should help with base heap (and thus RSS) use.
We also merged this, which should help with understanding memory use from process cache issues (the thing you are seeing as github.com/cilium/tetragon/pkg/process.initProcessInternalExec in your initial report).
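To make the heap vs. RSS comparison concrete, here is a hedged sketch (generic Go, not Tetragon code; the 1 GiB limit is just an example value) of reading the runtime's own accounting and setting a soft memory limit, which is the programmatic equivalent of the GOMEMLIMIT environment variable:

```go
// Compare the live Go heap with the memory the runtime has obtained from the
// OS but not yet returned, and set a soft memory limit.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Same effect as GOMEMLIMIT=1GiB: the GC works harder as total memory
	// approaches this limit instead of letting the heap grow freely.
	debug.SetMemoryLimit(1 << 30)

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Printf("HeapAlloc (live objects):                 %d MiB\n", m.HeapAlloc>>20)
	fmt.Printf("HeapIdle-HeapReleased (held, returnable): %d MiB\n", (m.HeapIdle-m.HeapReleased)>>20)
	fmt.Printf("Sys (total obtained from the OS):         %d MiB\n", m.Sys>>20)
}
```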
What happened?
In my test environment, I applied two TracingPolicies and observed an increasing trend in Tetragon's RSS memory consumption.
I enabled pprof to try to figure out which part consumes the most memory. Comparing the two samples taken before and after, process.initProcessInternalExec, tracing.HandleMsgGenericKprobe, namespace.GetMsgNamespaces, and caps.GetCapabilitiesTypes have all increased their memory consumption.
I'm not sure if this is the desired behavior or if there is a memory leak.
TracingPolicy
Tetragon Version
v1.1.2
Kernel Version
Ubuntu 22.04, kernel 5.15.0-26
Kubernetes Version
No response
Bugtool
No response
Relevant log output
No response
Anything else?
No response