eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.

Kubernetes cadvisor reduces value proposition of OpenJ9's Shared Class Cache #11070

Open edrevo opened 3 years ago

edrevo commented 3 years ago

Sorry in advance if this is not the appropriate channel to raise this issue. Let me know if there is a better channel and I can move the conversation there.

My project has been using OpenJ9 for a few months, primarily due to the smaller memory footprint compared to GraalVM or plain HotSpot. This is an important feature for us, since we are developing microservices that run on Kubernetes. We are also aggressively using OpenJ9's AOT and SCC to further reduce the memory footprint. We mount a shared volume on each of our nodes for the SCC, and all pods use that same SCC.

The results are great! One of my processes has an RSS of 774MB, but a PSS of only 520MB. Nice!
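
For reference, numbers like these can be read straight from the kernel on reasonably recent Linux (4.14+) via /proc/&lt;pid&gt;/smaps_rollup. Here is a minimal, hypothetical Java helper (the class name and output are mine, not part of OpenJ9) that dumps the Rss and Pss totals for the current process:

```java
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of OpenJ9): prints the current process's
 * Rss and Pss totals as reported by the kernel in /proc/self/smaps_rollup.
 * Requires Linux 4.14+, where smaps_rollup is available.
 */
public class RssPss {
    public static void main(String[] args) throws Exception {
        for (String line : Files.readAllLines(Path.of("/proc/self/smaps_rollup"))) {
            // Lines look like "Rss:            774000 kB" / "Pss:            520000 kB"
            if (line.startsWith("Rss:") || line.startsWith("Pss:")) {
                System.out.println(line.trim());
            }
        }
    }
}
```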

However... Kubernetes isn't using those metrics to monitor the pod's memory. Kubernetes delegates gathering the memory metrics to a project called cAdvisor (https://github.com/google/cadvisor), which sends two relevant metrics to Kubernetes in order to determine how much memory a pod is using: memory_working_set and memory_rss (please note, the definitions of these metrics do not match the Linux definitions of WSS and RSS). The memory_rss metric is fine, since it doesn't take memory-mapped files into account, so I'll focus on memory_working_set.

memory_working_set is a metric that is calculated as follows: usage - total_inactive_file (where usage is the value found in /sys/fs/cgroup/memory/memory.usage_in_bytes and total_inactive_file is the value found in /sys/fs/cgroup/memory/memory.stat). In my process, usage is 675MB and total_inactive_file is just 15MB, since it only accounts for the portion of the memory-mapped file that hasn't been accessed for a while (see https://lwn.net/Articles/432224/). This means that memory_working_set is 660MB, which is 140MB higher than my PSS! So I am wasting 140MB per pod/container, which adds up to a whole lot of memory.
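
For anyone who wants to reproduce these numbers from inside a container, here is a minimal Java sketch of the same usage - total_inactive_file calculation. It assumes the cgroup v1 paths mentioned above; the class name, the clamp to zero, and the output format are mine, not cAdvisor's.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Sketch of the working-set calculation described above for cgroup v1:
 * working_set = usage_in_bytes - total_inactive_file.
 */
public class CgroupWorkingSet {
    private static final Path USAGE = Path.of("/sys/fs/cgroup/memory/memory.usage_in_bytes");
    private static final Path STAT  = Path.of("/sys/fs/cgroup/memory/memory.stat");

    public static void main(String[] args) throws Exception {
        long usage = Long.parseLong(Files.readString(USAGE).trim());

        long totalInactiveFile = 0;
        for (String line : Files.readAllLines(STAT)) {
            // memory.stat lines look like "total_inactive_file 15728640"
            if (line.startsWith("total_inactive_file ")) {
                totalInactiveFile = Long.parseLong(line.split("\\s+")[1]);
                break;
            }
        }

        long workingSet = Math.max(0, usage - totalInactiveFile);
        System.out.printf("usage=%dMB inactive_file=%dMB working_set=%dMB%n",
                usage >> 20, totalInactiveFile >> 20, workingSet >> 20);
    }
}
```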

I am not sure if there's anything that can be done on OpenJ9's side to improve this scenario on Kubernetes, but I definitely wanted to let you know, since I believe this hurts OpenJ9's value proposition for the SCC in cloud scenarios. It might be worth working with the cAdvisor team to fix this, or thinking of a way to mitigate it on OpenJ9's side. 140MB of potential RAM savings per pod is a lot!

edrevo commented 3 years ago

Related issue in k8s: https://github.com/kubernetes/kubernetes/issues/43916

DanHeidinga commented 3 years ago

@pshipton @hangshao0 @vijaysun-omr @mpirvu for awareness

vijaysun-omr commented 3 years ago

fyi @tajila

hangshao0 commented 3 years ago

memory_working_set is a metric that is calculated as follows: usage - total_inactive_file

Talked to @vijaysun-omr on Slack. We do not have control over how Kubernetes calculates the memory metric; on OpenJ9's side, what we can do is look at whether there are APIs we can call (or change) to improve the above numbers.

ninja- commented 3 years ago

I think that's expected behaviour. Projects like cAdvisor or VPA never officially supported scaling JVMs. The Kubernetes bug you linked mostly talks about reclaimable memory, which isn't the same as a shared class cache that is in use.

Please consider that, for example, PSS can't be known in advance because it depends on various factors, like the number of pods running on the machine or the version of the shared class cache file, which could be regenerated at some point in the future.

You should account for that manually by setting the memory limit close to RSS and the memory request close to PSS.