devopsprodigy / kubegraf

Grafana-plugin for k8s' monitoring
MIT License

container/pod memory usage not always real time usage #56

Closed gjemp closed 3 years ago

gjemp commented 3 years ago

Hi, we had a case where the memory usage shown on KubeGraf's pod/container panels was high enough that it should have triggered autoscaling or an OOMKill. Neither happened. We then looked at the Docker container's memory usage and it was roughly 2 times lower; kubectl reported values very similar to Docker's (the exact numbers depend on the period and on how many samples in that period are compressed into a single max/min/avg/current sample). Like many others we have asked, we assumed the memory panel shows the real-time current usage. After digging into the different metrics Prometheus exposes, it turns out that what the panel shows and what Kubernetes uses to trigger killing and scaling are different things. The metric used in KubeGraf is the "all-in-one" memory usage, which also includes cached data and therefore is not the current usage.

So my suggestion, to match the Kubernetes container lifecycle more accurately, is to change the panel queries as follows (see the sketch below):
a) change the calculation method from sum to avg/max;
b) use container_memory_working_set_bytes instead of container_memory_usage_bytes, since it more accurately reflects current usage without cache;
c) add another series to the panel that also shows the all-in-one value as Usage + cache.
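A rough sketch of what the changed panel query could look like. The label filters and the `$namespace`/`$pod` variables are assumptions for illustration, not KubeGraf's actual queries, and on older cAdvisor versions the labels are `pod_name`/`container_name` instead of `pod`/`container`:

```promql
# (a) + (b): per-container current usage without page cache, aggregated with max instead of sum
max by (container) (
  container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!="", container!="POD"}
)

# (c): extra series showing the "all-in-one" value (usage + cache) for comparison
max by (container) (
  container_memory_usage_bytes{namespace="$namespace", pod="$pod", container!="", container!="POD"}
)
```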

Related dashboards:
- Pods dashboard, memory panel
- Deployments dashboard, memory panel
- DaemonSets dashboard, memory panel
- StatefulSets dashboard, memory panel

The Node dashboard has a much better explanation of the measurement, and it is clearer what is what. Nodes Overview shows correct data, but there is a problem with the node Pods count: it is 2 times bigger than reality.
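One possible cause of a doubled count is duplicate series (for example, the same kube-state-metrics scraped by two jobs). A hedged sketch of a per-node pod count that collapses duplicates first; `$node` is an assumed dashboard variable, not necessarily what the plugin uses:

```promql
# count distinct pods scheduled on a node, ignoring duplicate series
count(
  count by (pod) (kube_pod_info{node="$node"})
)
```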

A more detailed explanation can be found here: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66

We were really struggling to understand why no OOMKill was triggered, because according to the panel stats it should have been. Our expectation that the panel shows current memory usage, in the commonly understood sense, was simply wrong.
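To see how close a container actually is to being OOMKilled, one can compare the working set against the memory limit. A minimal sketch, assuming kube-state-metrics exposes kube_pod_container_resource_limits with resource/unit labels (older versions expose a _memory_bytes-suffixed metric instead); the label filters are illustrative:

```promql
# working set as a fraction of the memory limit; values approaching 1 mean an OOMKill is imminent
max by (pod, container) (
  container_memory_working_set_bytes{namespace="$namespace", container!="", container!="POD"}
)
/
max by (pod, container) (
  kube_pod_container_resource_limits{namespace="$namespace", resource="memory", unit="byte"}
)
```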

SergeiSporyshev commented 3 years ago

Hi @gjemp, thanks for the issue! We will investigate it.

SergeiSporyshev commented 3 years ago

@gjemp I think that changing container_memory_usage_bytes to container_memory_working_set_bytes is the best way forward.

SergeiSporyshev commented 3 years ago

Hi, @gjemp

We fixed this in our latest release: https://github.com/devopsprodigy/kubegraf/releases/tag/v1.5.2

gjemp commented 3 years ago

Great :) We have since hit another case where, even with that fix, the data cannot explain why the container gets OOMKilled, but that is not related to this issue. It looks more related to the language (Node.js) and an application lifecycle that does not translate well into metrics: no peaks over the limits, yet it still gets OOMKilled :) Thanks for acting so fast :)