chriscardillo opened this issue 1 year ago
Additionally, here is a screenshot of the top of the top output, from a debug container on the single node earlier today:
Feels pretty light, no?
I've been having similar issues after upgrading as well.
It appears that there's a huge difference between OS-reported memory usage and kubectl top no.
And yet, this appears to have an impact on performance.
an impact, or no impact?
I think we need to compare exactly what's being measured in each case. metrics-server should be looking at container_memory_working_set_bytes. This should come from kubelet -> cadvisor -> the cgroup v2 memory.stat file.
Could you try comparing what you see in metrics-server with the memory.stat file for the pod cgroup before/after?
e.g. for v1
cat /sys/fs/cgroup/memory/kubepods/$QOS_CLASS/pod$UID/memory.stat
e.g. for v2
cat /sys/fs/cgroup/kubepods.slice/$QOS_CLASS.slice/pod$UID.scope/memory.stat
or similar?
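If it helps, here is a rough sketch of how to pull those numbers for a single pod on cgroup v2 with the systemd driver (the slice path varies by QoS class and cgroup driver, so treat the layout below as an example):
# get the pod UID
kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}'
# on the node (burstable QoS shown; dashes in the UID become underscores in the slice name)
POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid_with_underscores>.slice
usage=$(cat "$POD_CG/memory.current")
inactive_file=$(awk '$1 == "inactive_file" {print $2}' "$POD_CG/memory.stat")
# working set as reported by the kubelet: usage minus inactive_file
echo "working_set_bytes: $((usage - inactive_file))"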
Hello, I meant an impact
I can't really compare before and after; as I understand it, it is not possible to revert an upgrade on AKS. If I'm wrong, please advise.
What I observe is the following:
Nodepool memory before was around 70% to 80%
After the upgrade it sits around 100%-110%, and we have noticed some impact on the services running in the cluster.
If I add up the memory used by all pods in that nodepool, the sum is some amount X, while kubectl top no reports 4 times X, which can be very misleading.
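For reference, this is roughly how the two numbers can be compared (a sketch; it assumes every pod's memory column is reported in Mi):
# sum the memory column across all pods
kubectl top pods -A --no-headers | awk '{gsub("Mi","",$4); sum += $4} END {print sum " Mi total across pods"}'
# compare with what each node reports
kubectl top no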
After reading the original issue where @chriscardillo first posted his concerns, I had some questions about the role of the Kubernetes version and the region.
So I created 3 clusters to test the outcome. They were created one right after the other, and none of them has anything deployed other than what already comes with AKS by default.
What I found is that 1.24.x (in this case 1.24.9) does in fact consume less memory by default, at least as reported by kubectl top no, while 1.26.3 uses more memory in the EU region.
I'm still trying to collect more information and piece this together, but I thought it was interesting enough to be worth a post.
I think this is an issue with cadvisor collection for root cgroup statistics.
kubelet uses cadvisor for node-level statistics (whether it uses cri or cadvisor stats provider for pods): https://github.com/kubernetes/kubernetes/blob/3c380199e9895efb47c6ad9035b2438a0a8bce5e/pkg/kubelet/stats/provider.go#L42-L70
kubectl top node or kubectl get --raw /api/v1/nodes/foo/proxy/stats/summary | jq -C .node.memory shows higher values than the same cluster config + pods with cgroup v1.
kubectl top node uses node working set bytes: https://github.com/kubernetes-sigs/metrics-server/blob/5daafd91f74725f21f543c31682ecc3e3838a126/pkg/scraper/client/resource/decode.go#L34
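The same working-set figure can also be read straight off the metrics API that kubectl top consumes (assuming metrics-server is installed and jq is available):
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq '.items[] | {name: .metadata.name, memory: .usage.memory}'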
kubelet stats provider uses cadvisor to get the cgroup stats: https://github.com/kubernetes/kubernetes/blob/48a6fb0c428440fc43dfb2fb4ea707fac9dd60b9/pkg/kubelet/server/stats/summary.go#L130-L133 https://github.com/kubernetes/kubernetes/blob/48a6fb0c428440fc43dfb2fb4ea707fac9dd60b9/pkg/kubelet/stats/helper.go#L290-L295
Skipping some caching indirection in cadvisor, we eventually get a cgroup manager which is different based on cgroup v1/v2 https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/helpers.go#L164-L169
These implementations differ in their calculation of memory usage for the root cgroup:
v1 uses memory usage from memory.usage_in_bytes: https://github.com/opencontainers/runc/blob/92c71e725fc6421b6375ff128936a23c340e2d16/libcontainer/cgroups/fs/memory.go#L204-L224
v2 uses /proc/meminfo and calculates usage as total - free (!!! this is key): https://github.com/opencontainers/runc/blob/92c71e725fc6421b6375ff128936a23c340e2d16/libcontainer/cgroups/fs2/memory.go#L217
usage_in_bytes is roughly RSS + Cache. working set is usage - inactive file.
back in cadvisor, we drop inactive_file for working set: https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/handler.go#L835-L844
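As a sanity check, the v1 node numbers below can be reproduced by hand from the root cgroup files (a sketch, run on the node itself):
# cgroup v1 root: working set = memory.usage_in_bytes - total_inactive_file
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '$1 == "total_inactive_file" {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "root working_set: $(( (usage - inactive_file) / 1024 / 1024 )) Mi"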
My hypothesis of the problem: using total - free counts inactive_anon as part of usage, which the kernel will not count: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memcontrol.c#n3720 (it seems to only count NR_ANON_MAPPED, not NR_INACTIVE_ANON (?))
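A quick way to check that is to redo the runc root calculation by hand from /proc/meminfo, with and without inactive_anon (a sketch; all values are in kB):
total=$(awk '$1 == "MemTotal:" {print $2}' /proc/meminfo)
free=$(awk '$1 == "MemFree:" {print $2}' /proc/meminfo)
inactive_file=$(awk '$1 == "Inactive(file):" {print $2}' /proc/meminfo)
inactive_anon=$(awk '$1 == "Inactive(anon):" {print $2}' /proc/meminfo)
# current v2 root calculation vs the proposed correction
echo "current v2 working_set: $(( (total - free - inactive_file) / 1024 )) Mi"
echo "proposed working_set:   $(( (total - free - inactive_file - inactive_anon) / 1024 )) Mi"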
In my test nodes, this almost exactly accounts for the diff I see.
Forgive some minor discrepancies in the precise numbers; these measurements were taken a few seconds apart, but they should highlight the issue.
~ # kubectl --context ace-v1 top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-nodepool1-57822035-vmss000000 98m 2% 1512Mi 12%
aks-nodepool1-57822035-vmss000001 99m 2% 1454Mi 11%
aks-nodepool1-57822035-vmss000002 94m 2% 1448Mi 11%
~ # cat /sys/fs/cgroup/memory/memory.usage_in_bytes
6236864512
~ # cat /sys/fs/cgroup/memory/memory.stat
cache 44662784
rss 3260416
rss_huge 2097152
shmem 65536
mapped_file 11083776
dirty 135168
writeback 0
pgpgin 114774
pgpgout 103506
pgfault 165891
pgmajfault 99
inactive_anon 135168
active_anon 3645440
inactive_file 5406720
active_file 39333888
unevictable 0
hierarchical_memory_limit 9223372036854771712
total_cache 5471584256
total_rss 767148032
total_rss_huge 559939584
total_shmem 1921024
total_mapped_file 605687808
total_dirty 270336
total_writeback 0
total_pgpgin 51679194
total_pgpgout 50291069
total_pgfault 97383769
total_pgmajfault 5610
total_inactive_anon 1081344
total_active_anon 772235264
total_inactive_file 4648124416
total_active_file 820551680
total_unevictable 0
memory.usage_in_bytes - memory.stat.total_inactive_file = 6236864512 - 4648124416 = 1515 Mi -> reported by kubelet
(see bottom for /proc/meminfo from both nodes)
usage = total - free = 16393244 - 9744148 = 6649096 Ki -> fairly close to above, 10s of Mi off.
working_set = total - free - inactive_file = 16393244 - 9744148 - 4525876 = 2073 Mi
proposed_usage = total - free - inactive_anon = 16393244 - 9744148 - 792 = 6648304 Ki
proposed_working_set = total - free - inactive_file - inactive_anon = 16393244 - 9744148 - 4525876 - 792 = 2072 Mi -> matches v2 usage roughly
~ # kubectl --context ace-v2 top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-nodepool1-38234323-vmss000000 113m 2% 2196Mi 17%
aks-nodepool1-38234323-vmss000001 112m 2% 2171Mi 17%
aks-nodepool1-38234323-vmss000002 113m 2% 2180Mi 17%
(see bottom for /proc/meminfo from both nodes)
usage = total - free = 16374584 - 9505980 = 6868604 Ki
working_set = total - free - inactive_file = 16374584 - 9505980 - 4608000 = 2207 Mi -> reported by kubelet
proposed_usage = total - free - inactive_anon = 16374584 - 9505980 - 791340 = 6077264 Ki
proposed_working_set = total - free - inactive_file - inactive_anon = 16374584 - 9505980 - 4608000 - 791340 = 1434 Mi -> matches v1 usage roughly
this is also consistent with pod level metrics working, while node level data is wrong (it only affects the root cgroup due to v2 behavior).
hoping to get a sanity check on that before moving forward. posted a thread in k8s slack fyi: https://kubernetes.slack.com/archives/C0BP8PW9G/p1687901371004299
Either one could work. The runc fix is likely "more correct", since cgroupv2 usage is overestimated. And less code :)
diff --git a/libcontainer/cgroups/fs2/memory.go b/libcontainer/cgroups/fs2/memory.go
index e3b857dc..2f3c9fec 100644
--- a/libcontainer/cgroups/fs2/memory.go
+++ b/libcontainer/cgroups/fs2/memory.go
@@ -111,6 +111,7 @@ func statMemory(dirPath string, stats *cgroups.Stats) error {
}
return err
}
+ memoryUsage.Usage = memoryUsage.Usage - stats.MemoryStats.Stats["inactive_anon"]
stats.MemoryStats.Usage = memoryUsage
swapUsage, err := getMemoryDataV2(dirPath, "swap")
if err != nil {
diff --git a/container/libcontainer/handler.go b/container/libcontainer/handler.go
index 2c4709e2..1dfcea82 100644
--- a/container/libcontainer/handler.go
+++ b/container/libcontainer/handler.go
@@ -832,6 +832,14 @@ func setMemoryStats(s *cgroups.Stats, ret *info.ContainerStats) {
inactiveFileKeyName = "inactive_file"
}
+ // correct usage to exclude inactive_anon.
+ // this would otherwise be counted for v2 but not v1.
+ // v1 only counts anon_mapped for usage
+ // alternatively, we could correct only working set.
+ if cgroups.IsCgroup2UnifiedMode() {
+ ret.Memory.Usage = ret.Memory.Usage - s.MemoryStats.Stats["inactive_anon"]
+ }
+
workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
if workingSet < v {
v2 meminfo
MemTotal: 16374584 kB
MemFree: 9505980 kB
MemAvailable: 14912544 kB
Buffers: 155164 kB
Cached: 5335576 kB
SwapCached: 0 kB
Active: 872420 kB
Inactive: 5399340 kB
Active(anon): 2568 kB
Inactive(anon): 791340 kB
Active(file): 869852 kB
Inactive(file): 4608000 kB
Unevictable: 30740 kB
Mlocked: 27668 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 148 kB
Writeback: 0 kB
AnonPages: 716552 kB
Mapped: 608424 kB
Shmem: 6320 kB
KReclaimable: 274360 kB
Slab: 355976 kB
SReclaimable: 274360 kB
SUnreclaim: 81616 kB
KernelStack: 8064 kB
PageTables: 7692 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8187292 kB
Committed_AS: 2605012 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 48092 kB
VmallocChunk: 0 kB
Percpu: 3472 kB
HardwareCorrupted: 0 kB
AnonHugePages: 409600 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 271624 kB
DirectMap2M: 8116224 kB
DirectMap1G: 10485760 kB
v1 meminfo
/# cat /proc/meminfo
MemTotal: 16393244 kB
MemFree: 9744148 kB
MemAvailable: 15020900 kB
Buffers: 132344 kB
Cached: 5207356 kB
SwapCached: 0 kB
Active: 1557252 kB
Inactive: 4526668 kB
Active(anon): 745916 kB
Inactive(anon): 792 kB
Active(file): 811336 kB
Inactive(file): 4525876 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 636 kB
Writeback: 0 kB
AnonPages: 618992 kB
Mapped: 624384 kB
Shmem: 2496 kB
KReclaimable: 285824 kB
Slab: 423600 kB
SReclaimable: 285824 kB
SUnreclaim: 137776 kB
KernelStack: 8400 kB
PageTables: 9060 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8196620 kB
Committed_AS: 2800016 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 40992 kB
VmallocChunk: 0 kB
Percpu: 4432 kB
HardwareCorrupted: 0 kB
AnonHugePages: 270336 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 302344 kB
DirectMap2M: 3891200 kB
DirectMap1G: 14680064 kB
I believe this might just be an issue with reporting rather than with memory consumption itself.
Try updating the base OS to Ubuntu 22. It seemed that kubectl top was more accurate then...
Describe scenario
Running a single-node Standard_B2s cluster (v1.25.6), with the following deployed:
The total memory currently consumed by all of my pods (kubectl top pods --sum=true -A) is 570Mi.
The total memory currently consumed by my node (kubectl top no) is 2164Mi, and this is reading as 100% of my available memory.
Question
Per this article, even if I should only expect ~66% of my provisioned RAM (in this case, around 2.5 GB), my pods are only consuming 570Mi, so what is consuming the other 1.5GB of RAM here (out of the 2164Mi total)?
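One way to see where the remaining memory is attributed is the kubelet summary API, which breaks the node working set down into pods plus the system containers (kubelet, runtime, etc.). A sketch, assuming jq is available and with the node name as a placeholder:
kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary | jq '{
  node_working_set: .node.memory.workingSetBytes,
  pods_total: ([.pods[].memory.workingSetBytes] | add),
  system_containers: [.node.systemContainers[] | {name: .name, working_set: .memory.workingSetBytes}]
}'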