Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Question] What is Consuming so much RAM in 1.25.x? #3715

Open chriscardillo opened 1 year ago

chriscardillo commented 1 year ago

Describe scenario

Running a single-node Standard_B2s cluster (v1.25.6), with the following deployed:

The total memory currently consumed by all of my pods (kubectl top pods --sum=true -A) is 570Mi.

The total memory currently consumed by my node (kubectl top no) is 2164Mi, and this reads as 100% of my available memory.

Question

Per this article, even if I should only expect ~66% of my provisioned RAM to be allocatable (in this case, around 2.5 GB), my pods are only consuming 570Mi. So what is consuming the other ~1.5GB of RAM here (out of the 2164Mi total)?
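
For reference, one way to compare the node's reported capacity with what Kubernetes marks as allocatable (the node name below is just a placeholder):

kubectl get node aks-nodepool1-00000000-vmss000000 \
  -o jsonpath='capacity: {.status.capacity.memory}{"\n"}allocatable: {.status.allocatable.memory}{"\n"}'

The capacity/allocatable gap covers the reservations described in the article, but it still doesn't explain the difference between my pods' 570Mi and the node's 2164Mi reading.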

chriscardillo commented 1 year ago

Additionally, here is a screenshot of the top of top, taken from a debug container on the single node earlier today:

[screenshot: top output from the node]

Feels pretty light, no?

motizukilucas commented 1 year ago

I've been having similar issues after upgrading as well.

There appears to be a huge difference between the OS-reported memory usage and kubectl top no.

And yet, this appears to have an impact on performance.

alexeldeib commented 1 year ago

And yet, this appears to have an impact on performance

an impact, or no impact?

I think we need to compare exactly what's being measured in each case. metrics-server should be looking at container_memory_working_set_bytes. This should come from kubelet -> cadvisor -> cgroupv2 memory.stat file.

Could you try comparing what you see in metrics server with the memory.stat file for the pod cgroup before/after?

e.g. for v1

cat /sys/fs/cgroup/memory/kubepods/$QOS_CLASS/pod$UID/memory.stat

e.g. for v2

cat /sys/fs/cgroup/kubepods.slice/$QOS_CLASS.slice/pod$UID.scope/memory.stat

or similar?
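
Concretely, something like this computes the working set the same way cadvisor does (usage minus inactive_file), so it can be lined up against the metrics-server number; $QOS_CLASS and $UID are placeholders as above, and the exact slice/scope layout may differ on your node:

# cgroup v2: pod usage is memory.current; the working set drops inactive_file
POD_CG=/sys/fs/cgroup/kubepods.slice/$QOS_CLASS.slice/pod$UID.scope
usage=$(cat "$POD_CG/memory.current")
inactive_file=$(awk '$1 == "inactive_file" {print $2}' "$POD_CG/memory.stat")
echo "working_set_bytes=$((usage - inactive_file))"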

motizukilucas commented 1 year ago

an impact, or no impact?

Could you try comparing what you see in metrics server with the memory.stat file for the pod cgroup before/after?

Hello, I meant an impact

I can't really compare before and after; as I understand it, it is not possible to revert an upgrade on AKS (if I'm wrong, please advise).

What I observe is the following:

Node pool memory usage before the upgrade was around 70% to 80%.

After the upgrade it sits around 100%-110%, and we have noticed some impact on the services running in the cluster.

If I add up the memory used by all pods in that node pool, the sum is some amount X, while kubectl top no reports roughly 4 times X. This can be very misleading.
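
One way to see both numbers from a single source is the kubelet summary API (the node name below is a placeholder):

# node-level working set vs. the sum of all pod working sets on that node
kubectl get --raw /api/v1/nodes/<node-name>/proxy/stats/summary \
  | jq '{node_working_set: .node.memory.workingSetBytes, pods_working_set_total: ([.pods[].memory.workingSetBytes] | add)}'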

motizukilucas commented 1 year ago

After reading the original issue where @chriscardillo first posted his concerns, I had some questions about the effect of Kubernetes version and region.

So I created 3 clusters to test this. They were created one right after the other, and none of them has anything deployed beyond what already comes with AKS by default.

What I found is that 1.24.x (in this case 1.24.9) does in fact consume less memory by default, at least as reported by kubectl top no, and 1.26.3 uses more memory in the EU region.
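
For anyone who wants to reproduce this, the clusters were created roughly along these lines (resource group and cluster names are placeholders; the versions are the ones compared above):

az aks create -g rg-memtest -n memtest-1249 --kubernetes-version 1.24.9 --node-count 1 --generate-ssh-keys
az aks create -g rg-memtest -n memtest-1263 --kubernetes-version 1.26.3 --node-count 1 --generate-ssh-keys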

[screenshots: kubectl top no output from the test clusters, 2023-06-19]

I'm still trying to collect more information and piece this together, but I thought it was interesting enough to be worth posting.

alexeldeib commented 1 year ago

I think this is an issue with cadvisor collection for root cgroup statistics.

Background

kubelet uses cadvisor for node-level statistics (whether it uses cri or cadvisor stats provider for pods): https://github.com/kubernetes/kubernetes/blob/3c380199e9895efb47c6ad9035b2438a0a8bce5e/pkg/kubelet/stats/provider.go#L42-L70

kubectl top node or kubectl get --raw /api/v1/nodes/foo/proxy/stats/summary | jq -C .node.memory shows higher values on cgroupv2 nodes than on nodes with the same cluster config + pods on cgroupv1.

kubectl top node uses node working set bytes: https://github.com/kubernetes-sigs/metrics-server/blob/5daafd91f74725f21f543c31682ecc3e3838a126/pkg/scraper/client/resource/decode.go#L34

kubelet stats provider uses cadvisor to get the cgroup stats: https://github.com/kubernetes/kubernetes/blob/48a6fb0c428440fc43dfb2fb4ea707fac9dd60b9/pkg/kubelet/server/stats/summary.go#L130-L133 https://github.com/kubernetes/kubernetes/blob/48a6fb0c428440fc43dfb2fb4ea707fac9dd60b9/pkg/kubelet/stats/helper.go#L290-L295

Skipping some caching indirection in cadvisor, we eventually get a cgroup manager which is different based on cgroup v1/v2 https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/helpers.go#L164-L169

These implementations differ in their calculation of memory usage for the root cgroup:

v1 uses memory usage from memory.usage_in_bytes https://github.com/opencontainers/runc/blob/92c71e725fc6421b6375ff128936a23c340e2d16/libcontainer/cgroups/fs/memory.go#L204-L224

v2 uses /proc/meminfo and calculates usage as total - free (!!! this is key): https://github.com/opencontainers/runc/blob/92c71e725fc6421b6375ff128936a23c340e2d16/libcontainer/cgroups/fs2/memory.go#L217

usage_in_bytes is roughly RSS + cache. Working set is usage - inactive_file.

back in cadvisor, we drop inactive_file for working set: https://github.com/google/cadvisor/blob/8164b38067246b36c773204f154604e2a1c962dc/container/libcontainer/handler.go#L835-L844
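
To make that difference concrete, these are roughly the raw sources each implementation ends up reading for the root cgroup (standard mount paths assumed; just a sketch):

# on a cgroup v1 node: root usage comes straight from the memory controller
cat /sys/fs/cgroup/memory/memory.usage_in_bytes

# on a cgroup v2 node: the root cgroup has no memory.current, so usage is derived from /proc/meminfo
awk '$1 == "MemTotal:" || $1 == "MemFree:" {print}' /proc/meminfo   # usage = MemTotal - MemFree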

Hypothesis

My hypothesis of the problem: using total - free counts inactive_anon as part of usage, which the kernel itself will not count: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memcontrol.c#n3720 (it seems to only count NR_ANON_MAPPED, not NR_INACTIVE_ANON (?))

In my test nodes, this almost exactly accounts for the diff I see.
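
A quick way to sanity-check that on a cgroup v2 node is to compute the working set with and without the inactive_anon term straight from /proc/meminfo (a sketch of the same arithmetic used below):

total=$(awk '$1 == "MemTotal:" {print $2}' /proc/meminfo)
free=$(awk '$1 == "MemFree:" {print $2}' /proc/meminfo)
inactive_file=$(awk '$1 == "Inactive(file):" {print $2}' /proc/meminfo)
inactive_anon=$(awk '$1 == "Inactive(anon):" {print $2}' /proc/meminfo)
echo "current  working_set_kib=$((total - free - inactive_file))"
echo "proposed working_set_kib=$((total - free - inactive_file - inactive_anon))"
# the difference between the two lines is exactly Inactive(anon)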


Forgive some minor discrepancies in the precise numbers; these measurements were taken a few seconds apart, but they should highlight the issue.


cadvisor cgroupv1

~ # kubectl --context ace-v1 top node
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-57822035-vmss000000   98m          2%     1512Mi          12%
aks-nodepool1-57822035-vmss000001   99m          2%     1454Mi          11%
aks-nodepool1-57822035-vmss000002   94m          2%     1448Mi          11%
~ # cat /sys/fs/cgroup/memory/memory.usage_in_bytes
6236864512
~ # cat /sys/fs/cgroup/memory/memory.stat
cache 44662784
rss 3260416
rss_huge 2097152
shmem 65536
mapped_file 11083776
dirty 135168
writeback 0
pgpgin 114774
pgpgout 103506
pgfault 165891
pgmajfault 99
inactive_anon 135168
active_anon 3645440
inactive_file 5406720
active_file 39333888
unevictable 0
hierarchical_memory_limit 9223372036854771712
total_cache 5471584256
total_rss 767148032
total_rss_huge 559939584
total_shmem 1921024
total_mapped_file 605687808
total_dirty 270336
total_writeback 0
total_pgpgin 51679194
total_pgpgout 50291069
total_pgfault 97383769
total_pgmajfault 5610
total_inactive_anon 1081344
total_active_anon 772235264
total_inactive_file 4648124416
total_active_file 820551680
total_unevictable 0

actual calculation

memory.usage_in_bytes - memory.stat.total_inactive_file = 6236864512 - 4648124416 = 1515 Mi -> reported by kubelet

"what if" cadvisor used v2 logic on this node

(see bottom for /proc/meminfo from both nodes)

usage = total - free = 16393244 - 9744148 = 6649096 Ki -> fairly close to above, 10s of Mi off.

working_set = total - free - inactive_file = 16393244 - 9744148 - 4525876 = 2123220 Ki = 2073 Mi

proposed_usage = total - free - inactive_anon = 16393244 - 9744148 - 792 = 6648304 Ki

proposed_working_set = total - free - inactive_file - inactive_anon = 16393244 - 9744148 - 4525876 - 792 = 2072 Mi -> matches v2 usage roughly

cadvisor cgroupv2

~ # kubectl --context ace-v2 top node
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-38234323-vmss000000   113m         2%     2196Mi          17%
aks-nodepool1-38234323-vmss000001   112m         2%     2171Mi          17%
aks-nodepool1-38234323-vmss000002   113m         2%     2180Mi          17%

(see bottom for /proc/meminfo from both nodes)

usage = total - free = 16374584 - 9505980 = 6868604 Ki

working_set = total - free - inactive_file = 16374584 - 9505980 - 4608000 = 2207 Mi -> reported by kubelet

proposed correction

proposed_usage = total - free - inactive_anon = 16374584 - 9505980 - 791340 = 6077264 Ki

proposed_working_set = total - free - inactive_file - inactive_anon = 16374584 - 9505980 - 4608000 - 791340 = 1434 Mi -> matches v1 usage roughly


This is also consistent with pod-level metrics being correct while node-level data is wrong (the v2 behavior only affects the root cgroup).

Hoping to get a sanity check on that before moving forward. Posted a thread in the Kubernetes Slack FYI: https://kubernetes.slack.com/archives/C0BP8PW9G/p1687901371004299


Potential fixes

Either one could work. The runc fix is likely "more correct", since cgroupv2 usage is overestimated. And less code :)

runc

diff --git a/libcontainer/cgroups/fs2/memory.go b/libcontainer/cgroups/fs2/memory.go
index e3b857dc..2f3c9fec 100644
--- a/libcontainer/cgroups/fs2/memory.go
+++ b/libcontainer/cgroups/fs2/memory.go
@@ -111,6 +111,7 @@ func statMemory(dirPath string, stats *cgroups.Stats) error {
                }
                return err
        }
+       memoryUsage.Usage = memoryUsage.Usage - stats.MemoryStats.Stats["inactive_anon"]
        stats.MemoryStats.Usage = memoryUsage
        swapUsage, err := getMemoryDataV2(dirPath, "swap")
        if err != nil {

cadvisor

diff --git a/container/libcontainer/handler.go b/container/libcontainer/handler.go
index 2c4709e2..1dfcea82 100644
--- a/container/libcontainer/handler.go
+++ b/container/libcontainer/handler.go
@@ -832,6 +832,14 @@ func setMemoryStats(s *cgroups.Stats, ret *info.ContainerStats) {
                inactiveFileKeyName = "inactive_file"
        }

+       // correct usage to exclude inactive_anon.
+       // this would otherwise be counted for v2 but not v1.
+       // v1 only counts anon_mapped for usage
+       // alternatively, we could correct only working set.
+       if cgroups.IsCgroup2UnifiedMode() {
+               ret.Memory.Usage = ret.Memory.Usage - s.MemoryStats.Stats["inactive_anon"]
+       }
+
        workingSet := ret.Memory.Usage
        if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
                if workingSet < v {

v2 meminfo

MemTotal:       16374584 kB
MemFree:         9505980 kB
MemAvailable:   14912544 kB
Buffers:          155164 kB
Cached:          5335576 kB
SwapCached:            0 kB
Active:           872420 kB
Inactive:        5399340 kB
Active(anon):       2568 kB
Inactive(anon):   791340 kB
Active(file):     869852 kB
Inactive(file):  4608000 kB
Unevictable:       30740 kB
Mlocked:           27668 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               148 kB
Writeback:             0 kB
AnonPages:        716552 kB
Mapped:           608424 kB
Shmem:              6320 kB
KReclaimable:     274360 kB
Slab:             355976 kB
SReclaimable:     274360 kB
SUnreclaim:        81616 kB
KernelStack:        8064 kB
PageTables:         7692 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8187292 kB
Committed_AS:    2605012 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       48092 kB
VmallocChunk:          0 kB
Percpu:             3472 kB
HardwareCorrupted:     0 kB
AnonHugePages:    409600 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      271624 kB
DirectMap2M:     8116224 kB
DirectMap1G:    10485760 kB

v1 meminfo

/# cat /proc/meminfo
MemTotal:       16393244 kB
MemFree:         9744148 kB
MemAvailable:   15020900 kB
Buffers:          132344 kB
Cached:          5207356 kB
SwapCached:            0 kB
Active:          1557252 kB
Inactive:        4526668 kB
Active(anon):     745916 kB
Inactive(anon):      792 kB
Active(file):     811336 kB
Inactive(file):  4525876 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               636 kB
Writeback:             0 kB
AnonPages:        618992 kB
Mapped:           624384 kB
Shmem:              2496 kB
KReclaimable:     285824 kB
Slab:             423600 kB
SReclaimable:     285824 kB
SUnreclaim:       137776 kB
KernelStack:        8400 kB
PageTables:         9060 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8196620 kB
Committed_AS:    2800016 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       40992 kB
VmallocChunk:          0 kB
Percpu:             4432 kB
HardwareCorrupted:     0 kB
AnonHugePages:    270336 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      302344 kB
DirectMap2M:     3891200 kB
DirectMap1G:    14680064 kB
motizukilucas commented 1 month ago

I believe this might just be an issue with reporting rather than with memory consumption itself.

Try updating the base OS to Ubuntu 22. It seemed that kubectl top was more accurate then...