Mirantis / cri-dockerd

dockerd as a compliant Container Runtime Interface for Kubernetes
Apache License 2.0

High cpu usage when running ListContainerStats #135

Open kinarashah opened 1 year ago

kinarashah commented 1 year ago

Seeing CPU spike to 200%+ for dockerd on an RKE1 1.24 cluster. RKE1 1.24 is the same as 1.23 with cri-dockerd enabled, so I'm using 1.23 to compare behavior with cri-dockerd enabled and disabled. https://github.com/rancher/rancher/issues/38816 has more context. The issue is easily observed without running any extra user workloads, but a simple way to reproduce is:

I haven't been able to reproduce it on overlay2 myself, but users have helped test it out.

ContainerStatsOneShot was introduced in v0.2.6, and I've run all tests on v0.2.6 to make sure it's accounted for. I'd appreciate any help in resolving this; things I have tried so far:

  1. Disabling cri-dockerd on <1.24 clusters shows minimal CPU usage; the difference is evident even on a single node (CPUs: 4, memory: 15.61 GiB):

    • w/o cri-dockerd: <10%; goes up to ~50% while deploying new pods, then drops back down
    • w/ cri-dockerd: 250%+, without deploying any new pods
  2. There's a big difference in CPU usage between Docker storage drivers. Based on strace for overlay, it seems to be iterating over the content in /var/lib/rancher-data/local-catalogs/, repeatedly calling lgetxattr() and newfstatat(), which appears to come from the calculation of WritableLayer.UsedBytes. I tried removing InspectContainerWithSize, but I am not sure what the bottleneck is in the /containers/"+containerID+"/json?size=1 calls (see the timing sketch after this list).

    • overlay, w/o InspectContainerWithSize: <20%; adding ~250 new pods, goes up to ~30%
    • no major change for overlay2
  3. It looks like this is because upstream k8s previously relied on cAdvisor stats: https://github.com/kubernetes/kubernetes/blob/v1.22.16/pkg/kubelet/kubelet.go#L710. Is there a way to keep using the cAdvisor provider as a workaround? Changing the code to force the use of cAdvisor stats doesn't work with the new runtime changes; it fails at https://github.com/kubernetes/kubernetes/blob/v1.24.8/pkg/kubelet/stats/cadvisor_stats_provider.go#L84

  4. Updating the code to cache stats. These are experimental changes only; I believe they would need more refinement to handle container add/remove and to force an update when a container ID is not found. But the number is still high enough without deploying any user pods, and it is bound to get worse.

  5. Combining changes from 3 and 4:

    • overlay: container caching + no goroutine + w/o InspectContainerWithSize: <10% (same as w/o cri-dockerd)
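
To get a feel for where the time goes in point 2, a rough standalone timing sketch along these lines could compare a plain inspect, an inspect with the size calculation (the `?size=1` call behind InspectContainerWithSize), and a one-shot stats call. This is illustrative only, assumes the Docker Go client and a reachable local daemon, and is not cri-dockerd code; the container ID is passed as the first argument.

```go
// Illustrative timing sketch only (not cri-dockerd code). Compares a plain
// inspect, an inspect with the size calculation (?size=1), and a one-shot
// stats call against a local Docker daemon.
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/docker/docker/client"
)

func main() {
	id := os.Args[1] // container ID to measure against

	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	start := time.Now()
	if _, err := cli.ContainerInspect(ctx, id); err != nil {
		panic(err)
	}
	fmt.Printf("inspect without size: %v\n", time.Since(start))

	// getSize=true asks the daemon to walk the container's writable layer to
	// fill in SizeRw/SizeRootFs, which is where the lgetxattr()/newfstatat()
	// churn showed up under strace on the overlay driver.
	start = time.Now()
	if _, _, err := cli.ContainerInspectWithRaw(ctx, id, true); err != nil {
		panic(err)
	}
	fmt.Printf("inspect with size:    %v\n", time.Since(start))

	// One-shot stats (ContainerStatsOneShot, mentioned above) for comparison.
	start = time.Now()
	st, err := cli.ContainerStatsOneShot(ctx, id)
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, st.Body)
	st.Body.Close()
	fmt.Printf("one-shot stats:       %v\n", time.Since(start))
}
```

Running it against a few containers on overlay vs. overlay2 should show whether the size calculation dominates.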

Anything else I could try out? Thank you!

phoenix-bjoern commented 1 year ago

This issue makes it impossible to run K8S >=1.24 and is a blocker for users relying on the Dockershim. Mirantis promised to take over responsibility for the Dockershim (https://www.mirantis.com/blog/the-future-of-dockershim-is-cri-dockerd/) and it is simply not working out. So please fix this issue urgently, as otherwise people have to stay on 1.22/1.23 or migrate to containerd. // cc @evol262

neersighted commented 1 year ago

Hi @phoenix-bjoern. I don't work on cri-dockerd beyond providing issue triage, but I do want to address your statement, as I don't want others to be misled.


Mirantis is maintaining cri-dockerd in collaboration with Docker Inc. and the container community as a community-driven open source project. As I am sure you can infer, the two sets of named contributors are most interested in supporting their own products. While we are all engineers very interested in the container and Kubernetes space and we want to solve all the issues we come across, the reality is that there is limited time in the day and that the companies who write our paychecks get to dictate our priorities.

The performance regression in this issue appears to take place in some Kubernetes distros due to an upstream change in Kubernetes. It does not affect all users, or indeed the most prolific users of cri-dockerd. As such, I expect that improvements will come from the Rancher folks (who work regularly on this component), from another member of the community who takes it upon themselves to solve this, or from the Mirantis Kubernetes Engine team, in order of timeliness.

Again, that is not to say that there is no desire to investigate this; it is just to say that, when allocating limited engineering resources, it will take some time to justify working on an issue that does not affect any of the Kubernetes deployments/distributions the regular contributors work on or are familiar with.

Bottom line up front: Mirantis maintains this repository for the community and its own products. That means that Mirantis is committed to delivering its own changes under an open source license, and that Mirantis is curating the issue tracker and providing code review/maintainer feedback as free cycles permit. It does not mean that Mirantis has pledged to fix every issue reported on a fixed timeline, or that you have a support/contractual relationship with Mirantis.


Please keep in mind this is not an official statement by any means, and that I am not part of any of the teams mentioned above.

evol262 commented 1 year ago

This appears to be almost exactly the same issue as #127. We'll see if we can find some kind of workaround.

The basic point there is that asking a bare CRI to inject (or deploy, anyway) a full cAdvisor stack to work around the fact that KEP 2371 is not in a good state is unattractive, and the more or less official response from upstream ("either dramatically reduce the stats you collect, or deploy full-bore cAdvisor yourself") doesn't leave a lot of alternatives here.

We're happy to work with Rancher (and Mirantis will have to do something ourselves in Mirantis Kubernetes Engine, but that's honestly likely to be deploying cAdvisor sets), but there isn't a lot we can do technically other than writing a temporary layer that we'll have to throw away when/if 2371 is ever ready. The memoizing/caching patch in the original comment looks reasonable, and implementing ListPodSandboxStats is an ok fallback.
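
For reference, below is a minimal, hypothetical sketch of what such a memoizing/caching layer could look like. It is not the actual patch from the original comment; the type names and the fetch callback are placeholders, and a real version would still need the add/remove handling and the staleness trade-off discussed in this thread.

```go
// Hypothetical sketch of a TTL-based stats cache; placeholder names,
// not cri-dockerd code.
package stats

import (
	"sync"
	"time"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

type cachedStats struct {
	stats     *runtimeapi.ContainerStats
	fetchedAt time.Time
}

type statsCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cachedStats
}

func newStatsCache(ttl time.Duration) *statsCache {
	return &statsCache{ttl: ttl, entries: map[string]cachedStats{}}
}

// get returns cached stats for a container if they are fresh enough,
// otherwise it calls fetch and stores the result. Concurrent callers may
// fetch redundantly; that is acceptable for a sketch. Serving entries up to
// ttl old is exactly the stale-result risk noted for spiky workloads.
func (c *statsCache) get(id string, fetch func(string) (*runtimeapi.ContainerStats, error)) (*runtimeapi.ContainerStats, error) {
	c.mu.Lock()
	entry, ok := c.entries[id]
	c.mu.Unlock()
	if ok && time.Since(entry.fetchedAt) < c.ttl {
		return entry.stats, nil
	}

	fresh, err := fetch(id)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[id] = cachedStats{stats: fresh, fetchedAt: time.Now()}
	c.mu.Unlock()
	return fresh, nil
}

// remove should be called when a container is deleted so the cache does not
// keep serving stats for containers that no longer exist.
func (c *statsCache) remove(id string) {
	c.mu.Lock()
	delete(c.entries, id)
	c.mu.Unlock()
}
```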

It's something we need to have an actual discussion about to see how much effort RKE wants to put into this (the best workaround would be to just deploy full-scale cAdvisor as part of k3s/RKE and stop relying on the kubelet at all, considering the stutter-stop state of the KEP).

Memoizing/caching carries a very real possibility of returning results which do not reflect reality, depending on the Prometheus scrape interval, especially for spiky workloads (Kubeflow, for example).

phoenix-bjoern commented 1 year ago

An update from an affected Ubuntu 20.04 cluster: the high load went back to normal after upgrading the kernel from 5.4.0-132 to 5.4.0-135. Not sure if the containerd fix (https://launchpad.net/ubuntu/+source/linux/5.4.0-135.152) has something to do with it. docker stats, which didn't respond when the CPU peak was affecting the system, now works as usual. Maybe @kinarashah can test this in her setup to see if a kernel update mitigates/fixes the problem for her as well.

kinarashah commented 1 year ago

@evol262 Since https://github.com/kubernetes/kubernetes/issues/106080#issuecomment-964845022 was written before the KEP's status moved to beta, I assumed it was only a workaround for the time being. But looking at the status now, it does seem like this won't get implemented anytime soon, because it will still be in alpha for 1.25.

I am okay with making the changes around caching or ListPodSandboxStats, but the improvements aren't consistent and, as you mentioned, there is a possibility of returning stale results. I'll test some more to see if there are more changes that could help.

As for the workaround of deploying cAdvisor with k3s/RKE, I would like to implement and test this because it would unblock a lot of users. So cri-dockerd would be updated to fetch metrics from cAdvisor and aggregate the results (as the kubelet did before the CRI refactoring)? If so, I am a little confused by "stop relying on the kubelet at all"...will clients like metrics-server then have to collect metrics from cAdvisor and aggregate them themselves?

(Also, my original understanding from the KEP was that all container runtimes will be responsible for full stats and not cAdvisor. Would we be going against the upstream expectation here?)

kinarashah commented 1 year ago

@phoenix-bjoern Thanks much for the update! I don't see a huge improvement on my setup. But this is not surprising; I have seen varying results from different users. It comes down to how fast metrics are returned, which depends on a lot of factors, but the main issue in general seems to be how long (and how often) cri-dockerd takes to get metrics using the Docker client.

evol262 commented 1 year ago

No, we are not going to implement some subset of Prometheus, and cri-dockerd would not be updated to fetch metrics from cAdvisor. Instead, the Prometheus deployment in k3s/RKE would point at "full" cAdvisor. This is the recommendation from upstream.

No, we would not be going against upstream. Stats collection from "pure" CRI will be implemented when the KEP is ready (or at least graduates from beta). The understanding from the KEP (that container runtimes will be responsible for full stats and not cAdvisor) was right -- it's more that the KEP kept getting pulled back, and cAdvisor was removed anyway, leaving it in kind of a no-man's land.

cri-dockerd is (still) dockershim and doesn't poll any more often (or take longer) than dockershim did. It's being polled far more often by the kubelet now. I'll (still) look into whether there's some kind of maintainable workaround until the KEP is ready, but "full" cAdvisor and polling that from Prometheus in platform solutions is still the recommendation, both upstream and here.

kinarashah commented 1 year ago

@evol262 Yeah, I agree with the Prometheus deployment pointing to full cAdvisor. Though the CPU spike we're currently seeing happens without any monitoring deployed in the cluster; it comes only from internal autoscaling pipelines (I checked only metrics-server but would assume the same for kubectl top). This led me to think there'd be more places to update.

Yep, I see cri-dockerd is still dockershim and the stats codebase is also pretty similar, without any big changes. But when running without cri-dockerd, I see the kubelet uses NewCadvisorStatsProvider because useLegacyCadvisorStats is true. In fact, with cri-dockerd disabled, I see no calls to the Docker API (/v1.41/containers/<ID>/json and /v1.41/containers/<ID>/stats) at all, so I couldn't do a direct performance comparison between the internal dockershim and cri-dockerd.

nwneisen commented 7 months ago

Has anyone tested if they still have high CPU usage after #233 was added?

phoenix-bjoern commented 7 months ago

@nwneisen I haven't experienced this issue again (currently using Rancher 2.8.0, RKE1, K8S 1.26). @kinarashah released several fixes for it, but I don't remember which one exactly resolved the issue. I guess the ticket can be closed.

tmsdce commented 7 months ago

Hi @nwneisen @phoenix-bjoern

I think that Rancher ended up using a forked kubelet to resolve this issue (see the rancher issue here https://github.com/rancher/rancher/issues/38816 and PR here https://github.com/rancher/kubernetes/pull/74).

There is an ongoing discussion in https://github.com/rancher/rke/issues/3409 to see if there is still a need for this forked kubelet as https://github.com/Mirantis/cri-dockerd/pull/233 is now merged.

So from what I understand, https://github.com/Mirantis/cri-dockerd/pull/233 still needs to be tested in a Rancher/RKE context without the forked kubelet.