On a slightly interesting note... I'm running Ubuntu on all four nodes. I ssh'd into each node and ran this on each:
sudo snap disable microk8s
Then after they all went down:
sudo snap enable microk8s
And now everything is running again. HA and everything. I'll monitor and see if somehow this stays alive for more than a couple of hours. But that's an encouragingly easy fix.
@jhughes2112 -- thanks for the report. If the issue happens again, and if it's convenient, could you do a perf capture as described here, as well as sharing a debug log from the microk8s leader node as described here? We're working on creating a special isolated release of microk8s that will have more useful diagnostics for this issue enabled out of the box, but in the meantime that manually-collected extra data is helpful for spotting commonalities between affected clusters. In the same spirit, a general description of what workload your cluster is handling, any extra tools like Flux that you've deployed, and when the problem started to manifest would all be great to have.
Will do. We actually found a clue. We're running RedPanda (kafka) in a container, and it requires about 11k AIO per instance. We're running quite a few of these, and when things fell over, we were getting errors about exhausting the AIO count. It's likely that dqlite is using async io to try to talk to the disk, but can't because there aren't any available. I did run iostat and iotop on a failing node and saw almost zero activity on the disk, but 80-100% cpu on dqlite and/or kubelet while it was failing. It all seems quite related to disk activity. We're refactoring so we never run this many, and hopefully won't see this again.
So, maybe if you want to reproduce this... launch a bunch of copies of RedPanda and see if you can take down your own cluster?
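(For anyone who wants to check the same theory on their own nodes, here is a minimal sketch using standard Linux sysctls, nothing microk8s-specific: compare the current kernel AIO usage against the system-wide cap.)

```bash
# Number of kernel AIO requests currently allocated vs. the system-wide limit.
cat /proc/sys/fs/aio-nr
cat /proc/sys/fs/aio-max-nr

# If aio-nr sits at (or very near) aio-max-nr, io_submit(2) starts failing with
# EAGAIN, which matches the "exhausting the AIO count" errors described above.
# Raising the cap is one way to confirm the theory (the value here is just an example):
sudo sysctl -w fs.aio-max-nr=1048576
```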
It's likely that dqlite is using async io to try to talk to the disk, but can't because there aren't any available.
Indeed, dqlite's libraft uses kernel AIO for disk reads and writes under the hood, and it has a fallback codepath that just does the normal syscalls on a thread from the libuv blocking pool. So you could certainly see some extra CPU usage as a result of io_submit(2) frequently failing. I'm not sure whether that's likely to be the explanation for the other reports in this issue thread, though.
In my case, kubectl and microk8s status are completely non-responsive, but all my pods are running fine as long as they don't need to speak to the control plane. It sounds exactly the same as raft timeouts and disk starvation. Just a different way to arrive at it than some of the others, perhaps.
Hmm, not sure if it's related, but the talk about disk access reminded me that I often used to run into "failed to create fsnotify watcher: too many open files" errors, e.g. when running commands like kubectl logs -f .... I worked around this by running
sysctl -w fs.inotify.max_user_watches=100000
sysctl -w fs.inotify.max_user_instances=100000
and appending those lines to /var/snap/microk8s/current/args/containerd-env
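(A possible way to make those limits survive a reboot, as a sketch; the drop-in filename below is arbitrary.)

```bash
# Persist the raised inotify limits via a sysctl drop-in (filename is arbitrary).
cat <<'EOF' | sudo tee /etc/sysctl.d/90-inotify.conf
fs.inotify.max_user_watches=100000
fs.inotify.max_user_instances=100000
EOF

# Reload all sysctl configuration files so the new values take effect now.
sudo sysctl --system
```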
I am investigating my workload against k0s-distributed Kubernetes 1.26.2-RC. Additionally, the following versions of Kubernetes controllers differ compared to my EKS deployment:
I will experiment with the different versions today. 1.26.2-RC has a kube-apiserver memory leak, so I will be resetting that work back to 1.26.1 today and observing it overnight with my workload. etcd does not show any unusual behavior.
@cole-miller generated some valuable traces on my microk8s cluster that point to some application misbehavior interacting with watches too. We shall see.
I can confirm that k0sctl-deployed vanilla Kubernetes with etcd does not exhibit the issue on my workload.
@cole-miller - We don't use flux on our system.
The main components we have deployed are:
Hi all -- thanks for bearing with us. @neoaggelos kindly helped me put together a special channel for the microk8s snap, 1.26/edge/debug-dqlite, that sets up dqlite and k8s-dqlite to collect some more useful/interpretable diagnostics. If you're able to install it on your cluster and still reproduce the issue, we'd really appreciate you doing the following to gather data:
1. Add --debug to /var/snap/microk8s/current/args/k8s-dqlite and LIBDQLITE_TRACE=1 to /var/snap/microk8s/current/args/k8s-dqlite-env.
2. Run perf record -g -p $(pidof k8s-dqlite). (These should be more useful with the special snap release, which includes more debuginfo than previously.)
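(Roughly, those steps translate into something like the following sketch, which assumes restarting the daemon is enough to pick up the new flags; run the perf capture while the CPU spike is happening.)

```bash
# Enable verbose k8s-dqlite logging and libdqlite tracing.
echo '--debug' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite
echo 'LIBDQLITE_TRACE=1' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite

# Capture a call-graph profile of the dqlite process for one minute during a spike.
sudo perf record -g -p "$(pidof k8s-dqlite)" -- sleep 60
sudo perf report   # or `perf script` to produce text output for sharing
```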
Hi all, @cole-miller, we are facing the same problem. We managed to switch the snap to the 1.26/edge/debug-dqlite channel and enable the logs. Our cluster shows the master dqlite process constantly eating 100% of a core, with spikes up to 600%. Attached the logs: snap.microk8s.daemon-k8s-dqlite.log
Thank you very much @frasmarco! A perf capture would be great to have too, if the cluster is still available to run diagnostics on. Appreciate your help and sorry that this issue is affecting you.
The perf data is huge, so I processed it with perf script. You can find it here: https://www.dropbox.com/s/zbki1740q1jix4g/perf.script.bz2?dl=0
Thanks again @frasmarco, this is great. Flamegraph for everyone to look at (GitHub strips some useful JS out, I'll post the full thing on MM):
With the debuginfo for libsqlite3 available, I notice that we spend a lot of time in likeFunc, which implements the WHERE x LIKE pattern SQL feature. The distinct queries that kine sends us that use LIKE are (in decreasing order of frequency based on the logs from frasmarco):
Perhaps those queries would be a good target for optimization on the kine side.
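(As a rough illustration of the kind of rewrite that avoids likeFunc, not the actual kine queries and using a simplified stand-in for its schema: a prefix LIKE can usually be expressed as a range predicate that SQLite can answer from an index.)

```bash
# Illustrative only: compare a prefix LIKE (which calls likeFunc per row) with an
# equivalent range predicate that SQLite can serve from an index. Simplified schema.
sqlite3 /tmp/like-demo.db <<'SQL'
CREATE TABLE IF NOT EXISTS kine (id INTEGER PRIMARY KEY, name TEXT, value BLOB);
CREATE INDEX IF NOT EXISTS kine_name_index ON kine(name);

-- Full table scan, calling likeFunc for every candidate row:
EXPLAIN QUERY PLAN SELECT id FROM kine WHERE name LIKE '/registry/pods/%';

-- Same prefix match expressed as a range ('0' is the next character after '/'),
-- which the index can answer directly:
EXPLAIN QUERY PLAN
  SELECT id FROM kine WHERE name >= '/registry/pods/' AND name < '/registry/pods0';
SQL
```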
@frasmarco, @djjudas21, @doctorpangloss, and any other affected users -- if you've still got an affected cluster node around, ideally a recent leader, please consider keeping at least the contents of /var/snap/microk8s/current/var/kubernetes/backend around. I don't recommend that you share those files here since they likely contain some sensitive information, but I'm working on a script for you to run an analysis on them locally.
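(One way to snapshot that directory for later analysis, as a sketch; stopping microk8s first keeps the dqlite files quiescent.)

```bash
# Archive the dqlite data directory before the node gets rebuilt or files rotate.
sudo microk8s stop
sudo tar czf ~/microk8s-backend-$(date +%Y%m%d).tar.gz \
  -C /var/snap/microk8s/current/var/kubernetes backend
sudo microk8s start
```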
Here's a script to dump the microk8s database in the normal sqlite3 format:
#!/bin/bash
# No arguments. Make sure microk8s is not active before running.
# If you're running this on another machine to which the contents
# of the kubernetes `backend` directory have been copied, set DATA_DIR
# to the directory containing those files. Needs a Go toolchain, unless
# you've already installed the `dqlite-demo` and `dqlite` binaries.
set -eu
shopt -s extglob nullglob

setup-go-dqlite() {
    add-apt-repository -y ppa:dqlite/dev
    apt install -y libdqlite-dev
    git clone https://github.com/canonical/go-dqlite --depth 1
    cd go-dqlite
    export CGO_LDFLAGS_ALLOW=-Wl,-z,now
    go build -o dqlite-demo cmd/dqlite-demo/dqlite-demo.go
    go build -o dqlite cmd/dqlite/dqlite.go
    demo="$PWD/dqlite-demo"
    dqlite="$PWD/dqlite"
    cd ..
}

demo=dqlite-demo
dqlite=dqlite
which dqlite-demo && which dqlite || setup-go-dqlite

datadir="${DATA_DIR:-/var/snap/microk8s/current/var/kubernetes/backend}"
mkdir -p analysis-scratch
cd analysis-scratch
cp "$datadir"/{*.yaml,open*,snapshot*,metadata*,000*-000*} .
# Overwrite cluster.yaml to contain just this node.
cp localnode.yaml cluster.yaml
dbaddr=$(head -n1 info.yaml | sed 's/Address: \(.\+\)$/\1/')

"$demo" --dir . --api 127.0.0.1:9999 --db $dbaddr &
demopid=$!
echo Demo server running, API address 127.0.0.1:9999, DB address $dbaddr, PID $demopid
echo Waiting for server to come online...
sleep 5

"$dqlite" -s $dbaddr k8s ".dump $dbaddr"
cd ..
cp analysis-scratch/db.bin{,-wal} .
echo Database dumped successfully to db.bin and db.bin-wal
I'll follow up with some analysis queries you can run against that database.
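(The follow-up queries aren't reproduced in this thread; purely as an illustration of what the dump enables, and assuming kine's usual schema with a kine table and name/value columns, something like the following shows what dominates the datastore.)

```bash
# Hypothetical example queries against the dumped database (db.bin and db.bin-wal
# must be in the current directory, as produced by the script above).
sqlite3 db.bin <<'SQL'
-- Total number of rows in the kine table.
SELECT count(*) FROM kine;

-- The ten largest stored values, to see which objects dominate the datastore.
SELECT name, length(value) AS bytes FROM kine ORDER BY bytes DESC LIMIT 10;
SQL
```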
We are working on a PR with a number of query optimizations, including the "LIKE" removal mentioned in https://github.com/canonical/microk8s/issues/3227#issuecomment-1468440350. Have a look: https://github.com/canonical/kine/pull/13. Stay tuned!
I have been running MicroK8s since v1.17, I think, and it has generally been rock solid, only breaking when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting stability. I have personally lost data due to this dqlite CPU problem (long story short: I lost dqlite quorum, which broke the kube-apiserver, but I was using OpenEBS/cStor clustered storage, which depends on Kubernetes quorum for its own quorum. When it lost quorum and the kube-apiserver became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).
A colleague pointed me at this post as this is very similar behavior to what I have been experiencing and have been able to reliably replicate.
I run v1.26, 3-node HA, with an Argo Workflows instance.
I have a workflow that scales out ~2000 jobs to download JSON data from an HTTP endpoint, then dumps it into a MongoDB instance. Each job should take no more than a couple of seconds to complete.
When the workflow's parallel setting is set to run 100 simultaneous download jobs and left to run for a bit, it nukes the cluster like you describe and requires a rebuild. Dialing back to 50 or fewer jobs in parallel does not cause the issue.
@bradleyboveinis Thanks for the report. Were you experiencing this with versions of microk8s before 1.26?
I couldn't tell you unfortunately @cole-miller, the offending workflow was only created recently so it has only been run on 1.26.
@bradleyboveinis No problem! If you're able to gather some diagnostic info as described above, we'd appreciate it.
Hi all, we have been working on a number of optimizations to the k8s-dqlite service. You can find a build with our progress in the latest/edge/k8s-dqlite channel (install it with snap install microk8s --classic --channel=latest/edge/k8s-dqlite). In the currently available build the queries have been revised so they do not include the LIKE SQL operator. That should reduce the CPU utilization and/or make the CPU spikes shorter in duration. We would be grateful for any feedback you could give us. Many thanks.
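(If you refresh an existing node, a quick sanity check to confirm which channel and revision it actually ended up on.)

```bash
# Show the installed revision and the channel the snap is tracking.
snap list microk8s
snap info microk8s | grep -i tracking
```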
@ktsakalozos I tried installing it on a 3-node cluster with snap refresh microk8s --channel=latest/edge/k8s-dqlite but, on the first node I tried, it fails to start with the following error.
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=trace msg="QUERY ROW [] : SELECT COUNT(*) FROM key_value"
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=trace msg="QUERY ROW [] : SELECT COUNT(*) FROM kine"
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=fatal msg="Failed to start server: kine: building kine: sqlite client: near \"AS\": syntax error\n"
Thank you for the quick feedback @sbidoul. What was the version and history of the cluster you are refreshing? This looks like a problem in the datastore migration path that we have not encountered. You should be able to snap revert to get back to the old snap version until we see what is missing.
@ktsakalozos this cluster was created on 1.25 and was recently upgraded to v1.26.1 (snap rev 4595). I reverted to 1.26 using snap refresh microk8s --channel=1.26/stable.
Let me know if there is anything I can do to help diagnose this error.
@sbidoul the refresh path 1.25 to 1.26 to latest/edge/k8s-dqlite worked for me.
Is it a multi-node or single node cluster? What is the host operating system? Could you share an inspection tarball?
@ktsakalozos it is a 3-node cluster on Ubuntu 22.04 in LXD virtual machines. Can I DM you (a link to) the inspection tarball?
We just pushed a new revision to the latest/edge/k8s-dqlite channel. This should address the issue you spotted, @sbidoul. Thank you so much for giving it a test drive.
Note that the snap in the latest/edge/k8s-dqlite channel will perform at its best on new nodes, not on ones that get refreshed to it.
As someone who has no idea what he's doing (see #3859), I noticed 100% CPU usage on k8s-dqlite. I just tried running the node under this new latest/edge/k8s-dqlite release; it didn't change anything. Still getting constant 100% usage, still getting "memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request", still having pods not reacting to any commands I launch, and lots of timeouts.
I'm now down to a single-node (non-HA) cluster because I assumed it might help recovery; it did not. I don't run any special workload; the only "special" thing I have is Longhorn as the CSI volume provider, which I kinda selected at random after reading that the microk8s plugins weren't stable for production work.
@Dunge once you are in HA mode (i.e. you have at least 3 control plane nodes) you cannot drop to a non-HA setup without having the cluster freeze. The reason is that the cluster loses quorum and the datastore locks up so that it does not risk any form of corruption. To recover from this state (lost quorum) you should follow the instructions in https://microk8s.io/docs/restore-quorum
@ktsakalozos yes, as I mentioned in my other issue thread, I did make sure my cluster.yaml file had only one entry, launched the reconfigure command, and microk8s status lists only one node with no HA. I haven't tried to make the other two rejoin yet, but it was broken even before I tried to make them leave anyway. Is there any flag somewhere to check if I'm locked in that "frozen datastore" state? You can continue this discussion in the other thread, I don't want to pollute this one. I was just pitching in to say I also noticed 100% CPU usage on dqlite.
@Dunge could you please follow the instructions in https://github.com/canonical/microk8s/issues/3227#issuecomment-1464823623 so we have a better understanding of what the datastore is doing?
Here's a zip of all 3 files
This is using latest/edge/k8s-dqlite, NOT 1.26/edge/debug-dqlite.
As I said, I have no clue whether this is related to the issue here, other than that I have 100% CPU usage on dqlite. My cluster status is completely broken. I got errors just running microk8s start and status the first few times; it took about 5 minutes before it really started. I'll try to make my two other nodes join back by copying the datastore and issuing join commands now.
Note that the snap in the latest/edge/k8s-dqlite will perform at its best on new nodes and not on the ones that get refreshed to it.
@ktsakalozos I'm confused by this statement. Why is it so? Does this branch change the dqlite database schema? Is it possible to revert to 1.26 after installing it?
@ktsakalozos @cole-miller I could capture a flamegraph of both k8s-dqlite and kubelite during one of the events, using 1.26/edge/debug-dqlite.
So the issue in my case is probably elsewhere than in dqlite, as the chart I posted in https://github.com/canonical/microk8s/issues/3227#issuecomment-1452648321 hinted.
Do you have any idea about where these squashfs decompression frames come from in kubelite?
Oh, I realize GitHub does not preserve the SVG file. So here is the perf data for kubelite + k8s-dqlite, as well as for all processes on the machine during a similar event: perf.tar.gz. The issue seems related to snap squashfs decompression, with kubelite being the culprit (or suffering the most).
@sbidoul -- thanks for the data. If kubelite is the CPU hog on your cluster then I think you are seeing a different problem from some of the other people in this issue thread, who reported that k8s-dqlite was the culprit.
@ktsakalozos I'm confused by this statement. Why is it so? Does this branch change the dqlite database schema? Is it possible to revert to 1.26 after installing it?
The latest/edge/k8s-dqlite build introduces two new indexes. On existing deployments that refresh to the new build, both the old and new indexes are needed because there might be a mix of old and new nodes.
@cole-miller @ktsakalozos thank you for your guidance on collecting perf data so far. It has been helpful in narrowing down the problem. I agree my case is not 100% CPU on dqlite, but it is still microk8s components causing a huge CPU load. In the perf stack capture in https://github.com/canonical/microk8s/issues/3227#issuecomment-1490261518, we see kubelite, k8s-dqlite and containerd (from the microk8s snap) spending most of their time handling page faults and squashfs decompression (I assume this is related to the snap loop mounts?). I did a perf stat -p {kubelite pid} and I see it doing hundreds to thousands of page faults per second during such incidents. At the same time, workloads running on the same node seem to behave normally and the node does not have memory pressure that would explain thrashing.
By any chance would you have any other suggestion to diagnose this further? This happens only on one of my several microk8s clusters, which all share a similar configuration, and I'm running out of ideas.
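(A couple of generic follow-ups that might help narrow this down, as a sketch: major faults are the ones that actually hit the disk, e.g. re-reading pages of the snap's squashfs, so comparing major versus minor faults on kubelite could show whether the kernel keeps evicting its executable pages.)

```bash
# Count page faults on kubelite for 30 seconds; a high major-fault rate means
# pages are being re-read from disk (e.g. the snap squashfs) rather than from RAM.
sudo perf stat -e page-faults,minor-faults,major-faults -p "$(pidof kubelite)" -- sleep 30

# System-wide memory/IO picture during an incident (swap activity, IO wait, etc.).
vmstat 1 10
```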
@sbidoul do you see anything interesting in dmesg? Could this be a hardware issue? Maybe we should bring the question to the snapcraft people over at https://forum.snapcraft.io/
@ktsakalozos I think I have ruled out hardware issues, since the problem occurs on 3 nodes, one of which is on completely different hardware, and it also occurred in a cloud VM I had in the same cluster at some point. Nothing suspicious in dmesg either.
One thing that maybe stands out is that I have MinIO running in the cluster, with 2.5 million objects, so 2 volumes on each of these nodes with lots of files. Could it be that there is something in kubelite (cAdvisor?) that could be negatively influenced by that?
Hello, I have the same problem with a single-node cluster. Version of microk8s: 1.24. Here is a flamegraph:
@texaclou in the graph you shared there is a likeFunc that has been identified as one potential bottleneck. In the 1.27 release that came out today we have worked towards eliminating the use of this LIKE-pattern-matching operation so the performance we get from k8s-dqlite should be improved. This patch however is not available on pre-1.27 releases.
Seeing the same issue, dqlite started sitting at 100% a few days ago on a 3 node homelab cluster. Was initially running 1.24, upgraded to 1.25 with no changes. Cluster is basically non-functional at this point :(
My cluster basically got fully borked now... sqlite requests are timing out because they are taking over 30s...
Trace[553469342]: ---"Write to database call finished" len:155,err:Timeout: request did not complete within requested timeout - context deadline exceeded 34001ms (19:58:09.246)
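(For anyone wanting to check whether their apiserver is hitting the same slow datastore writes, a sketch using the systemd unit names the microk8s snap installs.)

```bash
# Search recent kubelite (kube-apiserver) logs for traces that blame slow datastore writes.
sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" \
  | grep -E 'Write to database call finished|context deadline exceeded'
```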
Hi @alexgorbatchev would it be possible to share an inspection tarball and also try the 1.27 release where we have introduced a set of datastore improvements?
Sure, here it is. Unfortunately I can't upgrade to 1.27 because I'm using Longhorn for storage, which currently supports up to 1.25.
Hi, same problem again on a 24-hour-old cluster: the k8s-dqlite process is stuck at 100% CPU. The cluster is unusable, kubectl requests time out...
Inspection tarball: inspection-report-20230504_070838.tar.gz
Flamegraph of the k8s-dqlite process. It's a single-node cluster running version 1.27.1.
Summary
I've set up a 4-node microk8s cluster on bare-metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive. I can see that all the other cores are running at <6% usage, and RAM is hardly used:
The specs of the machines are as follows:
The cluster has the metallb, dns, rbac, and storage enabled. I've also deployed Rook-Ceph on the cluster.
What Should Happen Instead?
It shouldn't be using over 100% of a core.
Reproduction Steps
Introspection Report
inspection-report-20220608_143601.tar.gz