On a slightly interesting note... I'm running Ubuntu on all four nodes. I ssh'd into each node and ran this on each:
sudo snap disable microk8s
Then after they all went down:
sudo snap enable microk8s
And now everything is running again. HA and everything. I'll monitor and see if somehow this stays alive for more than a couple of hours. But that's an encouragingly easy fix.
@jhughes2112 -- thanks for the report. If the issue happens again, and if it's convenient, could you do a perf capture as described here, as well as sharing a debug log from the microk8s leader node as described here? We're working on creating a special isolated release of microk8s that will have more useful diagnostics for this issue enabled out of the box, but in the meantime that manually-collected extra data is helpful for spotting commonalities between affected clusters. In the same spirit, a general description of what workload your cluster is handling, any extra tools like Flux that you've deployed, and when the problem started to manifest would all be great to have.
Will do. We actually found a clue. We're running RedPanda (kafka) in a container, and it requires about 11k AIO per instance. We're running quite a few of these, and when things fell over, we were getting errors about exhausting the AIO count. It's likely that dqlite is using async io to try to talk to the disk, but can't because there aren't any available. I did run iostat and iotop on a failing node and saw almost zero activity on the disk, but 80-100% cpu on dqlite and/or kubelet while it was failing. It all seems quite related to disk activity. We're refactoring so we never run this many, and hopefully won't see this again.
So, maybe if you want to reproduce this... launch a bunch of copies of RedPanda and see if you can take down your own cluster?
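(For anyone who wants to check the same theory on their own nodes, here is a minimal sketch using standard Linux sysctls, nothing microk8s-specific: compare the current kernel AIO usage against the system-wide cap.)

```bash
# Number of kernel AIO requests currently allocated vs. the system-wide limit.
cat /proc/sys/fs/aio-nr
cat /proc/sys/fs/aio-max-nr

# If aio-nr sits at (or very near) aio-max-nr, io_submit(2) starts failing with
# EAGAIN, which matches the "exhausting the AIO count" errors described above.
# Raising the cap is one way to confirm the theory (the value here is just an example):
sudo sysctl -w fs.aio-max-nr=1048576
```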
It's likely that dqlite is using async io to try to talk to the disk, but can't because there aren't any available.
Indeed, dqlite's libraft uses kernel AIO for disk reads and writes under the hood, and it has a fallback codepath that just does the normal syscalls on a thread from the libuv blocking pool. So you could certainly see some extra CPU usage as a result of io_submit(2) frequently failing. I'm not sure whether that's likely to be the explanation for the other reports in this issue thread, though.
In my case, kubectl and microk8s status are completely non-responsive, but all my pods are running fine as long as they don't need to speak to the control plane. It sounds exactly the same as raft timeouts and disk starvation. Just a different way to arrive at it than some of the others, perhaps.
Hmm, not sure if it's related, but the talk about disk access reminded me that I often used to run into "failed to create fsnotify watcher: too many open files" errors, e.g. when running commands like kubectl logs -f .... I worked around this by running
sysctl -w fs.inotify.max_user_watches=100000
sysctl -w fs.inotify.max_user_instances=100000
and appending those lines to /var/snap/microk8s/current/args/containerd-env
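(A possible way to make those limits survive a reboot, as a sketch; the drop-in filename below is arbitrary.)

```bash
# Persist the raised inotify limits via a sysctl drop-in (filename is arbitrary).
cat <<'EOF' | sudo tee /etc/sysctl.d/90-inotify.conf
fs.inotify.max_user_watches=100000
fs.inotify.max_user_instances=100000
EOF

# Reload all sysctl configuration files so the new values take effect now.
sudo sysctl --system
```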
I am investigating my workload against k0s-distributed Kubernetes 1.26.2-RC. Additionally, the following versions of Kubernetes controllers differ compared to my EKS deployment:
I will experiment with the different versions today. 1.26.2-RC has a kube-apiserver memory leak, so I will be resetting that work back to 1.26.1 today and observing it overnight with my workload. etcd does not show any unusual behavior.
@cole-miller generated some valuable traces on my microk8s cluster that point to some application misbehavior interacting with watches too. We shall see.
I can confirm that k0sctl-deployed vanilla Kubernetes with etcd does not exhibit the issue on my workload.
@cole-miller - We don't use flux on our system.
The main components we have deployed are:
Hi all -- thanks for bearing with us. @neoaggelos kindly helped me put together a special channel for the microk8s snap, 1.26/edge/debug-dqlite, that sets up dqlite and k8s-dqlite to collect some more useful/interpretable diagnostics. If you're able to install it on your cluster and still reproduce the issue, we'd really appreciate you doing the following to gather data:
1. Add --debug to /var/snap/microk8s/current/args/k8s-dqlite and LIBDQLITE_TRACE=1 to /var/snap/microk8s/current/args/k8s-dqlite-env.
2. Run perf record -g -p $(pidof k8s-dqlite). (These should be more useful with the special snap release, which includes more debuginfo than previously.)
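(Roughly, those steps translate into something like the following sketch, which assumes restarting the daemon is enough to pick up the new flags; run the perf capture while the CPU spike is happening.)

```bash
# Enable verbose k8s-dqlite logging and libdqlite tracing.
echo '--debug' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite
echo 'LIBDQLITE_TRACE=1' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite

# Capture a call-graph profile of the dqlite process for one minute during a spike.
sudo perf record -g -p "$(pidof k8s-dqlite)" -- sleep 60
sudo perf report   # or `perf script` to produce text output for sharing
```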
Hi all, @cole-miller, we are facing the same problem. We managed to switch the snap to the 1.26/edge/debug-dqlite channel and enable the logs. Our cluster shows the master dqlite process constantly eating 100% of a core, with spikes up to 600%. Attached the logs: snap.microk8s.daemon-k8s-dqlite.log
Thank you very much @frasmarco! A perf capture would be great to have too, if the cluster is still available to run diagnostics on. Appreciate your help and sorry that this issue is affecting you.
The perf data is huge, so I processed it with perf script. You can find it here: https://www.dropbox.com/s/zbki1740q1jix4g/perf.script.bz2?dl=0
Thanks again @frasmarco, this is great. Flamegraph for everyone to look at (GitHub strips some useful JS out, I'll post the full thing on MM):
With the debuginfo for libsqlite3 available, I notice that we spend a lot of time in likeFunc, which implements the WHERE x LIKE pattern SQL feature. The distinct queries that kine sends us that use LIKE are (in decreasing order of frequency based on the logs from frasmarco):
Perhaps those queries would be a good target for optimization on the kine side.
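(As a rough illustration of the kind of rewrite that avoids likeFunc, not the actual kine queries and using a simplified stand-in for its schema: a prefix LIKE can usually be expressed as a range predicate that SQLite can answer from an index.)

```bash
# Illustrative only: compare a prefix LIKE (which calls likeFunc per row) with an
# equivalent range predicate that SQLite can serve from an index. Simplified schema.
sqlite3 /tmp/like-demo.db <<'SQL'
CREATE TABLE IF NOT EXISTS kine (id INTEGER PRIMARY KEY, name TEXT, value BLOB);
CREATE INDEX IF NOT EXISTS kine_name_index ON kine(name);

-- Full table scan, calling likeFunc for every candidate row:
EXPLAIN QUERY PLAN SELECT id FROM kine WHERE name LIKE '/registry/pods/%';

-- Same prefix match expressed as a range ('0' is the next character after '/'),
-- which the index can answer directly:
EXPLAIN QUERY PLAN
  SELECT id FROM kine WHERE name >= '/registry/pods/' AND name < '/registry/pods0';
SQL
```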
@frasmarco, @djjudas21, @doctorpangloss, and any other affected users -- if you've still got an affected cluster node around, ideally a recent leader, please consider keeping at least the contents of /var/snap/microk8s/current/var/kubernetes/backend around. I don't recommend that you share those files here since they likely contain some sensitive information, but I'm working on a script for you to run an analysis on them locally.
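(One way to snapshot that directory for later analysis, as a sketch; stopping microk8s first keeps the dqlite files quiescent.)

```bash
# Archive the dqlite data directory before the node gets rebuilt or files rotate.
sudo microk8s stop
sudo tar czf ~/microk8s-backend-$(date +%Y%m%d).tar.gz \
  -C /var/snap/microk8s/current/var/kubernetes backend
sudo microk8s start
```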
Here's a script to dump the microk8s database in the normal sqlite3 format:
#!/bin/bash
# No arguments. Make sure microk8s is not active before running.
# If you're running this on another machine to which the contents
# of the kubernetes `backend` directory have been copied, set DATA_DIR
# to the directory containing those files. Needs a Go toolchain, unless
# you've already installed the `dqlite-demo` and `dqlite` binaries.
set -eu
shopt -s extglob nullglob

setup-go-dqlite() {
    add-apt-repository -y ppa:dqlite/dev
    apt install -y libdqlite-dev
    git clone https://github.com/canonical/go-dqlite --depth 1
    cd go-dqlite
    export CGO_LDFLAGS_ALLOW=-Wl,-z,now
    go build -o dqlite-demo cmd/dqlite-demo/dqlite-demo.go
    go build -o dqlite cmd/dqlite/dqlite.go
    demo="$PWD/dqlite-demo"
    dqlite="$PWD/dqlite"
    cd ..
}

demo=dqlite-demo
dqlite=dqlite
which dqlite-demo && which dqlite || setup-go-dqlite

datadir="${DATA_DIR:-/var/snap/microk8s/current/var/kubernetes/backend}"
mkdir -p analysis-scratch
cd analysis-scratch
cp "$datadir"/{*.yaml,open*,snapshot*,metadata*,000*-000*} .
# Overwrite cluster.yaml to contain just this node.
cp localnode.yaml cluster.yaml
dbaddr=$(head -n1 info.yaml | sed 's/Address: \(.\+\)$/\1/')

"$demo" --dir . --api 127.0.0.1:9999 --db $dbaddr &
demopid=$!
echo Demo server running, API address 127.0.0.1:9999, DB address $dbaddr, PID $demopid
echo Waiting for server to come online...
sleep 5

"$dqlite" -s $dbaddr k8s ".dump $dbaddr"
cd ..
cp analysis-scratch/db.bin{,-wal} .
echo Database dumped successfully to db.bin and db.bin-wal
I'll follow up with some analysis queries you can run against that database.
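(The follow-up queries aren't reproduced in this thread; purely as an illustration of what the dump enables, and assuming kine's usual schema with a kine table and name/value columns, something like the following shows what dominates the datastore.)

```bash
# Hypothetical example queries against the dumped database (db.bin and db.bin-wal
# must be in the current directory, as produced by the script above).
sqlite3 db.bin <<'SQL'
-- Total number of rows in the kine table.
SELECT count(*) FROM kine;

-- The ten largest stored values, to see which objects dominate the datastore.
SELECT name, length(value) AS bytes FROM kine ORDER BY bytes DESC LIMIT 10;
SQL
```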
We are working on a PR with a number of query optimizations, including the "LIKE" removal mentioned in https://github.com/canonical/microk8s/issues/3227#issuecomment-1468440350. Have a look: https://github.com/canonical/kine/pull/13. Stay tuned!
I have been running MicroK8s since v1.17, I think, and it has generally been rock solid, only breaking when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting stability. I have personally lost data due to this dqlite CPU problem (long story short: I lost dqlite quorum, which broke the kube-apiserver, but I was using OpenEBS/cStor clustered storage, which depends on Kubernetes quorum for its own quorum. When it lost quorum and the kube-apiserver became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).
A colleague pointed me at this post as this is very similar behavior to what I have been experiencing and have been able to reliably replicate.
I run v1.26, 3-node HA, with an Argo Workflows instance.
I have a workflow that scales out ~2000 jobs to download JSON data from an HTTP endpoint, then dumps it into a MongoDB instance. Each job should take no more than a couple of seconds to complete.
When the workflow's parallel setting is set to run 100 simultaneous download jobs and left to run for a bit, it nukes the cluster like you describe and requires a rebuild. Dialing back to 50 or fewer jobs in parallel does not cause the issue.
@bradleyboveinis Thanks for the report. Were you experiencing this with versions of microk8s before 1.26?
I couldn't tell you unfortunately @cole-miller, the offending workflow was only created recently so it has only been run on 1.26.
@bradleyboveinis No problem! If you're able to gather some diagnostic info as described above, we'd appreciate it.
Hi all, we have been working on a number of optimizations to the k8s-dqlite service. You can find a build with our progress in the latest/edge/k8s-dqlite channel (install it with snap install microk8s --classic --channel=latest/edge/k8s-dqlite). In the currently available build the queries have been revised so they do not include the LIKE SQL operator. That should reduce the CPU utilization and/or make the CPU spikes shorter in duration. We would be grateful for any feedback you could give us. Many thanks.
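(If you refresh an existing node, a quick sanity check to confirm which channel and revision it actually ended up on.)

```bash
# Show the installed revision and the channel the snap is tracking.
snap list microk8s
snap info microk8s | grep -i tracking
```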
@ktsakalozos I tried installing it on a 3-node cluster with snap refresh microk8s --channel=latest/edge/k8s-dqlite but, on the first node I tried, it fails to start with the following error.
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=trace msg="QUERY ROW [] : SELECT COUNT(*) FROM key_value"
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=trace msg="QUERY ROW [] : SELECT COUNT(*) FROM kine"
microk8s.daemon-k8s-dqlite[181830]: time="2023-03-22T12:00:32+01:00" level=fatal msg="Failed to start server: kine: building kine: sqlite client: near \"AS\": syntax error\n"
Thank you for the quick feedback @sbidoul. What was the version and history of the cluster you are refreshing? This looks like a problem in the datastore migration path that we have not encountered. You should be able to snap revert to get back to the old snap version until we see what is missing.
@ktsakalozos this cluster was created on 1.25 and was recently upgraded to v1.26.1 (snap rev 4595). I reverted to 1.26 using snap refresh microk8s --channel=1.26/stable.
Let me know if there is anything I can do to help diagnose this error.
@sbidoul the refresh path 1.25 to 1.26 to latest/edge/k8s-dqlite worked for me.
Is it a multi-node or single node cluster? What is the host operating system? Could you share an inspection tarball?
@ktsakalozos it is a 3-node cluster on Ubuntu 22.04 in LXD virtual machines. Can I DM you (a link to) the inspection tarball?
We just pushed a new revision to the latest/edge/k8s-dqlite channel. This should address the issue you spotted, @sbidoul. Thank you so much for giving it a test drive.
Note that the snap in the latest/edge/k8s-dqlite channel will perform at its best on new nodes, not on ones that get refreshed to it.
As someone who has no idea what he's doing (see #3859), I noticed 100% CPU usage on k8s-dqlite. I just tried running the node under this new latest/edge/k8s-dqlite release; it didn't change anything. Still getting constant 100% usage, still getting "memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request", still having pods not reacting to any commands I launch, and lots of timeouts.
I'm now down to a single-node (non-HA) cluster because I assumed it might help recovery; it did not. I don't run any special workload; the only "special" thing I have is Longhorn as the CSI volume provider, which I kinda selected at random after reading that the microk8s plugins weren't stable for production work.
@Dunge once you are in HA mode (i.e. you have at least 3 control plane nodes) you cannot drop to a non-HA setup without having the cluster freeze. The reason is that the cluster loses quorum and the datastore locks up so that it does not risk any form of corruption. To recover from this state (lost quorum) you should follow the instructions in https://microk8s.io/docs/restore-quorum
@ktsakalozos yes, as I mentioned in my other issue thread, I did make sure my cluster.yaml file had only one entry, launched the reconfigure command, and microk8s status lists only one node with no HA. I haven't tried to make the other two rejoin yet, but it was broken even before I tried to make them leave anyway. Is there any flag somewhere to check if I'm locked in that "frozen datastore" state? You can continue this discussion in the other thread, I don't want to pollute this one. I was just pitching in to say I also noticed 100% CPU usage on dqlite.
@Dunge could you please follow the instructions in https://github.com/canonical/microk8s/issues/3227#issuecomment-1464823623 so we have a better understanding of what the datastore is doing?
Here's a zip of all 3 files
This is using latest/edge/k8s-dqlite, NOT 1.26/edge/debug-dqlite.
As I said, I have no clue whether this is related to the issue here, other than that I have 100% CPU usage on dqlite. My cluster status is completely broken. I got errors just running microk8s start and status the first few times; it took about 5 minutes before it really started. I'll try to make my two other nodes join back by copying the datastore and issuing join commands now.
Note that the snap in the latest/edge/k8s-dqlite will perform at its best on new nodes and not on the ones that get refreshed to it.
@ktsakalozos I'm confused by this statement. Why is it so? Does this branch change the dqlite database schema? Is it possible to revert to 1.26 after installing it?
@ktsakalozos @cole-miller I could capture a flamegraph of both k8s-dqlite and kubelite during one of the events, using 1.26/edge/debug-dqlite.
So the issue in my case is probably elsewhere than in dqlite, as the chart I posted in https://github.com/canonical/microk8s/issues/3227#issuecomment-1452648321 hinted.
Do you have any idea about where these squashfs decompression frames come from in kubelite?
Oh, I realize GitHub does not preserve the SVG file. So here is the perf data for kubelite + k8s-dqlite, as well as for all processes on the machine during a similar event: perf.tar.gz. The issue seems related to snap squashfs decompression, with kubelite being the culprit (or suffering the most).
@sbidoul -- thanks for the data. If kubelite is the CPU hog on your cluster then I think you are seeing a different problem from some of the other people in this issue thread, who reported that k8s-dqlite was the culprit.
@ktsakalozos I'm confused by this statement. Why is it so? Does this branch change the dqlite database schema? Is it possible to revert to 1.26 after installing it?
The latest/edge/k8s-dqlite build introduces two new indexes. On existing deployments that refresh to the new build, both the old and new indexes are needed because there might be a mix of old and new nodes.
@cole-miller @ktsakalozos thank you for your guidance on collecting perf data so far. It has been helpful in narrowing down the problem. I agree my case is not 100% CPU on dqlite, but it is still microk8s components causing a huge CPU load. In the perf stack capture in https://github.com/canonical/microk8s/issues/3227#issuecomment-1490261518, we see kubelite, k8s-dqlite and containerd (from the microk8s snap) spending most of their time handling page faults and squashfs decompression (I assume this is related to the snap loop mounts?). I did a perf stat -p {kubelite pid} and I see it doing hundreds to thousands of page faults per second during such incidents. At the same time, workloads running on the same node seem to behave normally and the node does not have memory pressure that would explain thrashing.
By any chance would you have any other suggestion to diagnose this further? This happens only on one of my several microk8s clusters, which all share a similar configuration, and I'm running out of ideas.
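(A couple of generic follow-ups that might help narrow this down, as a sketch: major faults are the ones that actually hit the disk, e.g. re-reading pages of the snap's squashfs, so comparing major versus minor faults on kubelite could show whether the kernel keeps evicting its executable pages.)

```bash
# Count page faults on kubelite for 30 seconds; a high major-fault rate means
# pages are being re-read from disk (e.g. the snap squashfs) rather than from RAM.
sudo perf stat -e page-faults,minor-faults,major-faults -p "$(pidof kubelite)" -- sleep 30

# System-wide memory/IO picture during an incident (swap activity, IO wait, etc.).
vmstat 1 10
```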
@sbidoul do you see anything interesting in dmesg? Could this be a hardware issue? Maybe we should bring the question to the snapcraft people over at https://forum.snapcraft.io/
@ktsakalozos I think I have ruled out hardware issues, since the problem occurs on 3 nodes, one of which is on completely different hardware, and it also occurred in a cloud VM I had in the same cluster at some point. Nothing suspicious in dmesg either.
One thing that maybe stands out is that I have MinIO running in the cluster, with 2.5 million objects, so 2 volumes on each of these nodes with lots of files. Could it be that there is something in kubelite (cAdvisor?) that could be negatively influenced by that?
Hello, I have the same problem with a single-node cluster. Version of microk8s: 1.24. Here is a flamegraph:
@texaclou in the graph you shared there is a likeFunc that has been identified as one potential bottleneck. In the 1.27 release that came out today we have worked towards eliminating the use of this LIKE-pattern-matching operation so the performance we get from k8s-dqlite should be improved. This patch however is not available on pre-1.27 releases.
Seeing the same issue, dqlite started sitting at 100% a few days ago on a 3 node homelab cluster. Was initially running 1.24, upgraded to 1.25 with no changes. Cluster is basically non-functional at this point :(
My cluster basically got fully borked now... sqlite requests are timing out because they are taking over 30s...
Trace[553469342]: ---"Write to database call finished" len:155,err:Timeout: request did not complete within requested timeout - context deadline exceeded 34001ms (19:58:09.246)
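(For anyone wanting to check whether their apiserver is hitting the same slow datastore writes, a sketch using the systemd unit names the microk8s snap installs.)

```bash
# Search recent kubelite (kube-apiserver) logs for traces that blame slow datastore writes.
sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" \
  | grep -E 'Write to database call finished|context deadline exceeded'
```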
Hi @alexgorbatchev would it be possible to share an inspection tarball and also try the 1.27 release where we have introduced a set of datastore improvements?
Sure, here it is. Unfortunately I can't upgrade to 1.27 because I'm using Longhorn for storage, which currently supports up to 1.25.
Hi, same problem again on a 24-hour-old cluster: the k8s-dqlite process is stuck at 100% CPU. The cluster is unusable, kubectl requests time out...
Inspection tarball: inspection-report-20230504_070838.tar.gz
Flamegraph of the k8s-dqlite process. It's a single-node cluster running version 1.27.1.
Summary
I've set up a 4-node microk8s cluster on bare-metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive. I can see that all the other cores are running at <6% usage, and RAM is hardly used:
The specs of the machines are as follows:
The cluster has the metallb, dns, rbac, and storage enabled. I've also deployed Rook-Ceph on the cluster.
What Should Happen Instead?
It shouldn't be using over 100% of a core.
Reproduction Steps
Introspection Report
inspection-report-20220608_143601.tar.gz