canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

k8s-dqlite spiking cpu core to 100% #3227

Open · WilliamG-LORA opened this issue 2 years ago

WilliamG-LORA commented 2 years ago

Summary

I've set up a 4-node MicroK8s cluster on bare-metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive.

I can see that all the other cores are running at <6% usage and RAM is hardly used (htop screenshot):

The specs of the machines are as follows:

The cluster has the metallb, dns, rbac, and storage addons enabled. I've also deployed Rook-Ceph on the cluster.

What Should Happen Instead?

It shouldn't be using over 100% of a core.

Reproduction Steps

  1. Create a microk8s cluster
  2. Deploy Rook-Ceph
  3. Wait a bit.
  4. I'm not sure how to properly reproduce this issue...
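
For anyone trying to reproduce from scratch, here is a minimal sketch of steps 1-2. The addon list comes from the cluster description above, while the MetalLB address range and the Rook-Ceph manifest paths (taken from the upstream quickstart layout) are assumptions rather than anything confirmed in this report:

sudo snap install microk8s --classic
microk8s enable dns rbac storage metallb:192.168.1.240-192.168.1.250   # address range is an assumption
# Rook-Ceph from the upstream example manifests (quickstart layout assumed)
git clone --single-branch https://github.com/rook/rook.git
microk8s kubectl create -f rook/deploy/examples/crds.yaml -f rook/deploy/examples/common.yaml -f rook/deploy/examples/operator.yaml
microk8s kubectl create -f rook/deploy/examples/cluster.yaml
# then wait and watch the CPU usage of the k8s-dqlite process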

Introspection Report

inspection-report-20220608_143601.tar.gz

maximemoreillon commented 2 years ago

I am facing the same issue on multiple AWS EC2 instances, each running a single-node MicroK8s instance.

Microk8s version: 1.23 classic

Enabled addons:

For example, here is a screenshot of htop on an EC2.XLarge (16 GB memory):

(screenshot: microk8s_cpu)

Microk8s was running smoothly until this week.

On the other hand, instances running microk8s version 1.18 were not affected.

bc185174 commented 2 years ago

We found similar results: the dqlite service on the leader node was hitting 100% usage. As mentioned in a few other issues, dqlite is sensitive to slow disk performance. In other scenarios, such as a node drain, it took a while to write to the database. You should see something similar to this in the logs: microk8s.daemon-kubelite[3802920]: Trace[736557743]: ---"Object stored in database" 7755ms. On our cluster we found it took over ~18000ms to write to the datastore and dqlite could not cope with it. As a result, it led to leader election failures and kubelite service panics.
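
A quick, hedged way to spot the same pattern on another cluster is to scan the kubelite journal for slow "Object stored in database" traces like the one quoted above; the 4-digit millisecond filter is just an illustrative threshold:

# list datastore writes that took 1000ms or more (illustrative threshold)
sudo journalctl -u snap.microk8s.daemon-kubelite | grep -E '"Object stored in database" [0-9]{4,}ms'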

We monitored the CPU and RAM utilisation for dqlite and compared it to etcd under the same workload and conditions.

Dqlite idle: (chart)

Etcd idle: (chart)

benben commented 1 year ago

Can confirm this on Proxmox-virtualized VMs. There is constant high load on a 3-node cluster:

(chart)

djjudas21 commented 1 year ago

I'm seeing something similar. I was running a 4-node HA cluster, but it failed (see #3735), so I removed 2 nodes to disable HA mode and hopefully restore quorum; I'm now running 2 nodes, 1 of which is master. The master has a dqlite process rammed at 100% CPU. Running iotop shows aggregate disk transfer of only about a hundred KB/s. The dqlite log shows various transaction traces; most complete in 500-700 ms, but occasionally I get a much slower one:

Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: ["GuaranteedUpdate etcd3" audit-id:70abc60c-f365-4744-96fc-a404d34de11b,key:/leases/kube-system/kube-apiserver-b3nhikmrwakntwutkwiesxox4e,type:*coordination.Lease,resource:leases.coordination.k8s.io 7005ms (09:08:44.693)
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]:  ---"Txn call completed" 7005ms (09:08:51.699)]
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: [7.005829845s] [7.005829845s] END

The hardware isn't exactly a rocketship, but all my nodes are i5-6500T machines with 4 cores, 16 GB memory and a 256 GB SSD, which should be adequate. Most of my workloads are not running at the moment, either.

cole-miller commented 1 year ago

Hi all. I work on dqlite, and I'm going to try to figure out what's causing these CPU usage spikes. If you're experiencing this issue on a continuing basis and are in a position to collect some diagnostic info, I could use your help! Short of a reproducer that I can run myself (which I realize is difficult for this kind of complex system), the data I'd find most helpful would be a sampling profiler report showing where the k8s-dqlite process is spending its time during one of these spikes. A separate report for the same process and workload during a period of nominal CPU usage would also be great, so I can compare the two and see if anything stands out. You can gather this information as follows:

  1. Install the perf command-line tool on all machines in your cluster. On Ubuntu this is part of the linux-tools package (you'll have to pick a "flavor" like linux-tools-generic).
  2. Collect a profile by ssh-ing into the affected node and running perf record -F 99 --call-graph dwarf -p <pid>, where <pid> is the PID of the k8s-dqlite process. That command will keep running and collecting samples until you kill it with Ctrl-C.
  3. Upload the generated perf.data file to somewhere I can access. (It doesn't contain a core dump or anything else that might be sensitive, just backtraces.) Please also share the version of microk8s that you're using.
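
Condensed into commands, steps 1-2 above look roughly like this (a sketch: the linux-tools flavour has to match your running kernel, and the pidof lookup assumes a single k8s-dqlite process):

# install perf (pick the flavour matching your kernel)
sudo apt install linux-tools-generic
# profile k8s-dqlite at 99 Hz with DWARF call graphs; stop with Ctrl-C
sudo perf record -F 99 --call-graph dwarf -p "$(pidof k8s-dqlite)"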

If the spikes last long enough that you can notice one happening and have time to ssh in and gather data before it ends, do that. Otherwise, since it's probably not feasible to just leave perf running (perf.data would get too huge), you could have a script like

i=0
while true; do
        i=$(( (i + 1) % 2 ))          # alternate between perf.data.0 and perf.data.1
        rm -f perf.data.$i            # remove the stale file we are about to overwrite
        # record the target PID for 60 seconds per pass; the previous window stays on disk
        perf record -F 99 --call-graph dwarf -o perf.data.$i -p <pid> sleep 60
done

Or with a longer timeout, etc. Then if you notice after the fact that a spike has occurred, you hopefully still have the perf.data file for that time period around.

Thanks for your reports -- let's get this issue fixed!

djjudas21 commented 1 year ago

Hey @cole-miller, thanks for looking at this.

I've done a perf capture for you, but it is worth noting a couple of things:

  1. Mine aren't transient CPU spikes; dqlite just hammers the CPU at 100% from the moment it is started.
  2. In a separate issue I'm investigating with @ktsakalozos whether this high CPU is caused by corruption of the dqlite database.

With those caveats, attached is my perf.data, captured over about a minute on MicroK8s v1.26.1 (rev 4595). Hope it is useful to you.

perf.tar.gz

doctorpangloss commented 1 year ago

@cole-miller likewise: k8s-dqlite doesn't spike; it's just at 100% all the time.

perf.data.zip

doctorpangloss commented 1 year ago

It seems like a lot of people are affected by this issue.

djjudas21 commented 1 year ago

As well as my prod cluster being affected by this, last week I quickly threw together a 3-node MicroK8s cluster on v1.26 in VirtualBox to test something. No workloads. Initially it worked normally, but then I shut down all 3 VMs. When I booted them up again later, I had the dqlite 100% CPU problem. I didn't have time to look into it as I was working on something else, but it does show that it can happen on a new cluster that hasn't been "messed with".

djjudas21 commented 1 year ago

I understand that MicroK8s is free software, no guarantees, etc., but it is run by a prominent company like Canonical, so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from maintainers (for example this one, #3735 and #3204).

I have been running MicroK8s since v1.17 I think, and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting its stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum, which broke the kube api-server; but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).

A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at the next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.

AlexsJones commented 1 year ago

I understand that MicroK8s is free software, no guarantees, etc., but it is run by a prominent company like Canonical, so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from maintainers (for example this one, #3735 and #3204).

I have been running MicroK8s since v1.17 I think, and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting its stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum, which broke the kube api-server; but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).

A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at the next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.

Hi Jonathan,

I lead Kubernetes at Canonical, and I wanted to firstly offer both an apology and a thank you for your help thus far. We know that there are situations where people like yourself are using MicroK8s and suddenly, unbeknownst to you, something goes wrong; not being able to easily solve that only compounds the problem.

Our ambition for MicroK8s is to keep it as simple as possible, both in day-to-day operation and in upgrading. I wanted to thank you again for taking the time to engage with us and help us try to improve our projects. Projects plural, as dqlite is also something we build and maintain, hence our investment in it as a low-ops K8s database backend. (That said, there is the option to run etcd with MicroK8s should you desire.)

This resilience issue is being taken extremely seriously. We are configuring machines to try to reproduce your environment, as we understand it, to the best of our abilities, and to work with our dqlite team counterparts to resolve any performance issues. (Please do let us know what your storage configuration is: localpath, etc.)

I believe one thing that really sets MicroK8s apart from alternatives is that we have no secret agenda here. We are here to serve and help our community grow, and with that comes a promise of working together to make sure our end users and community members are assisted as much as humanly possible. We will not rest until any potential issues have been exhaustively analysed, independently of whether this is a quirk of setup or environment.

All that said, I will do the following immediately:

djjudas21 commented 1 year ago

Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspections reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful).

My specific environment is 4 identical hardware nodes, each with a SATA SSD which has the OS (Ubuntu 22.04 LTS) and also the snap data, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage.

I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via an RWX storage class.

I'm actually part-way through a series of blog posts and the first part covers my architecture. The third part was going to be about OpenEBS/cStor but then it all went wrong, so I'm holding off on writing that!

AlexsJones commented 1 year ago

Thanks for the additional detail. I've set up a bare-metal cluster (albeit at a lower scale than yours) and will look to install OpenEBS/cStor with rook-ceph. We will then conduct a soak test and a variety of interrupts to generate data.

djjudas21 commented 1 year ago

Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although the root cause of my problem was with dqlite rather than cStor.

AlexsJones commented 1 year ago

Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although the root cause of my problem was with dqlite rather than cStor.

It's still worth investigating, as there might be a disk-activity correlation. Will set this up now.

AlexsJones commented 1 year ago

Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspections reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful).

My specific environment is 4 identical hardware nodes, each with a SATA SSD which has the OS (Ubuntu 22.04 LTS) and also the snap data, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage.

I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via an RWX storage class.

I'm actually part-way through a series of blog posts and the first part covers my architecture. The third part was going to be about OpenEBS/cStor but then it all went wrong, so I'm holding off on writing that!

Are you using CephFS or RBD? If so, how is it interacting with the cStor storage class?

djjudas21 commented 1 year ago

I'm not using CephFS or RBD, only OpenEBS and cStor.

AlexsJones commented 1 year ago

  • Create a microk8s cluster
  • Deploy Rook-Ceph

Okay, I saw it was mentioned at the top of the thread, thanks!

djjudas21 commented 1 year ago

Okay, I saw it was mentioned at the top of the thread, thanks!

No worries, this isn't my thread originally, but I am affected by the same dqlite CPU spike.

cole-miller commented 1 year ago

Hi @djjudas21, @doctorpangloss -- thanks very much for uploading the perf files; they're quite useful for narrowing down the root of the problem. Here are the resulting flamegraphs:

  1. For @djjudas21's data: djjudas21.svg
  2. For @doctorpangloss's data: doctorpangloss.svg

As you can see, CPU usage is dominated by calls to sqlite3_step. Looking at just the children of that function, the big contributions are from SQLite code, with some contributions also from dqlite's custom VFS (about 14% of the grand total), much of which boils down to calls to memcpy (9%). So my preliminary conclusion is that most of the CPU cycles are spent in SQLite, in which case the route to fixing this problem likely lies in optimizing the requests sent by microk8s to dqlite (via the kine bridge). But I'll continue to investigate whether any parts of dqlite stand out as causing excessive CPU usage.

cole-miller commented 1 year ago

One possible issue: dqlite runs sqlite3_step in the libuv main thread, so if calls to sqlite3_step are taking quite a long time then we're effectively blocking the event loop -- which could have bad downstream consequences like Raft requests timing out and causing leadership churn. @djjudas21, @doctorpangloss, and anyone else who's experiencing this issue, it'd be very helpful if you could follow these steps to generate some data about the distribution of time spent in sqlite3_step:

  1. On an affected node, install bpftrace: sudo apt install bpftrace
  2. Find the PID of k8s-dqlite, then run
    $ sudo bpftrace -p $k8s_dqlite_pid -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'

    That will keep running and gathering data until you kill it with Ctrl-C, and print an ASCII art histogram when it exits, which you can post in this issue thread.
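
If you would rather cap the capture than wait to press Ctrl-C, a hedged variant is to wrap the same one-liner in coreutils timeout, sending SIGINT so bpftrace still prints the histogram when it exits (the 300-second duration and output file name are arbitrary):

sudo timeout -s INT 300 bpftrace -p $(pidof k8s-dqlite) \
  -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }' \
  > sqlite3_step_hist.txt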

Thanks again for your willingness to help debug this issue!

sbidoul commented 1 year ago

@cole-miller I was able to capture the requested histogram over about 3 minutes of such an event:

[512, 1K)          11070 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            3758 |@@@@@@@@@@@@@@@@@                                   |
[2K, 4K)            4186 |@@@@@@@@@@@@@@@@@@@                                 |
[4K, 8K)            3882 |@@@@@@@@@@@@@@@@@@                                  |
[8K, 16K)            882 |@@@@                                                |
[16K, 32K)           966 |@@@@                                                |
[32K, 64K)          1048 |@@@@                                                |
[64K, 128K)          494 |@@                                                  |
[128K, 256K)         428 |@@                                                  |
[256K, 512K)          81 |                                                    |
[512K, 1M)            18 |                                                    |
[1M, 2M)               8 |                                                    |
[2M, 4M)            2208 |@@@@@@@@@@                                          |
[4M, 8M)            1271 |@@@@@                                               |
[8M, 16M)            267 |@                                                   |
[16M, 32M)            50 |                                                    |
[32M, 64M)            18 |                                                    |
[64M, 128M)            1 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              10 |                                                    |

This is a 3-node cluster with two 12-CPU nodes and one 2-CPU node. The particularity (?) is that there is relatively high (~10 ms) latency between nodes 1 and 2 on one side and node 4 on the other. Below is the CPU load measurement during the event. This event was on the little node, but it can equally happen on the big nodes, sending CPU load to 300% (apparently due to iowait, not sure).

(CPU load chart)

~Here is a perf capture (which is probably not good quality due to missing /proc/kallsyms?).~

sbidoul commented 1 year ago

Here is another view of the same 15-minute event, obtained with Python's psutil.Process(pid).cpu_times().

(chart)

sbidoul commented 1 year ago

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

doctorpangloss commented 1 year ago

sudo bpftrace -p $(pidof k8s-dqlite) -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@start[13719]: 503234295412429
@times: 
[1K, 2K)            6297 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)            5871 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[4K, 8K)            1597 |@@@@@@@@@@@@@                                       |
[8K, 16K)           4113 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[16K, 32K)           816 |@@@@@@                                              |
[32K, 64K)           542 |@@@@                                                |
[64K, 128K)          397 |@@@                                                 |
[128K, 256K)         500 |@@@@                                                |
[256K, 512K)          59 |                                                    |
[512K, 1M)            17 |                                                    |
[1M, 2M)              13 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)          3078 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[64M, 128M)           71 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              16 |                                                    |

cole-miller commented 1 year ago

@sbidoul, @doctorpangloss -- thanks! It does look like a substantial number of our calls to sqlite3_step are taking milliseconds to tens of milliseconds, which is not outrageous but is long enough that we should probably try to move those calls out of the main thread. (We will want to do this in any case for the experimental disk mode.) I will try to work up a patch that does that and we'll make it available for you to try out when it's ready.

(I don't know what to think about the small number of calls that took entire seconds to complete.)

doctorpangloss commented 1 year ago

Would it be helpful to have a reproducible environment? I can give you access to a hardware cluster reproduction of what is executing here.

cole-miller commented 1 year ago

@doctorpangloss Yes, that would be very helpful! My SSH keys are listed at https://github.com/cole-miller.keys, and you can contact me at cole.miller@canonical.com to share any private details.

cole-miller commented 1 year ago

@sbidoul I had some trouble generating a flamegraph or perf report from your uploaded data -- if you get the chance, could you try following these steps (first source block, with your own perf.data file) on the machine in question and uploading the SVG? It seems like some debug symbols may be missing from my repro environment, or perhaps we have different builds of SQLite.

Re: your second graph, I'm a little confused because it looks like CPU usage for dqlite goes down during the spike event, and it's the kubelite process that's responsible for the spike. Am I misinterpreting?

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

The main thing that dqlite has to do even in the steady state where it's not receiving any client requests is to exchange Raft "heartbeat" messages with other nodes, so that they don't think it has crashed. If you can gather perf data for one of those idle periods I'd be happy to try to interpret the results (it would be educational for me too, and we might uncover something unexpected).

ktsakalozos commented 1 year ago

As @cole-miller mentioned, the behavior we see might be related to the load MicroK8s is putting on dqlite. It would be very beneficial to have a trace of the SQL queries that reach the datastore during the CPU spikes. Would it be possible to enable debugging on the k8s-dqlite service and send us the trace when one such spike occurs?

The following adds the debug argument to the k8s-dqlite service and restarts it:

echo "--debug" | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite
sudo systemctl restart  snap.microk8s.daemon-k8s-dqlite

To see the logs you could do something like:

journalctl -u snap.microk8s.daemon-k8s-dqlite -n 30000

The above pulls the last 30,000 lines of the logs. Please make sure the CPU spike period is included in the trace, and also please tell us when the spike started. I appreciate your help.
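
As a hedged complement to the commands above (the time window, output file name, and the sed cleanup are assumptions, not an official procedure):

# after a spike, export a window of the debug logs; adjust --since to cover the spike
sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since "1 hour ago" > k8s-dqlite-debug.log
# once done, optionally drop the --debug flag again and restart the service
sudo sed -i '/^--debug$/d' /var/snap/microk8s/current/args/k8s-dqlite
sudo systemctl restart snap.microk8s.daemon-k8s-dqlite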

benben commented 1 year ago

I am just wondering: I assume that all the people working on this at Canonical run microk8s a lot themselves. Did no one see this problem in their own setups at all? Why are so many people outside of Canonical affected by this but seemingly no one inside?

djjudas21 commented 1 year ago

@benben @ktsakalozos I observed this behaviour on a busy production cluster with many workloads hitting the kube api-server. But I also observed it on an empty cluster with zero workloads/operators/etc. Just a blank install. So it can't only be down to the queries being generated by MicroK8s.

ktsakalozos commented 1 year ago

@benben, the issues that surface during our internal testing are addressed before release. After a release our user base increases roughly 10,000-fold, and it is somewhat expected that issues will be found and will be patched in a timely manner. This should not come as a surprise; unfortunately, it is a normal process for software.

In the case of this GitHub issue, the problem we are called to solve seems to be poor performance of the k8s-dqlite service. This poor performance manifests itself in a few different ways: in https://github.com/canonical/microk8s/issues/3227#issue-1264374571 we see sporadic 100% CPU spikes ("every now and then"); in https://github.com/canonical/microk8s/issues/3227#issuecomment-1154735893 we have a deployment that started to spike its CPU after a week; you reported overall high CPU usage in https://github.com/canonical/microk8s/issues/3227#issuecomment-1333492218; in https://github.com/canonical/microk8s/issues/3227#issuecomment-1157475897 we have slow disk I/O (18 s latency); etc.

All these reports are helpful as they give us hints on how we can craft a setup where this behavior is reliably reproducible. The simpler the cluster setup, the higher the chances we can find a solution. It is, however, possible that the root cause of this behavior is not the same in all cases.

@djjudas21 what can you tell me about the empty cluster with zero workloads that you observed this issue in? Again, I'm looking for a reliable reproducer.

@doctorpangloss could you please add me too? My ssh key is https://launchpad.net/~kos.tsakalozos/+sshkeys and you can reach me at kos.tsakalozos at canonical.com. Many thanks.

djjudas21 commented 1 year ago

@djjudas21 what can you tell me about the empty cluster with zero workloads that you observed this issue in? Again, I'm looking for a reliable reproducer.

@ktsakalozos the empty cluster is 3 x VirtualBox VMs each with 2 cores and 8GB RAM, all running on the same PC (16 cores, 64GB RAM). Ubuntu 22.04, MicroK8s v1.26.1 (rev 4595).

I fired the VMs up again this morning and it hasn't been stuck at 100% CPU; the dqlite process is hovering around 10% CPU on each node. I'm watching, and if it goes up to 100% I'll run the diagnostics again.

djjudas21 commented 1 year ago

@cole-miller Here are some bpftrace outputs for one of my physical nodes and one of my virtual nodes in separate clusters. Neither was experiencing the 100% CPU dqlite issue at the time.

Physical production node:

jonathan@kube05:~$ sudo bpftrace -p 76437 -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@times: 
[2K, 4K)            2918 |@@@@@@@@@@                                          |
[4K, 8K)            8799 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[8K, 16K)          14206 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)          7240 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[32K, 64K)           844 |@@@                                                 |
[64K, 128K)         1976 |@@@@@@@                                             |
[128K, 256K)        1941 |@@@@@@@                                             |
[256K, 512K)         699 |@@                                                  |
[512K, 1M)           619 |@@                                                  |
[1M, 2M)            2651 |@@@@@@@@@                                           |
[2M, 4M)            1912 |@@@@@@                                              |
[4M, 8M)               9 |                                                    |
[8M, 16M)              3 |                                                    |
[16M, 32M)            27 |                                                    |
[32M, 64M)             0 |                                                    |
[64M, 128M)            0 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              22 |                                                    |

Virtual test node with no workloads:

jonathan@cstor1:~$ sudo bpftrace -p 205509 -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@start[205609]: 20594223139835
@times: 
[256, 512)             3 |                                                    |
[512, 1K)           1078 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)             817 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[2K, 4K)             595 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[4K, 8K)             681 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[8K, 16K)            194 |@@@@@@@@@                                           |
[16K, 32K)            56 |@@                                                  |
[32K, 64K)           259 |@@@@@@@@@@@@                                        |
[64K, 128K)          114 |@@@@@                                               |
[128K, 256K)          16 |                                                    |
[256K, 512K)         214 |@@@@@@@@@@                                          |
[512K, 1M)           420 |@@@@@@@@@@@@@@@@@@@@                                |
[1M, 2M)               6 |                                                    |
[2M, 4M)               2 |                                                    |
[4M, 8M)               1 |                                                    |

freeekanayaka commented 1 year ago

I have been running MicroK8s since v1.17 I think and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting the stability.

@MathieuBordere @ktsakalozos @cole-miller is v1.25/v1.26 the version where dqlite disk-mode was turned on in microk8s?

freeekanayaka commented 1 year ago

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

The main thing that dqlite has to do even in the steady state where it's not receiving any client requests is to exchange Raft "heartbeat" messages with other nodes, so that they don't think it has crashed. If you can gather perf data for one of those idle periods I'd be happy to try to interpret the results (it would be educational for me too, and we might uncover something unexpected).

I can't back this with any data, but my educated guess is that the 0.2 CPU is not due to dqlite per se, but to its workload. The workload in this case is the kine layer, which emulates etcd behavior using SQL.

I believe it would be very insightful to have a detailed idea of what happens in the kine system performance-wise at all levels:

This is more or less what @ktsakalozos was also saying.

I believe a nice way to make progress in this investigation would be to have a sort of kine workload generator that basically produces the same/similar etcd API traffic that happens in the real world microk8s deployments that are hitting these issues. At that point one should be able to reproduce the problem by running just kine/dqlite without all the noise of a full microk8s deployment.

cole-miller commented 1 year ago

@freeekanayaka It seems like it shouldn't be too hard to get dqlite to print out a log of all client requests in a format that we could "replay" using Jepsen. Then in a situation like this we could ask people who are experiencing the issue to share their request log as a way for us to hopefully reproduce what's going on.

cole-miller commented 1 year ago

@freeekanayaka

@‌MathieuBordere @‌ktsakalozos @‌cole-miller is v1.25/v1.26 the version where dqlite disk-mode was turned on in microk8s?

The disk mode is exposed from go-dqlite in v1.11.6 and via an experimental flag to k8s-dqlite, but it's not enabled by default and I don't think any of the clusters in this thread are using it.

ktsakalozos commented 1 year ago

So far I have not been able to break a cluster in a way that makes the k8s-dqlite service top out the CPU. However, high CPU load (100+%) can be observed during excessive deletions. Although the CPU load reported in this GitHub issue may not be caused by delete operations, it may be worth looking into. Here is a reproducer of high CPU load due to deletions:

git clone https://github.com/ktsakalozos/k8s-dqlite-stress.git
cd k8s-dqlite-stress/scripts
sudo ./run-workload.sh

The workload above assumes a MicroK8s cluster is already deployed; a single-node cluster is enough. The run-workload.sh script downloads kube-burner and repeatedly creates and deletes 1000 ConfigMaps and Secrets. On my system, after about 10 minutes the deletions start causing the CPU spike. After killing run-workload.sh, the CPU load keeps going for a few more minutes.
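
To watch the reproducer take effect, one simple (hedged) option is to sample the k8s-dqlite process from another terminal while run-workload.sh is running; plain top is enough, nothing MicroK8s-specific:

# batch mode, refresh every 5 seconds, only the k8s-dqlite PID
top -b -d 5 -p "$(pidof k8s-dqlite)"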

Here is the bpftrace output from my laptop.

$ sudo bpftrace -p $(pidof k8s-dqlite) -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@start[3352607]: 789291212746175
@times: 
[512, 1K)          27620 |@@@@@@@                                             |
[1K, 2K)          194871 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)           64898 |@@@@@@@@@@@@@@@@@                                   |
[4K, 8K)           55359 |@@@@@@@@@@@@@@                                      |
[8K, 16K)          47044 |@@@@@@@@@@@@                                        |
[16K, 32K)         10416 |@@                                                  |
[32K, 64K)         10232 |@@                                                  |
[64K, 128K)        22900 |@@@@@@                                              |
[128K, 256K)       11980 |@@@                                                 |
[256K, 512K)         887 |                                                    |
[512K, 1M)            54 |                                                    |
[1M, 2M)          109488 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[2M, 4M)           24652 |@@@@@@                                              |
[4M, 8M)             198 |                                                    |
[8M, 16M)             14 |                                                    |
[16M, 32M)             1 |                                                    |
[32M, 64M)             0 |                                                    |
[64M, 128M)            0 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              10 |                                                    |

@freeekanayaka @cole-miller indeed we have not released MicroK8s with the disk mode; 1.27 will have this option behind an experimental flag.

Even when idle, a k8s cluster still queries the datastore. Normally a k8s cluster has a number of controllers that query the current state or watch for changes and act accordingly. Workloads typically include such controllers.

whiskerch commented 1 year ago

We're currently experiencing this issue on a production system, only one week after we rebuilt the cluster the last time we ran into it.

We have 4 nodes running microk8s v1.24.7.

In my cluster I can see dqlite hammering the CPU on the master node; every 20 minutes or so a different node takes over and ends up in the same position.

One of the main symptoms is that the cluster is unable to schedule any new jobs or containers.

Unfortunately my cluster is on a non-internet-connected system and I have limited ability to add new tools, so I can't run the perf tool.

I have attached the last 5000 lines of journalctl for the node running dqlite. I can see it trying to create containers and failing to mount disks (I assume this is rook-ceph struggling).

There are also lots of "Objects listed" errors from kubelite, which I think are a secondary symptom:

Mar 06 07:45:55 dhvk8s1 microk8s.daemon-kubelite[4021620]: I0306 07:45:55.414258 4021620 trace.go:205] Trace[1584856558]: "List(recursive=true) etcd3" key:/crd.projectcalico.org/felixconfigurations,resourceVersion:,resourceVersionMatch:,limit:10000,continue: (06-Mar-2023 07:45:36.884) (total time: 18529ms):
Mar 06 07:45:55 dhvk8s1 microk8s.daemon-kubelite[4021620]: Trace[1584856558]: [18.529557309s] [18.529557309s] END
Mar 06 07:45:55 dhvk8s1 microk8s.daemon-kubelite[4021620]: I0306 07:45:55.414326 4021620 trace.go:205] Trace[178411766]: "Reflector ListAndWatch" name:storage/cacher.go:/crd.projectcalico.org/felixconfigurations (06-Mar-2023 07:45:36.884) (total time: 18529ms):
Mar 06 07:45:55 dhvk8s1 microk8s.daemon-kubelite[4021620]: Trace[178411766]: ---"Objects listed" error:<nil> 18529ms (07:45:55.414)
Mar 06 07:45:55 dhvk8s1 microk8s.daemon-kubelite[4021620]: Trace[178411766]: [18.529654662s] [18.529654662s] END

I have also attached a screenshot of top running on all 4 nodes of the cluster to give an idea of what else is running.

I will keep the system in the error state for today in case there is anything I can pull off it that would help diagnose the issue (or fixes to test).

(screenshot)

journalctl.log

djjudas21 commented 1 year ago

One of the main symptoms is that the cluster is unable to schedule any new jobs or containers.

@whiskerch This sounds like the same situation I had, and it was explained to me that loss of quorum causes the api-server to become silently read-only, which of course means you can't change state in your cluster. See https://github.com/canonical/microk8s/issues/3735#issuecomment-1423919476
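
A crude way to check for that silently read-only state is to attempt a throwaway write against the api-server; this is only a hedged probe (the ConfigMap name is arbitrary), not an official diagnostic:

# a write that normally succeeds instantly; with quorum lost it tends to hang or fail
microk8s kubectl create configmap quorum-probe -n default
microk8s kubectl delete configmap quorum-probe -n default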

whiskerch commented 1 year ago

@djjudas21 - I think you are right.

My conundrum is what to do now: rebuilding and restoring the data from backups is a big task, and it is frustrating if the system is just going to get into the same state again.

I don't think updating microk8s will help, as it seems to be an issue in newer versions too.

Did restarting the kubelite service work for you? I can see in the thread you referenced that you tried it, but not whether you were able to re-establish quorum in the cluster.

djjudas21 commented 1 year ago

@whiskerch I was not able to restore quorum. I kept my cluster in its broken state for a few days to collect diagnostics, but never made progress and eventually had to restore from backup. Most of my data volumes were on an external NAS and could be re-associated with a new cluster, but I also had some containerised clustered volumes with OpenEBS/cStor, which were irretrievably lost. I was pretty grumpy because in all the years of being a professional sysadmin and a home sysadmin, I've never lost data. Ironic that my first significant data loss was in a highly available, clustered, replicated system.

I don't know what to recommend for you. At the time my cluster lost quorum, I was running MicroK8s v1.26 so I rebuilt on the same version, just due to inertia. I'm sort of expecting it to go wrong again, so I'm upping my backups and not bothering with the cStor volumes any more. If/when it explodes next time, I'll consider how I might rebuild it, but k3s is a contender. They switched away from dqlite a few releases ago (I'm not sure why) so hopefully won't be prone to this same mode of failure.

benben commented 1 year ago

@whiskerch @djjudas21 If you ever find a solution for that, I am interested too. I had rook/ceph running on my recently failed cluster as well, but luckily I had just migrated one client, which I could restore quickly from a backup 🥲

Maybe you could remove all nodes besides one and then re-add them to the first one to reach quorum? I tried that but stopped in between, since 100% CPU was killing everything and it was faster to delete everything and rebuild from backup.

djjudas21 commented 1 year ago

Careful @benben, I learned that simply removing all nodes down to one does not automatically restore quorum and you can make the situation worse if you're not careful. There are some extra steps you have to follow, although unfortunately they didn't work for me: https://microk8s.io/docs/restore-quorum

ktsakalozos commented 1 year ago

Thank you for reaching out, @whiskerch. We are interested in the logs of the snap.microk8s.daemon-k8s-dqlite service. Could you please enable debug output on that service, wait for the CPU spike, and afterwards send us the logs? This is described in https://github.com/canonical/microk8s/issues/3227#issuecomment-1453079069. Hopefully this will tell us what load the datastore is being asked to serve. Many thanks.

The command below should show which node is the datastore leader. The leader is the one holding the golden copy of the database. Could you verify that the leader node is the one showing the high CPU load?

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
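
A hedged convenience wrapper around that command, pairing the leader lookup with a snapshot of the local k8s-dqlite CPU usage (run it on each node and compare):

LEADER=$(sudo -E /snap/microk8s/current/bin/dqlite \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader")
echo "dqlite leader: $LEADER"
# CPU usage of the local k8s-dqlite process, single top snapshot
top -b -n 1 -p "$(pidof k8s-dqlite)" | tail -n 3
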
whiskerch commented 1 year ago

@ktsakalozos - Unfortunately my colleague had started to rebuild the cluster before I could get any logs off it. If we run into the same issue again we will capture all we can.

cole-miller commented 1 year ago

@whiskerch, @djjudas21, and anyone else who's affected: Are you by any chance running https://fluxcd.io/ on your clusters?

jhughes2112 commented 1 year ago

I'm experiencing this today. Had it happen last Friday on the same boxes. The way I handled it was by explicitly stopping all nodes, editing the cluster.yaml on the one I consider the master, then wiping and reinstalling all the other nodes and rejoining them. My data is all stored on EBS volumes, not on the nodes themselves, so while k8s is angry and can't fit everything on one node while I'm rebuilding and rejoining, it sorts itself out eventually without my having to fully reconfigure the cluster. The master node never gets wiped and retains all the configuration.

This is a nightmare for a production cluster. Thankfully it is only my dev and staging clusters that are failing right now. I'm not going to roll microk8s into production because of this. Looking into k3s and AKS as alternatives.