canonical / charm-microk8s

Charm that deploys MicroK8s
Apache License 2.0

`database locked` situation while creating namespace should be handled more gracefully in the Charm #126

Open motjuste opened 11 months ago

motjuste commented 11 months ago

Summary

The charm does not gracefully handle the `database is locked` RPC error that can occur when creating a new namespace via Juju.

2023-12-14-09:31:17 root ERROR [localhost] Command failed: juju add-model -c foundations-maas juju-system-microk8s microk8s_cloud
2023-12-14-09:31:17 root ERROR 1[localhost] STDOUT follows:
b''
2023-12-14-09:31:17 root ERROR 2[localhost] STDERR follows:
ERROR creating namespace "juju-system-microk8s": rpc error: code = Unknown desc = exec (try: 500): database is locked

In the syslog for the main machine:

Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.024783    9310 job_controller.go:562] "enqueueing job" key="rook-ceph/rook-ceph-csi-detect-version"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.106641    9310 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="rook-ceph/csi-rbdplugin-provisioner-5885496bf5" duration="83.646528ms"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.116299    9310 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="rook-ceph/csi-rbdplugin-provisioner-5885496bf5" duration="651.454µs"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.709716    9310 job_controller.go:562] "enqueueing job" key="rook-ceph/rook-ceph-csi-detect-version"
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-k8s-dqlite[11911]: time="2023-12-14T09:31:17Z" level=error msg="error in txn: exec (try: 500): database is locked"
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: E1214 09:31:17.734550    9310 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc00950fd48)}: rpc error: code = Unknown desc = exec (try: 500): database is locked
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:17.734937    9310 trace.go:236] Trace[1854421033]: "Create" accept:application/json, */*,audit-id:c1a0ffa6-961c-4f85-92c5-da79207e583e,client:10.246.167.34,protocol:HTTP/2.0,resource:namespaces,scope:resource,url:/api/v1/namespaces,user-agent:Go-http-client/2.0,verb:POST (14-Dec-2023 09:31:15.682) (total time: 2052ms):
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]: ["Create etcd3" audit-id:c1a0ffa6-961c-4f85-92c5-da79207e583e,key:/namespaces/juju-system-microk8s,type:*core.Namespace,resource:namespaces 2030ms (09:31:15.704)
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]:  ---"Txn call failed" err:rpc error: code = Unknown desc = exec (try: 500): database is locked 2029ms (09:31:17.734)]
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]: [2.052580676s] [2.052580676s] END

What Should Happen Instead?

A `database is locked` error while creating a namespace is probably a transient condition, so it should be handled more gracefully, for example by retrying after some time. If the retry has to be done by Juju itself, the error should at least be reported to Juju more gracefully. See the feedback from the Juju team in the related issue filed with them, LP#2046471.
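
Purely as an illustration of the "retry after some time" idea (this is not the charm's actual code, and the helper names are hypothetical): a minimal retry-with-backoff sketch using the tenacity library, treating the dqlite `database is locked` message as a transient, retryable condition.

```python
# Illustrative sketch only -- not the charm's actual implementation.
# Assumes tenacity is available and that the underlying call surfaces the
# dqlite error text ("database is locked") in the exception message.
import subprocess

from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential


def create_namespace(name: str) -> None:
    """Hypothetical stand-in: create a namespace via the MicroK8s-bundled kubectl."""
    result = subprocess.run(
        ["microk8s", "kubectl", "create", "namespace", name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)


def _is_transient_db_error(exc: Exception) -> bool:
    """Treat dqlite lock contention as a retryable, transient condition."""
    return "database is locked" in str(exc)


@retry(
    retry=retry_if_exception(_is_transient_db_error),
    wait=wait_exponential(multiplier=1, max=30),  # back off: 1s, 2s, 4s, ... up to 30s
    stop=stop_after_attempt(5),                   # give up after a few tries
    reraise=True,                                 # surface the original error if retries fail
)
def create_namespace_with_retry(name: str) -> None:
    create_namespace(name)
```

Whether such a retry belongs in the charm, in Juju, or in k8s-dqlite itself is exactly the question raised in LP#2046471; the sketch only shows the shape a graceful retry could take.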

Reproduction Steps

Such issues are difficult to reproduce, but a lot of information and a crash dump are available in our Solutions QA labs (accessible only to Canonical employees). In any case, a short summary is as follows:

  1. Using Juju, create a new microk8s_cloud with three nodes (provided by MAAS).
  2. Attempt to create a namespace in this cloud via Juju:
    $ juju add-model -c ... juju-system-microk8s microk8s_cloud
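
As the trace in the syslog above shows, `juju add-model` on a Kubernetes cloud ultimately issues a `POST /api/v1/namespaces` for the model's namespace. For anyone trying to narrow this down, here is a rough, purely illustrative way to exercise the same API path directly, outside Juju, while the cluster is under load; it assumes the official `kubernetes` Python client and a kubeconfig exported with `microk8s config`, and the namespace name is arbitrary.

```python
# Illustrative only: issue the same API call (POST /api/v1/namespaces) that
# `juju add-model` triggers, to check whether dqlite lock errors show up
# independently of Juju. Assumes `microk8s config > ~/.kube/config` was run.
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

config.load_kube_config()
api = client.CoreV1Api()

body = client.V1Namespace(metadata=client.V1ObjectMeta(name="lock-test"))
try:
    api.create_namespace(body)
    print("namespace created")
except ApiException as exc:
    # A dqlite lock error would likely surface as a 5xx whose body contains
    # "database is locked", matching the kubelite log above.
    print(f"create failed: HTTP {exc.status}: {exc.body}")
```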

Environment

MicroK8s charm track: 1.28/stable
Juju version: 3.1.6
Cloud: MAAS

Additional info, logs

Apart from the log excerpts above, a lot of information can be found in the Solutions QA labs, though that is accessible only to Canonical employees. There, the relevant crash-dump artefact is at generated/generated/microk8s/juju-crashdump-microk8s-2023-12-14-09.32.16.tar.gz.

ktsakalozos commented 10 months ago

Hi @motjuste, thank you for reporting this. We often see this kind of behavior when the node is under I/O starvation. Usually there is a background process (it does not need to be a pod/container) performing some I/O-heavy task, so the Kubernetes datastore does not get the chance to write to disk and distribute the write operation to its peers. The failure to create the namespace is just a symptom. In the reported case the I/O operation failed because we could not reach the DB for two seconds; during that time the request was retried 500 times. What else was happening on the physical node at that time? Maybe Ceph was syncing?
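
One quick way to check the "something else is saturating the disk" hypothesis the next time this reproduces: sample per-disk I/O around the time of the failure (tools like iostat/iotop give the same picture). The snippet below is purely illustrative and assumes psutil is installed on the node.

```python
# Illustrative disk-pressure sampler, assuming psutil is available.
# Samples per-disk I/O counters twice and prints the deltas, to spot an
# I/O-heavy background task (e.g. a Ceph sync) around the time dqlite
# reports "database is locked".
import time

import psutil

INTERVAL = 5  # seconds between samples

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, stats in after.items():
    prev = before[disk]
    read_mb = (stats.read_bytes - prev.read_bytes) / 1e6
    written_mb = (stats.write_bytes - prev.write_bytes) / 1e6
    print(f"{disk}: read {read_mb:.1f} MB, wrote {written_mb:.1f} MB in {INTERVAL}s")
```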