motjuste opened 11 months ago
Hi @motjuste, thank you for reporting this. We often see this kind of behavior when the node is under I/O starvation. Usually there is a background process (it does not need to be a pod/container) performing some I/O-heavy task, so the Kubernetes datastore does not get a chance to write to disk and distribute the write operation to its peers. The failure to create the namespace is just a symptom. In the reported case the I/O operation failed because the DB could not be reached for two seconds; during that time the request was retried 500 times. What else was happening on the physical node at that time? Maybe Ceph was syncing?
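(For context, one way to check whether a machine was under I/O starvation at the time is the kernel's Pressure Stall Information. The sketch below is a small diagnostic helper, not part of MicroK8s or the charm; it assumes a Linux kernel with PSI enabled, i.e. that /proc/pressure/io exists.)

```python
# Diagnostic sketch (assumes Linux PSI is enabled): report how much time tasks
# have recently stalled waiting on I/O, which would starve the datastore.
from pathlib import Path


def io_pressure() -> dict:
    """Parse /proc/pressure/io into {"some": {...}, "full": {...}}."""
    result = {}
    for line in Path("/proc/pressure/io").read_text().splitlines():
        kind, *fields = line.split()  # e.g. "some avg10=1.23 avg60=0.45 avg300=0.10 total=12345"
        result[kind] = {key: float(value) for key, value in (f.split("=") for f in fields)}
    return result


if __name__ == "__main__":
    pressure = io_pressure()
    print(pressure)
    # "full" means all non-idle tasks were stalled on I/O; sustained high avg10
    # values here line up with the datastore being unable to commit writes.
    if pressure["full"]["avg10"] > 10.0:
        print("Node looks I/O starved; check background tasks (e.g. a Ceph sync).")
```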
Summary

The charm does not gracefully handle a `database locked` RPC error when creating a new namespace via Juju. In the `syslog` for the main machine:

What Should Happen Instead?
A `database locked` error, which is probably a temporary issue while creating a namespace, should be handled more gracefully, perhaps by retrying after some time. If the retry must be done by Juju, the error should at least be reported to Juju more gracefully. See feedback from the Juju team in the relevant issue filed with them, LP#2046471.
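To illustrate the kind of handling being asked for, here is a minimal sketch of retrying namespace creation with exponential backoff when the datastore reports it is locked. This is not the charm's actual code: the `microk8s kubectl` invocation, the error-matching strings, and the backoff parameters are assumptions for the example.

```python
# Minimal sketch (assumed names/parameters): retry namespace creation with
# exponential backoff when the datastore reports a transient "database locked".
import subprocess
import time


def create_namespace(name: str) -> None:
    """Create a namespace via kubectl; raise RuntimeError with stderr on failure."""
    proc = subprocess.run(
        ["microk8s", "kubectl", "create", "namespace", name],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip())


def create_namespace_with_retry(name: str, attempts: int = 5, base_delay: float = 1.0) -> None:
    """Retry on a transient 'database locked' error instead of failing immediately."""
    for attempt in range(1, attempts + 1):
        try:
            create_namespace(name)
            return
        except RuntimeError as err:
            message = str(err)
            # Treat only the locked-datastore error as transient; re-raise anything else.
            if "database" not in message or "locked" not in message:
                raise
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```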
Reproduction Steps

Such issues are difficult to reproduce, but a lot of information and a crash-dump are available in our Solutions QA labs (accessible only to Canonical employees). In any case, a short summary would be as follows: a `microk8s_cloud` with three nodes (provided by MAAS).

Environment
MicroK8s charm track: 1.28/stable
Juju version: 3.1.6
Cloud: MAAS
Additional info, logs
Apart from the log excerpts referenced above, a lot of information can be found in the Solutions QA labs, which are accessible only to Canonical employees. There, the relevant crash-dump artefact can be found at generated/generated/microk8s/juju-crashdump-microk8s-2023-12-14-09.32.16.tar.gz.