I realised I still have access to the logs from when the 3-node cluster autorefreshed to version v1.28.13. It appears as though a very similar "failure" happened at that time too, although the cluster itself continued to work. By "failure" I mean that I see messages like:
... grpc: addrConn.createTransport failed to connect to ...
Unit process 16023 (containerd-shim) remains running after unit stopped.
snap.microk8s.daemon-containerd.service: Found left-over process 16023 (containerd-shim) in control group while starting unit. Ignoring.
It is possible that this cluster had been broken since that time and our monitoring didn't pick up on it. This seems very unlikely though, as the later autorefresh did cause problems we noticed.
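A rough way to check for these left-over shim processes after the service stops (just a sketch; the unit name is taken from the log messages above):
ps -eo pid,ppid,cmd | grep '[c]ontainerd-shim'    # shim processes still alive after the stop
systemd-cgls --unit snap.microk8s.daemon-containerd.service    # what remains in the unit's control group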
Summary
An autorefresh of the microk8s snap broke the cluster: it was unable to shut down pods while the service was stopping, so once the service started again, microk8s created many duplicate pods which stepped on each other's toes over both emptyDir mounts and host ports.
This has happened to us twice, on two different microk8s deployments: the first was a single-node microk8s, the second a 3-node cluster.
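As a sketch of how the symptom shows up (commands only, nothing here is specific to our deployments):
microk8s kubectl get pods -A -o wide    # duplicate pods are visible across namespaces
microk8s kubectl get events -A | grep -i port    # host-port clashes, e.g. "didn't have free ports"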
What Should Happen Instead?
Microk8s should cleanly stop pods. In the event of unforeseen problems, perhaps there could be some better way of handling left-over processes, though to be honest I don't know what that would be; it seems like a tricky situation to recover from.
Reproduction Steps
I'm not sure how to reproduce this, or even what the cause is. In one deployment, the snap updated to version v1.28.13, although it was long enough ago that I can't say from which version. In the other deployment, the snap updated to version v1.28.14. I believe this second deployment also autorefreshed to 1.28.13 at some point without trouble, so the problem seems to be dependent on external factors (networking, maybe?).
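For pinning down which revisions were involved, snapd keeps a record of refreshes, and autorefreshes can be paused while investigating (the hold subcommand needs a reasonably recent snapd, 2.58+ I believe):
snap changes    # recent changes, including auto-refreshes and their times
snap info microk8s    # current revision, channel and refresh details
sudo snap refresh --hold microk8s    # pause autorefreshes of microk8s while investigating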
Introspection Report
There are various messages in the syslog around the time of the refresh which indicate a problem. I will provide a sample of messages here (these are from the refresh of the single-node deployment to 1.28.13):
There are 100+ repeats of the connection refused message. After these we have:
Again there are 100+ repeats of the "process remains running after unit stopped" message.
And 100+ repeats of the "Found left-over process..." message.
Comparing with the 3-node cluster microk8s deployment, I see very similar messages from when it failed.
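For reference, a rough way to quantify the repeats around the refresh window (the time window here is a placeholder; adjust it to the refresh time):
journalctl -u snap.microk8s.daemon-containerd.service --since "-2h" | grep -c "left-over process"
grep -c "createTransport failed to connect" /var/log/syslog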