canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Graceful pod exit in microk8s.stop #2895

Open jobh opened 2 years ago

jobh commented 2 years ago

There is a long discussion about snap auto-updates in issue #1022. I think the situation would be much improved if stateful pods were allowed to exit gracefully upon microk8s.stop. That would take away the data corruption problems, as per my comment there: https://github.com/ubuntu/microk8s/issues/1022#issuecomment-1010031820

To summarize: When microk8s.stop is issued, the k8s infrastructure is torn down before pods have had a chance to react to SIGTERM. Hence, even though they have 30 seconds to terminate gracefully, they cannot write to host-provisioned paths during this time, nor can they communicate over the network.

My own (unfinished) workaround at this time is to create a new systemd service that Requires all microk8s services and itself waits for the graceful shutdown of postgresql. For the first time, I've seen these highly desired lines from postgresql:

2022-01-31T12:02:33.911933664+01:00 stderr F 2022-01-31 11:02:33.911 GMT [1] LOG:  received smart shutdown request
2022-01-31T12:02:33.920500924+01:00 stderr F 2022-01-31 11:02:33.916 GMT [1] LOG:  background worker "logical replication launcher" (PID 95) exited with exit code 1
2022-01-31T12:02:33.920700147+01:00 stderr F 2022-01-31 11:02:33.918 GMT [90] LOG:  shutting down
2022-01-31T12:02:33.966934659+01:00 stderr F 2022-01-31 11:02:33.966 GMT [1] LOG:  database system is shut down

I've attached my systemd hacks below, but the reason for opening this issue is to discuss whether this could be generalized and maybe even made the default. Perhaps by replacing the postgresql-specific scaling with a node drain/uncordon (a rough sketch of that variant follows the config files below), or by shutting down the infrastructure in a "safe" order.

/etc/systemd/system/microk8s-sentry.service

Note! This is just a proof-of-concept, tested only briefly. For discussion.

[Unit]
Description=Terminate postgresql gracefully in microk8s.stop
Requires=snap.microk8s.daemon-apiserver-kicker.service snap.microk8s.daemon-apiserver.service snap.microk8s.daemon-cluster-agent.service snap.microk8s.daemon-control-plane-kicker.service snap.microk8s.daemon-controller-manager.service snap.microk8s.daemon-etcd.service snap.microk8s.daemon-kubelet.service snap.microk8s.daemon-proxy.service snap.microk8s.daemon-scheduler.service
After=snap.microk8s.daemon-apiserver-kicker.service snap.microk8s.daemon-apiserver.service snap.microk8s.daemon-cluster-agent.service snap.microk8s.daemon-control-plane-kicker.service snap.microk8s.daemon-controller-manager.service snap.microk8s.daemon-etcd.service snap.microk8s.daemon-kubelet.service snap.microk8s.daemon-proxy.service snap.microk8s.daemon-scheduler.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/bash -c "while ! /snap/bin/microk8s kubectl scale statefulset/postgres-postgresql --replicas=1; do sleep 1; echo retrying...; done"
ExecStop=/bin/bash -c "/snap/bin/microk8s kubectl scale statefulset/postgres-postgresql --replicas=0; /snap/bin/microk8s kubectl wait pod/postgres-postgresql-0 --for=delete"

/etc/systemd/system/snap.microk8s.daemon-kubelet.service.d/microk8s-sentry.conf

Add to upstream's [Unit] section to ensure the above service is started automatically along with microk8s. Again, just proof-of-concept.

[Unit]
Wants=microk8s-sentry.service
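
For the drain/uncordon generalization mentioned above, a rough and untested sketch of an alternative [Service] section. It assumes the Kubernetes node name matches the machine hostname (which systemd substitutes for %H) and that the drain flags suit the workloads:

[Service]
Type=oneshot
RemainAfterExit=true
# on start: make the node schedulable again (mirrors the scale-up in the postgres example)
ExecStart=/bin/bash -c "while ! /snap/bin/microk8s kubectl uncordon %H; do sleep 1; echo retrying...; done"
# on stop: evict pods gracefully while the k8s services are still running
ExecStop=/bin/bash -c "/snap/bin/microk8s kubectl drain %H --ignore-daemonsets --delete-emptydir-data --timeout=120s"
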
ktsakalozos commented 2 years ago

Hi @jobh, thank you for reaching out and for sharing your approach to this problem.

Let me first clarify how updates are done. Starting from v1.22, updates (snap refreshes) do not stop the workloads running on the upgrading nodes. This means that the postgres pods will continue working while the k8s services restart. What MicroK8s version are you using?

Something that is not clear to me is why you say that the pods cannot write to hostpath-provisioned storage while the services are stopping. I assume you are using the "storage" addon? The hostpaths are mounted inside the pods, and the k8s services are not involved during operation. Therefore I would expect the mounts to remain valid while services update/restart/refresh.

microk8s stop is expected to stop the services on the node and also kill any running workloads [1]. In a multi-node cluster, or a long-lived cluster, you are not expected to run microk8s stop. Could you explain a bit more about the workflow that leads you to call the stop command?

Although the approach you have taken is specific to your use case, I can see that there may be a need to run custom scripts when shutting down the cluster. We could consider introducing hooks that users could use to inject custom behavior. For example, in [1] we could do something like:

If there is a user provided hook:
   call the hook script
kill_all_container_shims
stop k8s services

[1] https://github.com/ubuntu/microk8s/blob/master/microk8s-resources/wrappers/microk8s-stop.wrapper#L47
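
A minimal sketch of how such a hook point could look in the stop wrapper (the hook location under $SNAP_COMMON and the script name are illustrative; nothing like this exists in MicroK8s today):

# run a user-provided pre-stop hook, if any, before workloads are killed
if [ -x "${SNAP_COMMON}/hooks/pre-stop.sh" ]; then
    "${SNAP_COMMON}/hooks/pre-stop.sh" || echo "pre-stop hook failed, continuing with stop"
fi

kill_all_container_shims
# ... then stop the k8s services as the wrapper already does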

jobh commented 2 years ago

That's good to hear that refreshes do not stop workloads. I'm on the 1.21 channel, but it sounds like a worthwhile update that would solve a lot. In my case, I'm under the additional constraint that my company's IT resources are sometimes rebooted automatically to apply security patches. This doesn't run microk8s.stop explicitly, but it has the same effect by terminating the systemd services. So a robust solution should work at the systemd level (sorry if this was unclear).

When I say "cannot write to hostpath", I'm referring to the results of the test shown in https://github.com/ubuntu/microk8s/issues/1022#issuecomment-1010031820. When I redirected its output to a file on the host, nothing was written to that file after microk8s stop, even though the process was alive for 30s according to ps. I may have misinterpreted this result, though.

jobh commented 2 years ago

@ktsakalozos,

I have been thinking about your hooks suggestion, and I think it is a good one. To combine it with my idea of enforcing this at the systemd level, it could be done with an umbrella service like my example above, plus hook-runner functionality in microk8s:

[Service]
...
ExecStart=/snap/bin/microk8s run-hooks post-start
ExecStop=/snap/bin/microk8s run-hooks pre-stop

(there's also the question of how to add and manage hooks of course).

But is it possible to define this in snapcraft.yml? It's not super clear how the snap daemons are mapped to systemd units.

dalbani commented 1 year ago

Hey @ktsakalozos, your message kind of surprises me, because it doesn't match what I experience with MicroK8s (version 1.25, in any case). For example, you wrote:

Starting from v1.22 updates (snap refreshes) do not stop the workloads running on the upgrading nodes.

That's not what the snapd logs report; instead:

$ systemctl status snapd.service
● snapd.service - Snap Daemon
...
Nov 10 14:58:35 hetzner-green snapd[2538454]: snapstate.go:1591: cannot refresh snap "microk8s": snap "microk8s" has running apps (kubectl, microk8s), pids: 2013283,201332>
Nov 10 14:58:35 hetzner-green snapd[2538454]: autorefresh.go:540: auto-refresh: all snaps are up-to-date
Nov 10 23:48:34 hetzner-green snapd[2538454]: storehelpers.go:748: cannot refresh: snap has no updates available: "core18", "snapd"
Nov 10 23:48:34 hetzner-green snapd[2538454]: snapstate.go:1591: cannot refresh snap "microk8s": snap "microk8s" has running apps (kubectl, microk8s), pids: 2013283,201332>
Nov 10 23:48:34 hetzner-green snapd[2538454]: autorefresh.go:540: auto-refresh: all snaps are up-to-date
...

And indeed, although MicroK8s was configured to track 1.25/stable, the version on my machine, installed a couple of months ago, wasn't the latest revision of that channel.

You also wrote the following in your message:

microk8s stop is expected to stop the services on the node and also kill any running workloads.

Interesting, because that's not the behavior I see. On my machine, calling microk8s.stop does stop the MicroK8s components, but the various workload pods/containers all keep running. In practice, which line in microk8s-stop.wrapper should be responsible for killing those workloads? I see the kill_all_container_shims function (declared at https://github.com/canonical/microk8s/blob/master/microk8s-resources/actions/common/utils.sh#L692), but it seems to be concerned only with the Kubernetes services, doesn't it?

Thanks!

rome-legacy commented 1 year ago

Hey fellas, I was wondering about a similar (stop) problem. I'm on MicroK8s v1.25.4 revision 4221 and also noticed that my workloads are not stopped when stopping MicroK8s. The core Kubernetes components seem to shut down properly, but my neo4j database, for example, which I installed using a helm template and which consists of a StatefulSet, some services and secrets, still keeps running. This is basically the only custom workload I have deployed.

These are my enabled addons:

  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
    storage              # (core) Alias to hostpath-storage add-on, deprecated
  disabled:
    cert-manager         # (core) Cloud native certificate management
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    ingress              # (core) Ingress controller for external access
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging

I noticed it when I enabled the OpenEBS Mayastor addon (which consumes 100% of one CPU by design). After shutting down with microk8s stop, the mayastor process was still running and still consuming 100% of one CPU.

I hope this is not too off-topic with regard to the original message, but I'm also interested in this issue :-D

Kind regards

benben commented 1 year ago

This is also relevant when you have many microk8s nodes in some kind of autoscaling setup where you want them to be constantly stopped/started.

jadams commented 5 months ago

I'm also facing this problem. It seems that containerd does not shut down all of its containers when exiting (although the code shows it sending a SIGKILL signal). The only way I have found to get rid of all the pod processes is to run something like:

#!/usr/bin/env bash
# get all kubepod pids
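# note: this path assumes cgroup v1; the pids controller directory and the "tasks" files do not exist in the unified cgroup v2 hierarchy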
readarray -t KUBEPOD_PIDS <<< "$(/usr/bin/find /sys/fs/cgroup/pids/kubepods -name tasks -exec /usr/bin/cat {} \;)"
if (( "${#KUBEPOD_PIDS[@]}" > 1 )); then
    # send SIGTERM to gracefully end pod processes
    /usr/bin/kill "${KUBEPOD_PIDS[@]}"
    # wait 10s for graceful stop
    /usr/bin/sleep 10s
    # again get kubepod pids
    readarray -t KUBEPOD_PIDS <<< "$(/usr/bin/find /sys/fs/cgroup/pids/kubepods -name tasks -exec /usr/bin/cat {} \;)"
    if (( "${#KUBEPOD_PIDS[@]}" > 1 )); then
        # send SIGKILL to remove remaining processes
        /usr/bin/kill -9 "${KUBEPOD_PIDS[@]}"
    fi
fi
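
One way to run such a cleanup automatically at machine shutdown, after containerd has gone down, might be a oneshot unit ordered Before= the containerd unit, so that its ExecStop fires after containerd stops. An untested sketch, with a hypothetical script path:

/etc/systemd/system/kubepods-cleanup.service

[Unit]
Description=Kill leftover kubepods processes at shutdown
# ordered before containerd, so at shutdown containerd stops first and ExecStop runs afterwards
Before=snap.microk8s.daemon-containerd.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/true
ExecStop=/usr/local/bin/kill-kubepods.sh

[Install]
WantedBy=multi-user.target

It would then be armed with systemctl daemon-reload && systemctl enable --now kubepods-cleanup.service.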