NixOS / nixpkgs


k3s: data corruption on ordinary system shutdown #255783

Open KyleSanderson opened 11 months ago

KyleSanderson commented 11 months ago

Describe the bug

Containers and k3s are not stopped before the filesystems are unmounted. This results in a complete loss of state and data during a completely ordinary system shutdown.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Shut down your system.
  2. Watch containerd shut down after your disks have already been unmounted, resulting in corruption and data loss.

Expected behavior

Shutdown order: run k3s-killall.sh before the filesystems are unmounted (umount).
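
For illustration, one (untested) way to express that ordering on NixOS would be to tie k3s.service to the mount units holding its state, so systemd stops k3s before unmounting them; the path below is just an example:

  # Untested sketch: RequiresMountsFor adds Requires=/After= on the mount units
  # for this path, so on shutdown systemd stops k3s before unmounting them.
  # "/var/lib/rancher" is illustrative; use wherever your k3s/container state lives.
  systemd.services.k3s.unitConfig.RequiresMountsFor = "/var/lib/rancher";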


Notify maintainers

cc @euank @Mic92 @yajo

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.52, NixOS, 23.05 (Stoat), 23.05.3427.e5f018cf150e`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.13.5`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
euank commented 11 months ago

Can you provide a little more information about this? Is this a nixos bug, or a k3s/containerd/something-else bug?

Looking upstream at the k3s.service they ship, I see no ExecStop or such, which is the main thing that makes me question if this is NixOS specific.

I also haven't observed issues here. I haven't yet seen anything I'd consider data-loss or corruption when shutting down (or just yanking the power cord) on my machines running k3s.

So, to try and ask some clarifying questions:

  1. What state/data are you referring to? The k3s state (i.e. the info about what pods should be running, etc, stored in sqlite or etcd by k3s)? The state of specific pods or containers?
  2. If k3s state, what datastore, and is the datastore saved on a separate disk (i.e. /var/lib/rancher or such isn't on your root filesystem)? What's the underlying filesystem?
  3. If it's the containers' state, what volume plugin, and again is it a separate filesystem they're stored on? What's the underlying filesystem?
KyleSanderson commented 11 months ago

Can you provide a little more information about this? Is this a nixos bug, or a k3s/containerd/something-else bug?

Looking upstream at the k3s.service they ship, I see no ExecStop or such, which is the main thing that makes me question if this is NixOS specific.

That is correct, I've had to update this on Ubuntu. It's not a native package there, though.

I also haven't observed issues here. I haven't yet seen anything I'd consider data-loss or corruption when shutting down (or just yanking the power cord) on my machines running k3s.

So, to try and ask some clarifying questions:

  1. What state/data are you referring to? The k3s state (i.e. the info about what pods should be running, etc, stored in sqlite or etcd by k3s)? The state of specific pods or containers?

The containers all operate as if they crashed, with sqlite taking the brunt of the damage. The same goes for applications that report their network state back on shutdown, which cannot happen when this corruption occurs.

  2. If k3s state, what datastore, and is the datastore saved on a separate disk (i.e. /var/lib/rancher or such isn't on your root filesystem)? What's the underlying filesystem?

Local xfs disk.

  3. If it's the containers' state, what volume plugin, and again is it a separate filesystem they're stored on? What's the underlying filesystem?

(Different) local xfs disks.

euank commented 11 months ago

Thanks for the quick response and additional info, appreciated! I'm still not sure I completely follow or see any issue here though

The containers all operate as if they crashed

To me, that sounds like it's working as intended so far for a default k8s setup.

Have you set the k8s graceful node shutdown options (off by default) for the kubelet? I think enabling them is what's supposed to make pods terminate on kubelet shutdown in more cases.

I don't use those options personally, so I can't say they do work for the nixos package of k3s, but without those set, I think it's more or less expected behavior for pods to not shutdown gracefully on node termination.

with sqlite taking the brunt of the damage

Can you explain what you mean there? sqlite on XFS shouldn't become corrupted, even with an unexpected power loss.

Is it becoming corrupted where k3s can't read it anymore on restart? Something else?

The same goes for applications that report their network state back on shutdown, which cannot happen when this corruption occurs

That's application-specific logic running on k3s, right? K8s in general doesn't guarantee graceful shutdown of pods (it can't, you might pull the plug on the machine), so it seems like that logic would need to handle this case anyway somehow.

Is there other k3s data-loss or corruption you can be more specific about, for example "After a restart, running this 'k3s kubectl' command gives an error, or presents corrupt data", or "k3s or containerd refuses to launch with this error"?

KyleSanderson commented 11 months ago

Thanks for the quick response and additional info, appreciated! I'm still not sure I completely follow or see any issue here though

The containers all operate as if they crashed

To me, that sounds like it's working as intended so far for a default k8s setup.

Have you set the k8s graceful node shutdown options (off by default) for the kubelet? I think enabling them is what's supposed to make pods terminate on kubelet shutdown in more cases.

I don't use those options personally, so I can't say they do work for the nixos package of k3s, but without those set, I think it's more or less expected behavior for pods to not shutdown gracefully on node termination.

That's OK, but again, the disks are forcibly unmounted while the containers are still running, before they're ever notified that a shutdown is in progress. That's not good: it clearly leaves the entire application state unclean and, in some cases, corrupted.

with sqlite taking the brunt of the damage

Can you explain what you mean there? sqlite on XFS shouldn't become corrupted, even with an unexpected power loss.

I'm not sure; a couple of applications in containers just died on me when this happened. Adding the shutdown hook fixed the corruption.

yajo commented 11 months ago

Adding the shutdown hook fixed the corruption.

So, you found a workaround? Could you share it please?

KyleSanderson commented 11 months ago
[Unit]
Description=rke2-cleanup
DefaultDependencies=no
Before=shutdown.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/rke2-killall.sh
TimeoutStartSec=0

[Install]
WantedBy=shutdown.target

Then you change the Requires= on the production service to point to the mounts. Obviously there are better solutions.
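
A rough NixOS translation of that unit might look like this (untested sketch; the script path mirrors the original and would need to point at the packaged killall script on NixOS):

  systemd.services.rke2-cleanup = {
    description = "rke2-cleanup";
    before = [ "shutdown.target" ];
    wantedBy = [ "shutdown.target" ];
    unitConfig.DefaultDependencies = "no";
    serviceConfig = {
      Type = "oneshot";
      # Path copied from the unit above; on NixOS this would come from the package.
      ExecStart = "/usr/local/bin/rke2-killall.sh";
      TimeoutStartSec = 0;
    };
  };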

Mic92 commented 11 months ago
[Unit]
Description=rke2-cleanup
DefaultDependencies=no
Before=shutdown.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/rke2-killall.sh
TimeoutStartSec=0

[Install]
WantedBy=shutdown.target

Then you change the Requires= on the production service to point to the mounts. Obviously there are better solutions.

Sounds like a good solution to me.

yajo commented 11 months ago

ExecStart=/usr/local/bin/rke2-killall.sh

How come you use rke2-killall.sh to kill k3s workloads?

Mic92 commented 11 months ago

I guess it should be this: https://docs.k3s.io/upgrades/killall

Mic92 commented 11 months ago

The killall script from k3s doesn't look any better than what systemd would do either: https://github.com/k3s-io/k3s/blob/550dd0578f79882e1a78d8468fdbefa95faa145c/package/rpm/install.sh#L571 It also just uses kill -9.

Mic92 commented 11 months ago

Also the rke2 variant doesn't look great: https://github.com/rancher/rke2/blob/7466261e4792e68baa2cc0c2afd3dcc929d72061/bundle/bin/rke2-killall.sh#L26

Mic92 commented 11 months ago

A proper shutdown would probably drain the node first, but then what about stateful data?

yajo commented 11 months ago

Drain is too much for a simple reboot. HA apps should survive a node being rebooted without problem. Actually, they should survive a node explosion.

However, it would be useful to at least cordon the node before shutdown.
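
For illustration, a cordon-before-shutdown hook might look roughly like this (untested sketch; it assumes the node name matches the hostname, %H, and that the API is still reachable from this node during shutdown):

  systemd.services.k3s-cordon = {
    description = "Cordon this node before shutdown";
    after = [ "k3s.service" ];
    requires = [ "k3s.service" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      # Ordered after k3s, so at shutdown this unit stops (and cordons) before k3s does.
      ExecStart = "${pkgs.k3s}/bin/k3s kubectl uncordon %H";
      ExecStop = "${pkgs.k3s}/bin/k3s kubectl cordon %H";
    };
  };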

Mic92 commented 11 months ago

Does cordon also stop containers? It doesn't look like it does.

yajo commented 11 months ago

No, it just informs k3s that the node is under maintenance. However, I'm thinking that adding that automation by default could have undesired side effects. For example, if the node shuts down for unexpected reasons (an upgrade, low on battery... who knows?), workloads wouldn't be rescheduled by k8s. That's probably bad in some scenarios.

KyleSanderson commented 11 months ago

There is also this bug with a classic switch: https://github.com/NixOS/nixpkgs/issues/255768. The kill, although brutal, would help there as well.

yajo commented 11 months ago

Going back to this issue, I think this is how k3s is designed. Quoting from the docs:

To allow high availability during upgrades, the K3s containers continue running when the K3s service is stopped.

To stop all of the K3s containers and reset the containerd state, the k3s-killall.sh script can be used.

However, I agree that nixos could provide an option to add that killall as a pre-poweroff hook.

zarelit commented 6 months ago

Just passing by and linking a very related issue: https://github.com/NixOS/nixpkgs/issues/98090

superherointj commented 3 months ago

Now that the k3s-killall.sh script is available, it is possible to trigger it on shutdown. I'm considering this possible solution:

  systemd.services.killall-k3s = {
    # "killall-k3s" is used as service name, because naming `k3s-*` introduces a loop at `k3s-killall.sh`, delaying shutdown.
    description = "Executes k3s-killall.sh on shutdown";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      User = "root";
      # RemainAfterExit=true is required when ExecStart is missing because systemd won't
      # attempt to run ExecStop if it thinks that the service is not running.
      # RemainAfterExit=true causes systemd to believe that the service is running,
      # thereby causing it to run ExecStop at shutdown.
      RemainAfterExit = "yes";
      Restart = "no";
      # Shutdown order is the reverse of startup order. Hence the script has to be placed in ExecStop.
      ExecStop = "${pkgs.k3s}/bin/k3s-killall.sh";
    };
  };

@KyleSanderson Can you test if this solves the data corruption problem you mentioned?

If it fixes the problem, we could add this to the k3s module. Thoughts?

Am I missing something?

Update: This solution isn't ideal because it doesn't do a graceful shutdown. I'm only keeping it here as a reference for running a script at shutdown.

rorosen commented 3 months ago

I have not experienced data corruption on node shutdown so far, but it's true that node shutdown isn't flawless. If you want to make sure that all containers on a node are terminated properly (if possible) before the node shuts down, you should use the graceful node shutdown feature, as mentioned by @euank. I use it for all nodes and am very satisfied. You could do this by configuring the kubelet accordingly, something along these lines (not tested):

- services.k3s.extraFlags = toString [
-  "--kubelet-arg="shutdownGracePeriod=30s"
-  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
-];
+ {
+  services.k3s.extraFlags = "--kubelet-arg=config=/etc/rancher/k3s/kubelet-config.yaml";
+  environment.etc."rancher/k3s/kubelet-config.yaml" = {
+    text = builtins.toJSON {
+      apiVersion = "kubelet.config.k8s.io/v1beta1";
+      kind = "KubeletConfiguration";
+      shutdownGracePeriod = "30s";
+      shutdownGracePeriodCriticalPods = "10s";
+    };
+  };
+}

The killall script isn't so graceful and will just send SIGKILL to all containerd processes that were spawned by k3s. This should also happen at some point during shutdown, so I'm not sure that the killall script will fix whatever causes your corruption.

There is also an issue with local-path-provisioner v0.0.26, which the latest k3s release uses by default, that causes it to ignore shutdown requests. The underlying issue originates in the sig-storage-lib-external-provisioner dependency and is already fixed on master there (by #159). However, there is no release that includes the fix, and it hasn't been backported so far, although there is a corresponding issue. The whole thing affected me in the sense that systemd ran into a timeout when shutting down the local-path-provisioner pod; I haven't noticed it causing any data corruption. Anyway, if this is also an issue for you, I'm happy to provide a patch that lets you build a fixed local-path-provisioner container image.

EDIT: Thanks to @yajo for pointing out that the configuration snippet would never work (the quotes weren't even valid...). I copied the working config from his comment

KyleSanderson commented 3 months ago

I have not experienced data corruption on node shutdown so far, but it's true that node shutdown isn't flawless. If you want to make sure that all containers on a node are terminated properly (if possible) before the node shuts down, you should use the graceful node shutdown feature, as mentioned by @euank. I use it for all nodes and am very satisfied. You could do this by configuring the kubelet accordingly, something along these lines (not tested):

services.k3s.extraFlags = toString [
  "--kubelet-arg="shutdownGracePeriod=30s"
  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
];

The killall script isn't so graceful and will just send SIGKILL to all containerd processes that were spawned by k3s. This should also happen at some point during shutdown, so I'm not sure that the killall script will fix whatever causes your corruption.

There is also an issue with local-path-provisioner v0.0.26, which the latest k3s release uses by default, that causes it to ignore shutdown requests. The underlying issue originates in the sig-storage-lib-external-provisioner dependency and is already fixed on master there (by #159). However, there is no release that includes the fix, and it hasn't been backported so far, although there is a corresponding issue. The whole thing affected me in the sense that systemd ran into a timeout when shutting down the local-path-provisioner pod; I haven't noticed it causing any data corruption. Anyway, if this is also an issue for you, I'm happy to provide a patch that lets you build a fixed local-path-provisioner container image.

When the application is killed, the underlying data is synced to disk. Hard-removing the disks (as is happening here with DM/FUSE and similar) before the application is killed causes additional problems. The killall script isn't great, but provided it at least terminates the application, some sort of state can be managed.

yajo commented 1 month ago

⚠️ This configuration from https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2113157680 won't work:

services.k3s.extraFlags = toString [
  "--kubelet-arg="shutdownGracePeriod=30s"
  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
];

The docs say it clearly:

To configure these options, you edit the kubelet configuration file that is passed to kubelet via the --config flag

Instead, use this one:


{
  services.k3s.extraFlags = "--kubelet-arg=config=/etc/rancher/k3s/kubelet-config.yaml";
  environment.etc."rancher/k3s/kubelet-config.yaml" = {
    text = builtins.toJSON {
      apiVersion = "kubelet.config.k8s.io/v1beta1";
      kind = "KubeletConfiguration";
      shutdownGracePeriod = "30s";
      shutdownGracePeriodCriticalPods = "10s";
    };
  };
}
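
A related caveat worth verifying: the kubelet's graceful node shutdown relies on a systemd-logind inhibitor delay lock, and logind caps that delay at InhibitDelayMaxSec (5 seconds by default), so a 30s grace period may be cut short unless the cap is raised. Untested, but on NixOS that could look like:

{
  # Raise logind's inhibitor delay cap so a 30s kubelet grace period isn't cut short
  # (logind defaults InhibitDelayMaxSec to 5 seconds).
  services.logind.extraConfig = ''
    InhibitDelayMaxSec=30
  '';
}
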
bbigras commented 1 month ago

Do other people have opinions about using the kill-all script? Maybe we could add a k3s module to srvos.

superherointj commented 1 month ago

We should probably expand the NixOS module with kubelet config options and codify the graceful node shutdown solution as a sane default, because everyone will need this. [If this is actually working; I have not tested it yet.]

And some other options. It was very tricky to get some of the escaping right. For example:

--kube-apiserver-arg=${ "'allow-privileged=true'" } \

Having to escape like that is not obvious.
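
For what it's worth, lib.escapeShellArg produces the same single-quoted form and may be easier to read (untested sketch; lib here is the usual module argument):

  services.k3s.extraFlags = toString [
    # escapeShellArg wraps the value in single quotes, equivalent to the manual quoting above.
    "--kube-apiserver-arg=${lib.escapeShellArg "allow-privileged=true"}"
  ];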

superherointj commented 1 month ago

Do other people have opinions about using the kill-all script?

For which purpose? For working around the data corruption issue, the current proposed solution is to use graceful node shutdown.

Maybe we could add a k3s module to srvos.

I don't have a need for that. But we have docs. If you think some example is missing, docs can be improved.

bbigras commented 1 month ago

For which purpose? For working around the data corruption issue, the current proposed solution is to use graceful node shutdown.

I thought that this comment https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2114508450 suggested the kill-all script instead of the graceful shutdown.

superherointj commented 1 month ago

I thought that this comment #255783 (comment) suggested the kill-all script instead of the graceful shutdown.

Thanks for clarifying. In this case, it seems the upstream patch should be applied, and then the killall script isn't necessary. Does that make sense?

yajo commented 1 month ago

I think https://github.com/NixOS/nixpkgs/pull/328385 should appear in the changelog.

AFAICS it won't be compatible with the current solution from https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2230357277.

superherointj commented 3 days ago

Big news: @rorosen's solution got merged: https://github.com/rancher/local-path-provisioner/pull/445

A beautiful solution awaits. :-)

Thanks @rorosen for the amazing work you are doing!