NixOS / nixpkgs


k3s: data corruption on ordinary system shutdown #255783

Open KyleSanderson opened 11 months ago

KyleSanderson commented 11 months ago

Describe the bug

Containers and k3s are not stopped before the filesystems are unmounted. This results in a complete loss of state and data during a completely ordinary system shutdown.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Shut down your system.
  2. Watch containerd shut down after your disks have already been unmounted, resulting in corruption and data loss.

Expected behavior

Shutdown order: run k3s-killall.sh before the filesystems are unmounted (umount).
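
For illustration, one (untested) way to express that ordering on NixOS would be to tie k3s.service to the mount units holding its state, so systemd stops k3s before unmounting them; the path below is just an example:

  # Untested sketch: RequiresMountsFor adds Requires=/After= on the mount units
  # for this path, so on shutdown systemd stops k3s before unmounting them.
  # "/var/lib/rancher" is illustrative; use wherever your k3s/container state lives.
  systemd.services.k3s.unitConfig.RequiresMountsFor = "/var/lib/rancher";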


Notify maintainers

cc @euank @Mic92 @yajo

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.52, NixOS, 23.05 (Stoat), 23.05.3427.e5f018cf150e`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.13.5`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
euank commented 11 months ago

Can you provide a little more information about this? Is this a nixos bug, or a k3s/containerd/something-else bug?

Looking upstream at the k3s.service they ship, I see no ExecStop or such, which is the main thing that makes me question if this is NixOS specific.

I also haven't observed issues here. I haven't yet seen anything I'd consider data-loss or corruption when shutting down (or just yanking the power cord) on my machines running k3s.

So, to try and ask some clarifying questions:

  1. What state/data are you referring to? The k3s state (i.e. the info about what pods should be running, etc, stored in sqlite or etcd by k3s)? The state of specific pods or containers?
  2. If k3s state, what datastore, and is the datastore saved on a separate disk (i.e. /var/lib/rancher or such isn't on your root filesystem)? What's the underlying filesystem?
  3. If it's the containers' state, what volume plugin, and again is it a separate filesystem they're stored on? What's the underlying filesystem?
KyleSanderson commented 11 months ago

Can you provide a little more information about this? Is this a nixos bug, or a k3s/containerd/something-else bug?

Looking upstream at the k3s.service they ship, I see no ExecStop or such, which is the main thing that makes me question if this is NixOS specific.

That is correct, I've had to update this on Ubuntu. It's not a native package there, though.

I also haven't observed issues here. I haven't yet seen anything I'd consider data-loss or corruption when shutting down (or just yanking the power cord) on my machines running k3s.

So, to try and ask some clarifying questions:

  1. What state/data are you referring to? The k3s state (i.e. the info about what pods should be running, etc, stored in sqlite or etcd by k3s)? The state of specific pods or containers?

The containers all operate as if they crashed, with sqlite taking the brunt of the damage. The same goes for applications that report their network state back on shutdown, which cannot happen when this corruption occurs.

  2. If k3s state, what datastore, and is the datastore saved on a separate disk (i.e. /var/lib/rancher or such isn't on your root filesystem)? What's the underlying filesystem?

Local xfs disk.

  3. If it's the containers' state, what volume plugin, and again is it a separate filesystem they're stored on? What's the underlying filesystem?

(Different) local xfs disks.

euank commented 11 months ago

Thanks for the quick response and additional info, appreciated! I'm still not sure I completely follow or see any issue here though

The containers all operate as if they crashed

To me, that sounds like it's working as intended so far for a default k8s setup.

Have you set the k8s graceful node shutdown options (off by default) for the kubelet? I think enabling them is what's supposed to make pods terminate on kubelet shutdown in more cases.

I don't use those options personally, so I can't say they do work for the nixos package of k3s, but without those set, I think it's more or less expected behavior for pods to not shutdown gracefully on node termination.

with sqlite taking the brunt of the damage

Can you explain what you mean there? sqlite on XFS shouldn't become corrupted, even with an unexpected power loss.

Is it becoming corrupted where k3s can't read it anymore on restart? Something else?

The same goes for applications that report their network state back on shutdown, which cannot happen when this corruption occurs

That's application-specific logic running on k3s, right? K8s in general doesn't guarantee graceful shutdown of pods (it can't, you might pull the plug on the machine), so it seems like that logic would need to handle this case anyway somehow.

Is there other k3s data-loss or corruption you can be more specific about, for example "After a restart, running this 'k3s kubectl' command gives an error, or presents corrupt data", or "k3s or containerd refuses to launch with this error"?

KyleSanderson commented 11 months ago

Thanks for the quick response and additional info, appreciated! I'm still not sure I completely follow or see any issue here though

The containers all operate as if they crashed

To me, that sounds like it's working as intended so far for a default k8s setup.

Have you set the k8s graceful node shutdown options (off by default) for the kubelet? I think enabling them is what's supposed to make pods terminate on kubelet shutdown in more cases.

I don't use those options personally, so I can't say they do work for the nixos package of k3s, but without those set, I think it's more or less expected behavior for pods to not shutdown gracefully on node termination.

That's OK, but again, the disks are forcibly unmounted while the containers are still running, before they're ever notified that a shutdown is in progress. That's not good: it clearly leaves the entire application state unclean and, in some cases, corrupted.

with sqlite taking the brunt of the damage

Can you explain what you mean there? sqlite on XFS shouldn't become corrupted, even with an unexpected power loss.

I'm not sure; a couple of applications in containers just died on me when this happened. Adding the shutdown hook fixed the corruption.

yajo commented 11 months ago

Adding the shutdown hook fixed the corruption.

So, you found a workaround? Could you share it please?

KyleSanderson commented 11 months ago
[Unit]
Description=rke2-cleanup
DefaultDependencies=no
Before=shutdown.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/rke2-killall.sh
TimeoutStartSec=0

[Install]
WantedBy=shutdown.target

Then you change the Requires= on the production service to point to the mounts. Obviously there are better solutions.
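
A rough NixOS translation of that unit might look like this (untested sketch; the script path mirrors the original and would need to point at the packaged killall script on NixOS):

  systemd.services.rke2-cleanup = {
    description = "rke2-cleanup";
    before = [ "shutdown.target" ];
    wantedBy = [ "shutdown.target" ];
    unitConfig.DefaultDependencies = "no";
    serviceConfig = {
      Type = "oneshot";
      # Path copied from the unit above; on NixOS this would come from the package.
      ExecStart = "/usr/local/bin/rke2-killall.sh";
      TimeoutStartSec = 0;
    };
  };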

Mic92 commented 11 months ago
[Unit]
Description=rke2-cleanup
DefaultDependencies=no
Before=shutdown.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/rke2-killall.sh
TimeoutStartSec=0

[Install]
WantedBy=shutdown.target

Then you change the Requires= on the production service to point to the mounts. Obviously there are better solutions.

Sounds like a good solution to me.

yajo commented 11 months ago

ExecStart=/usr/local/bin/rke2-killall.sh

How come you use rke2-killall.sh to kill k3s workloads?

Mic92 commented 11 months ago

I guess it should be this: https://docs.k3s.io/upgrades/killall

Mic92 commented 11 months ago

The killall script from k3s doesn't look any better than what systemd would do either: https://github.com/k3s-io/k3s/blob/550dd0578f79882e1a78d8468fdbefa95faa145c/package/rpm/install.sh#L571 It also just uses kill -9.

Mic92 commented 11 months ago

Also the rke2 variant doesn't look great: https://github.com/rancher/rke2/blob/7466261e4792e68baa2cc0c2afd3dcc929d72061/bundle/bin/rke2-killall.sh#L26

Mic92 commented 11 months ago

A proper shutdown would probably drain the node first, but then what about stateful data?

yajo commented 11 months ago

Drain is too much for a simple reboot. HA apps should survive a node being rebooted without problem. Actually, they should survive a node explosion.

However, it would be useful to at least cordon the node before shutdown.
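
For illustration, a cordon-before-shutdown hook might look roughly like this (untested sketch; it assumes the node name matches the hostname, %H, and that the API is still reachable from this node during shutdown):

  systemd.services.k3s-cordon = {
    description = "Cordon this node before shutdown";
    after = [ "k3s.service" ];
    requires = [ "k3s.service" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
      # Ordered after k3s, so at shutdown this unit stops (and cordons) before k3s does.
      ExecStart = "${pkgs.k3s}/bin/k3s kubectl uncordon %H";
      ExecStop = "${pkgs.k3s}/bin/k3s kubectl cordon %H";
    };
  };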

Mic92 commented 11 months ago

Does cordon also stop containers? It doesn't look like it does.

yajo commented 11 months ago

No, it just informs k3s that the node is under maintenance. However, I'm thinking that adding that automation by default could have undesired side effects. For example, if the node shuts down for unexpected reasons (an upgrade, low on battery... who knows?), workloads wouldn't be rescheduled by k8s. That's probably bad in some scenarios.

KyleSanderson commented 11 months ago

There is also this bug with a classic switch: https://github.com/NixOS/nixpkgs/issues/255768. The kill, although brutal, would help there as well.

yajo commented 11 months ago

Going back to this issue, I think this is how k3s is designed. Quoting from the docs:

To allow high availability during upgrades, the K3s containers continue running when the K3s service is stopped.

To stop all of the K3s containers and reset the containerd state, the k3s-killall.sh script can be used.

However, I agree that nixos could provide an option to add that killall as a pre-poweroff hook.

zarelit commented 6 months ago

Just passing by and linking a very related issue: https://github.com/NixOS/nixpkgs/issues/98090

superherointj commented 3 months ago

Now that the k3s-killall.sh script is available, it is possible to trigger it on shutdown. I'm considering this possible solution:

  systemd.services.killall-k3s = {
    # "killall-k3s" is used as service name, because naming `k3s-*` introduces a loop at `k3s-killall.sh`, delaying shutdown.
    description = "Executes k3s-killall.sh on shutdown";
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      User = "root";
      # RemainAfterExit=true is required when ExecStart is missing because systemd won't
      # attempt to run ExecStop if it thinks that the service is not running.
      # RemainAfterExit=true causes systemd to believe that the service is running,
      # thereby causing it to run ExecStop at shutdown.
      RemainAfterExit = "yes";
      Restart = "no";
      # Shutdown order is the reverse of startup order. Hence the script has to be placed in ExecStop.
      ExecStop = "${pkgs.k3s}/bin/k3s-killall.sh";
    };
  };

@KyleSanderson Can you test if this solves the data corruption problem you mentioned?

If it fixes the problem, we could add this to the k3s module. Thoughts?

Am I missing something?

Update: This solution isn't ideal because it doesn't do a graceful shutdown. I'm only keeping it here as a reference for running a script at shutdown.

rorosen commented 3 months ago

I have not experienced data corruption on node shutdown so far, but it's true that node shutdown isn't flawless. If you want to make sure that all containers on a node are terminated properly (if possible) before the node shuts down, you should use the graceful node shutdown feature, as mentioned by @euank. I use it for all nodes and am very satisfied. You could do this by configuring the kubelet accordingly, something along these lines (not tested):

- services.k3s.extraFlags = toString [
-  "--kubelet-arg="shutdownGracePeriod=30s"
-  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
-];
+ {
+  services.k3s.extraFlags = "--kubelet-arg=config=/etc/rancher/k3s/kubelet-config.yaml";
+  environment.etc."rancher/k3s/kubelet-config.yaml" = {
+    text = builtins.toJSON {
+      apiVersion = "kubelet.config.k8s.io/v1beta1";
+      kind = "KubeletConfiguration";
+      shutdownGracePeriod = "30s";
+      shutdownGracePeriodCriticalPods = "10s";
+    };
+  };
+}

The killall script isn't so graceful and will just send SIGKILL to all containerd processes that were spawned by k3s. This should also happen at some point during shutdown, so I'm not sure that the killall script will fix whatever causes your corruption.

There is also an issue with local-path-provisioner v0.0.26, which the latest k3s release uses by default, that causes it to ignore shutdown requests. The underlying issue originates in the sig-storage-lib-external-provisioner dependency and is already fixed on master there (by #159). However, there is no release that includes the fix, and it hasn't been backported so far, although there is a corresponding issue. The whole thing affected me in the sense that systemd ran into a timeout when shutting down the local-path-provisioner pod; I haven't noticed it causing any data corruption. Anyway, if this is also an issue for you, I'm happy to provide a patch that lets you build a fixed local-path-provisioner container image.

EDIT: Thanks to @yajo for pointing out that the configuration snippet would never work (the quotes weren't even valid...). I copied the working config from his comment

KyleSanderson commented 3 months ago

I have not experienced data corruption on node shutdown so far, but it's true that node shutdown isn't flawless. If you want to make sure that all containers on a node are terminated properly (if possible) before the node shuts down, you should use the graceful node shutdown feature, as mentioned by @euank. I use it for all nodes and am very satisfied. You could do this by configuring the kubelet accordingly, something along these lines (not tested):

services.k3s.extraFlags = toString [
  "--kubelet-arg="shutdownGracePeriod=30s"
  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
];

The killall script isn't so graceful and will just send SIGKILL to all containerd processes that were spawned by k3s. This should also happen at some point during shutdown, so I'm not sure that the killall script will fix whatever causes your corruption.

There is also an issue with local-path-provisioner v0.0.26, which the latest k3s release uses by default, that causes it to ignore shutdown requests. The underlying issue originates in the sig-storage-lib-external-provisioner dependency and is already fixed on master there (by #159). However, there is no release that includes the fix, and it hasn't been backported so far, although there is a corresponding issue. The whole thing affected me in the sense that systemd ran into a timeout when shutting down the local-path-provisioner pod; I haven't noticed it causing any data corruption. Anyway, if this is also an issue for you, I'm happy to provide a patch that lets you build a fixed local-path-provisioner container image.

When the application is killed, the underlying data is synced to disk. Hard-removing the disks (as is happening here with DM/FUSE and similar) before the application is killed causes additional problems. The killall script isn't great, but provided it at least terminates the application, some sort of state can be managed.

yajo commented 1 month ago

⚠️ This configuration from https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2113157680 won't work:

services.k3s.extraFlags = toString [
  "--kubelet-arg="shutdownGracePeriod=30s"
  "--kubelet-arg="shutdownGracePeriodCriticalPods=10s"
];

The docs say it clearly:

To configure these options, you edit the kubelet configuration file that is passed to kubelet via the --config flag

Instead, use this one:


{
  services.k3s.extraFlags = "--kubelet-arg=config=/etc/rancher/k3s/kubelet-config.yaml";
  environment.etc."rancher/k3s/kubelet-config.yaml" = {
    text = builtins.toJSON {
      apiVersion = "kubelet.config.k8s.io/v1beta1";
      kind = "KubeletConfiguration";
      shutdownGracePeriod = "30s";
      shutdownGracePeriodCriticalPods = "10s";
    };
  };
}
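
A related caveat worth verifying: the kubelet's graceful node shutdown relies on a systemd-logind inhibitor delay lock, and logind caps that delay at InhibitDelayMaxSec (5 seconds by default), so a 30s grace period may be cut short unless the cap is raised. Untested, but on NixOS that could look like:

{
  # Raise logind's inhibitor delay cap so a 30s kubelet grace period isn't cut short
  # (logind defaults InhibitDelayMaxSec to 5 seconds).
  services.logind.extraConfig = ''
    InhibitDelayMaxSec=30
  '';
}
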
bbigras commented 1 month ago

Do other people have opinions about using the kill-all script? Maybe we could add a k3s module to srvos.

superherointj commented 1 month ago

We should probably expand the NixOS module with kubelet config options and codify the graceful node shutdown solution as a sane default, because everyone will need this. [If this is actually working; I have not tested it yet.]

And some other options. It was very tricky to get some of the escaping right. For example:

--kube-apiserver-arg=${ "'allow-privileged=true'" } \

Having to escape like that is not obvious.
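
For what it's worth, lib.escapeShellArg produces the same single-quoted form and may be easier to read (untested sketch; lib here is the usual module argument):

  services.k3s.extraFlags = toString [
    # escapeShellArg wraps the value in single quotes, equivalent to the manual quoting above.
    "--kube-apiserver-arg=${lib.escapeShellArg "allow-privileged=true"}"
  ];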

superherointj commented 1 month ago

Do other people have opinions about using the kill-all script?

For which purpose? For working around the data corruption issue, the current proposed solution is to use graceful node shutdown.

Maybe we could add a k3s module to srvos.

I don't have a need for that. But we have docs. If you think some example is missing, docs can be improved.

bbigras commented 1 month ago

For which purpose? For working around the data corruption issue, the current proposed solution is to use graceful node shutdown.

I thought that this comment https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2114508450 suggested the kill-all script instead of the graceful shutdown.

superherointj commented 1 month ago

I thought that this comment #255783 (comment) suggested the kill-all script instead of the graceful shutdown.

Thanks for clarifying. In this case, it seems the upstream patch should be applied, and then the killall script isn't necessary. Does that make sense?

yajo commented 1 month ago

I think https://github.com/NixOS/nixpkgs/pull/328385 should appear in the changelog.

AFAICS it won't be compatible with the current solution from https://github.com/NixOS/nixpkgs/issues/255783#issuecomment-2230357277.

superherointj commented 3 days ago

Big news: @rorosen's solution got merged: https://github.com/rancher/local-path-provisioner/pull/445

A beautiful solution awaits. :-)

Thanks @rorosen for the amazing work you are doing!