coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

ostree-finalize-staged.service times out on slow hardware #1824

Open AdmiralNemo opened 2 hours ago

AdmiralNemo commented 2 hours ago

Describe the bug

On certain slow systems, such as Raspberry Pis, the ostree-finalize-staged.service unit often fails because it exceeds its stop timeout before finishing its work. This frequently results in reboot loops: Zincati repeatedly tries to apply the update, finalization fails, and the machine boots back into the old version.

Oct 29 19:17:15 k8s-aarch64-n0.zzz.xyz systemd[1]: ostree-finalize-staged.service: Stopping timed out. Aborting.
Oct 29 19:17:15 k8s-aarch64-n0.zzz.xyz systemd[1]: ostree-finalize-staged.service: Killing process 17616 (ostree) with signal SIGABRT.
Oct 29 19:17:23 k8s-aarch64-n0.zzz.xyz systemd[1]: ostree-finalize-staged.service: Control process exited, code=dumped, status=6/ABRT
Oct 29 19:17:23 k8s-aarch64-n0.zzz.xyz systemd[1]: ostree-finalize-staged.service: Failed with result 'timeout'.
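
For anyone checking whether a node hit the same failure, here is a minimal sketch that counts the timeout result line shown above in journal text read from stdin. The function name count_finalize_timeouts is hypothetical, and the match string is taken verbatim from the log excerpt in this report.

```shell
#!/bin/sh
# Hypothetical helper: count how many times the finalize unit failed with
# result 'timeout' in journal text supplied on stdin.
count_finalize_timeouts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the function usable under set -e.
    grep -c "ostree-finalize-staged.service: Failed with result 'timeout'" || true
}
```

A possible invocation, assuming the journal from the previous boot is still available: journalctl -b -1 -u ostree-finalize-staged.service | count_finalize_timeouts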

Reproduction steps

This is consistently reproducible for me on my "regular" Raspberry Pi 4b devices, with "generic" class 10 SD cards. I do not notice it on my CM4 devices that use eMMC.

Expected behavior

I would expect updates to succeed, regardless of how long they take.

Actual behavior

Especially when an upgrade requires "pruning" a previous version of FCOS, updates fail to apply and machines get stuck in a reboot loop. Manually running rpm-ostree cleanup -r usually resolves it.

Increasing the timeout with a unit drop-in configuration file also resolves the issue, for example:

[Service]
TimeoutStopSec=15m
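
As a rough sketch of how such a drop-in could be installed by hand: the function below writes the two lines above into the standard systemd drop-in directory for the unit. The function name write_dropin, the file name 10-timeout.conf, and the root-prefix argument (there only so the function can be tried against a scratch directory) are assumptions for illustration, not anything shipped by FCOS.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that raises the stop timeout for
# ostree-finalize-staged.service.
#   $1: root directory prefix ("" for the live system, or a scratch dir)
#   $2: timeout value in systemd time syntax, e.g. 15m
write_dropin() {
    dir="$1/etc/systemd/system/ostree-finalize-staged.service.d"
    mkdir -p "$dir"
    # File name 10-timeout.conf is an arbitrary, illustrative choice.
    printf '[Service]\nTimeoutStopSec=%s\n' "$2" > "$dir/10-timeout.conf"
    # Print the path that was written so the caller can inspect it.
    echo "$dir/10-timeout.conf"
}
```

On a real node this would be run as root with an empty prefix, i.e. write_dropin "" 15m, followed by systemctl daemon-reload so systemd picks up the new setting. The value 15m is simply what worked on my hardware, not a recommendation.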

System details

metal: Raspberry Pi 4b

State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Wed 2024-10-30 23:36:17 UTC)
BootedDeployment:
● fedora:fedora/aarch64/coreos/stable
                  Version: 40.20241006.3.0 (2024-10-21T14:06:10Z)
               BaseCommit: be9fd1180854f8f6e58e673c43e3e8dc7c5ce2ceeaec736d31e7d8fb62469c96
             GPGSignature: Valid signature by 115DF9AEF857853EE8445D0A0727707EA15B79CC
          LayeredPackages: pam_ssh_agent_auth

Butane or Ignition config

No response

Additional information

The ostree-finalize-staged.service unit file has this snippet:

# This is a quite long timeout intentionally; the failure mode
# here is that people don't get an upgrade.  We need to handle
# cases with slow rotational media, etc.
TimeoutStopSec=5m

I guess 5 minutes is probably not "quite long" enough?

cgwalters commented 2 hours ago

Do you have substantial data in /etc?

AdmiralNemo commented 2 hours ago

> Do you have substantial data in /etc?

Not really?

[core@k8s-aarch64-n0 ~]$ sudo du -xhs /etc
41M /etc

Most of that appears to come from /etc/selinux and /etc/udev. The only data I have added there is 27K in /etc/kubernetes