coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Zincati fails to query interactive sessions #1780

Open rocketraman opened 4 weeks ago

rocketraman commented 4 weeks ago

Describe the bug

I'm experimenting with Fedora CoreOS in a VM. I'm finding my VM just randomly reboots, even while I'm in the middle of working on it.

Inspecting the logs via journalctl I see this:

Aug 16 09:02:46 myhost zincati[4777]: [ERROR zincati::update_agent] failed to check for interactive sessions: failed to deserialize output of `loginctl`
Aug 16 09:02:46 myhost zincati[4777]: [WARN  zincati::update_agent] assuming no active sessions and proceeding anyway
Aug 16 09:02:46 myhost zincati[4777]: [INFO  zincati::update_agent::actor] staged deployment '40.20240728.3.0' available, proceeding to finalize it
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Loaded sysroot
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Locked sysroot
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Initiated txn FinalizeDeployment for client(dbus:1.5829 unit:zincati.service uid:981): /org/projectatomic/rpmostree1/fedora_coreos
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Process [pid: 1718634 uid: 981 unit: zincati.service] connected to transaction progress
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Finalized deployment; rebooting into 2098f40910d5d7c0171a8f7173d81cb070f4047e1e81d9d5228d5d62e24fe722
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Txn FinalizeDeployment on /org/projectatomic/rpmostree1/fedora_coreos successful
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Unlocked sysroot
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Initiating reboot requested from transaction
Aug 16 09:02:46 myhost systemd-logind[1165]: The system will reboot now!
Aug 16 09:02:46 myhost systemd-logind[1165]: System is rebooting.
Aug 16 09:02:46 myhost rpm-ostree[1716434]: Process [pid: 1718634 uid: 981 unit: zincati.service] disconnected from transaction progress
Aug 16 09:02:46 myhost rpm-ostree[1716434]: In idle state; will auto-exit in 60 seconds
Aug 16 09:02:46 myhost zincati[4777]: [INFO  zincati::update_agent::actor] update finalized: 40.20240728.3.0
Aug 16 09:02:46 myhost zincati[4777]: [INFO  zincati::update_agent::actor] update applied, waiting for reboot: 40.20240728.3.0

From these logs it seems like zincati is checking if there are any active sessions, failing to do so ("failed to check for interactive sessions: failed to deserialize output of loginctl"), and then proceeding to assume there are none and rebooting anyway (!) ("assuming no active sessions and proceeding anyway").

This is annoying to say the least.

Reproduction steps

  1. Install Fedora CoreOS
  2. Wait

Expected behavior

System should not reboot automatically while it is being used!

Actual behavior

System rebooted automatically while being used.

System details

QEMU

NAME="Fedora Linux"
VERSION="40.20240728.3.0 (CoreOS)"
ID=fedora
VERSION_ID=40
VERSION_CODENAME=""
PLATFORM_ID="platform:f40"
PRETTY_NAME="Fedora CoreOS 40.20240728.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:40"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=40
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=40
SUPPORT_END=2025-05-13
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='40.20240728.3.0'

Butane or Ignition config

variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 xxx

Additional information

The output of loginctl is:

$ loginctl
SESSION  UID USER SEAT TTY STATE  IDLE SINCE
      2 1000 core -    -   active no   -

1 sessions listed.

hrismarin commented 3 weeks ago

You can control when nodes are allowed to reboot to finalize the update. See OS update finalization.

rocketraman commented 3 weeks ago

Surely the default should not be to randomly reboot while the box is active?

In any case, the examples given there use Butane to provision the OS with those settings. How can I apply these settings to an already created box?

hrismarin commented 3 weeks ago

For me personally, the immediate strategy makes sense as it finalizes updates as soon as possible.

There is a link at the bottom of the OS update finalization docs section that refers to the Zincati updates strategy docs page.

Here's how to customize Zincati at runtime.
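
As a concrete illustration of runtime customization: Zincati reads TOML drop-ins from /etc/zincati/config.d/, so an existing machine can be switched to the periodic strategy without reprovisioning. The sketch below is in Python only to keep it self-contained; the file name 55-updates-strategy.toml, the helper write_strategy, and the window values are illustrative assumptions, so check them against the Zincati docs before relying on them.

```python
from pathlib import Path

# Periodic strategy: only finalize updates inside the listed windows.
# The days, start time, and length below are placeholder values.
DROP_IN = """\
[updates]
strategy = "periodic"

[[updates.periodic.window]]
days = [ "Sat", "Sun" ]
start_time = "22:30"
length_minutes = 60
"""


def write_strategy(config_dir="/etc/zincati/config.d",
                   name="55-updates-strategy.toml"):
    """Write the drop-in and return its path (writing to /etc needs root)."""
    path = Path(config_dir) / name
    path.write_text(DROP_IN)
    return path
```

Call write_strategy() as root (or simply create the file by hand with the same contents), then restart the agent with `sudo systemctl restart zincati.service` so the new strategy is picked up.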

jlebon commented 3 weeks ago

> Surely the default should not be to randomly reboot while the box is active?

Zincati does check if someone is logged into the system and warns before rebooting, but otherwise yes that is indeed the default. The expectation is that if this doesn't work for you, then you need to configure fancier strategies in Zincati using e.g. the periodic or fleetlock strategies.

jlebon commented 3 weeks ago

In this case, it seems like Zincati failed to query the list of active users. As a sanity-check, what does loginctl list-sessions --output=json show?

> Aug 16 09:02:46 myhost zincati[4777]: [ERROR zincati::update_agent] failed to check for interactive sessions: failed to deserialize output of `loginctl`

We should probably just dump the output we failed to deserialize there.

rocketraman commented 3 weeks ago

> In this case, it seems like Zincati failed to query the list of active users. As a sanity-check, what does loginctl list-sessions --output=json show?

[{"session":"2","uid":1000,"user":"core","seat":null,"tty":null,"state":"active","idle":false,"since":null}]

> Aug 16 09:02:46 myhost zincati[4777]: [ERROR zincati::update_agent] failed to check for interactive sessions: failed to deserialize output of `loginctl`
>
> We should probably just dump the output we failed to deserialize there.

That would be helpful in debugging this issue.
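
For reference, the JSON above is well-formed, and a lenient parser reads it without trouble. One hypothesis, not confirmed anywhere in this thread, is that a strictly typed deserializer expecting string values for fields such as seat or tty would reject the null entries. A minimal Python sketch of that failure mode follows; the parse_strict helper and its field list are hypothetical and are not Zincati's actual code.

```python
import json

# Output of `loginctl list-sessions --output=json` as posted above.
RAW = '[{"session":"2","uid":1000,"user":"core","seat":null,"tty":null,"state":"active","idle":false,"since":null}]'

# Hypothetical strict schema: every listed field must be a string.
STRICT_STRING_FIELDS = ("session", "user", "tty")


def parse_strict(raw):
    """Reject any session whose listed fields are missing or not strings."""
    sessions = json.loads(raw)
    for s in sessions:
        for field in STRICT_STRING_FIELDS:
            if not isinstance(s.get(field), str):
                raise ValueError(
                    f"field {field!r} is not a string: {s.get(field)!r}"
                )
    return sessions


def parse_lenient(raw):
    """Accept null for optional fields such as seat and tty."""
    return json.loads(raw)


if __name__ == "__main__":
    print(len(parse_lenient(RAW)), "session(s) parsed leniently")
    try:
        parse_strict(RAW)
    except ValueError as err:
        print("strict parse failed:", err)
```

If this is indeed the cause, treating those fields as optional in the deserializer would be the natural fix, but that remains to be verified against the real code.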

rocketraman commented 3 weeks ago

> Aug 16 09:02:46 myhost zincati[4777]: [WARN  zincati::update_agent] assuming no active sessions and proceeding anyway

Also, I would suggest this is a bad assumption to make.
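
To make the suggestion concrete, here is a minimal Python sketch of the opposite, fail-closed policy: if the session list cannot be read, the reboot is withheld instead of assumed safe. The function names are hypothetical, and treating a session with state "active" as the blocking condition is an assumption for illustration; Zincati's real criteria may differ.

```python
import json
import subprocess


def has_active_sessions(raw):
    """Return True if any session is active; raise ValueError on bad input."""
    sessions = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(sessions, list):
        raise ValueError("expected a JSON array of sessions")
    return any(
        isinstance(s, dict) and s.get("state") == "active" for s in sessions
    )


def safe_to_reboot(raw):
    """Fail closed: if the session list cannot be read, do not reboot."""
    try:
        return not has_active_sessions(raw)
    except ValueError:
        return False


def query_loginctl():
    """Run the command discussed in this thread and return its stdout."""
    result = subprocess.run(
        ["loginctl", "list-sessions", "--output=json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With this policy, safe_to_reboot(query_loginctl()) on the machine above would return False, and unparseable output would also block the reboot rather than allow it.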