dustymabe commented 1 year ago

Bug Report

If I have zincati set to update only on, say, weekends:

# cat /etc/zincati/config.d/51-weekend-updates.toml 
# start at 11:00 UTC - 6AM EST
[updates]
strategy = "periodic"
[[updates.periodic.window]]
days = [ "Sat", "Sun" ]
start_time = "11:00"
length_minutes = 60

I'd expect that if a new build becomes available before my update window opens, my system would discard the pending/staged deployment and stage the newer one.
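To make the window in that config concrete, here is a minimal sketch of the check it implies. This is not zincati's code: the helper name `in_update_window` is hypothetical, the values are copied from the config above, and it assumes times are interpreted in UTC (as the config comment says) and that a window does not cross midnight.

```python
from datetime import datetime, timedelta, timezone

# Values copied from 51-weekend-updates.toml above.
WINDOW_DAYS = {"Sat", "Sun"}
START_TIME = "11:00"      # assumed to be UTC, per the config comment
LENGTH_MINUTES = 60

# datetime.weekday() returns 0 for Monday, so index this list with it.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def in_update_window(now_utc: datetime) -> bool:
    """Return True if now_utc falls inside a window that starts the same day.

    Hypothetical helper for illustration; windows crossing midnight
    are not handled.
    """
    hour, minute = (int(part) for part in START_TIME.split(":"))
    start = now_utc.replace(hour=hour, minute=minute, second=0, microsecond=0)
    end = start + timedelta(minutes=LENGTH_MINUTES)
    day = DAY_NAMES[now_utc.weekday()]
    return day in WINDOW_DAYS and start <= now_utc < end
```

With this config, a reboot is only permitted for one hour on Saturday and one hour on Sunday, so anything staged midweek sits pending for days.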

For example, this week we released two testing builds: 37.20230107.2.0 on Tuesday and 37.20230110.2.0 on Thursday. My system saw and staged 37.20230107.2.0 on Wednesday. Here is the current status:

# systemctl status zincati | cat
● zincati.service - Zincati Update Agent
     Loaded: loaded (/usr/lib/systemd/system/zincati.service; enabled; preset: enabled)
     Active: active (running) since Sat 2023-01-07 11:06:01 UTC; 6 days ago
       Docs: https://github.com/coreos/zincati
   Main PID: 1151 (zincati)
     Status: "update staged: 37.20230107.2.0; reboot pending due to update strategy"
      Tasks: 8 (limit: 4581)
     Memory: 16.4M
        CPU: 3min 48.607s
     CGroup: /system.slice/zincati.service
             └─1151 /usr/libexec/zincati agent -v

Jan 10 16:43:12 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 00:24:33 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 05:39:54 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 07:03:39 apu2 zincati[1151]: [INFO  zincati::update_agent::actor] target release '37.20230107.2.0' selected, proceeding to stage it
Jan 11 07:08:56 apu2 zincati[1151]: [INFO  zincati::update_agent::actor] update staged: 37.20230107.2.0

I would expect zincati to keep checking the update graph and, if the graph allows it, throw away the pending deployment and go straight to the next one.

Environment

Local bare metal x86_64 machine.

Expected Behavior

Pending deployment gets thrown away and newer update gets staged.

Actual Behavior

Pending (older) deployment appears to remain staged.

Reproduction Steps

This is hard because it requires the remote update server to be in certain states at different times. In summary:

Deploy a node with a periodic update strategy that only lets it update on certain days of the week.
Have a new release happen and let the node stage the update.
Have another release happen before the node's update window opens.
Notice that the system sticks with the old update and doesn't switch to the new one.

Other Information

[root@apu2 ~]# rpm-ostree status 
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 37.20230107.2.0; reboot pending due to update strategy
Deployments:
  fedora:fedora/x86_64/coreos/testing
                  Version: 37.20230107.2.0 (2023-01-09T18:09:12Z)
               BaseCommit: 181c145a3c9e200439016bbc78ac3cce501f596c20f37fe927af5096f38b00fd
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
                     Diff: 52 upgraded, 1 removed, 1 added
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd

● fedora:fedora/x86_64/coreos/testing
                  Version: 37.20221225.2.2 (2023-01-03T16:06:54Z)
               BaseCommit: e339f79de0d679296a875d8cb0c9d2fe39089f516ed14fb29705f472a85ccbd0
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd

  fedora:fedora/x86_64/coreos/testing
                  Version: 37.20221225.2.1 (2022-12-26T16:01:30Z)
               BaseCommit: 5f6f5e6ec7ad1ad7c49f29a44bce2b8432dfecb876ad174e1cd29566eacf2da1
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd
[root@apu2 ~]# 
[root@apu2 ~]# rpm -q zincati
zincati-0.0.25-1.fc37.x86_64

dustymabe commented 1 year ago

One particular reason this is important is that we typically only do ad-hoc, out-of-cycle releases when bugs/regressions have been introduced. With the current behavior, we can't prevent systems with periodic update strategies from booting into the buggy release.

Furthermore, their update window might not allow another update for some time, so they'd be on the buggy release for even longer.

jlebon commented 1 year ago

I think this is how the state machine was designed. As much work as possible is done upfront so that when the strategy says "go", it's just a simple reboot. Changing this sounds reasonable. E.g. in the worst case, if an update node's metadata changes to a dead end, that should absolutely block finalization and reset the state machine to go back to looking for the next update. In the case where the preferred node changed but the old node is still valid, maybe it should be up to the strategy logic whether swapping them out is permitted. For the periodic strategy, I could see an argument for not allowing it if the next window is in e.g. 10 minutes.
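As a rough sketch of the decision described above (not zincati's actual state machine): the names `Decision` and `decide`, and the `min_lead_minutes` threshold and its default, are all hypothetical and only illustrate the three cases: staged node became a dead end, preferred node changed with the old one still valid, and nothing changed.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    KEEP_STAGED = "keep the staged deployment"
    RESTAGE = "discard the staged deployment and stage the preferred one"
    RESET = "block finalization and go back to looking for an update"


def decide(
    staged: str,
    staged_is_dead_end: bool,
    preferred: Optional[str],
    minutes_to_window: int,
    min_lead_minutes: int = 30,  # hypothetical threshold, not a zincati setting
) -> Decision:
    """Decide what to do with a staged update after re-checking the graph."""
    # A dead end must never be finalized, whatever the strategy says.
    if staged_is_dead_end:
        return Decision.RESET
    # The graph still prefers what is already staged (or offers nothing else).
    if preferred is None or preferred == staged:
        return Decision.KEEP_STAGED
    # Preferred node changed but the old one is still valid: leave it to the
    # strategy. For a periodic strategy, skip the swap if the window is so
    # close that restaging might not finish in time.
    if minutes_to_window < min_lead_minutes:
        return Decision.KEEP_STAGED
    return Decision.RESTAGE
```

In the scenario from this report, `decide("37.20230107.2.0", False, "37.20230110.2.0", minutes_to_window=3000)` returns `Decision.RESTAGE`, while the same call with `minutes_to_window=10` keeps the staged deployment.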

zincati sticks with staged deployments even if newer is available #928