fedora-iot / greenboot

Generic Health Checking Framework for systemd
GNU Lesser General Public License v2.1

Greenboot boot_counter does not decrement if sudo rpm-ostree reset is run before reboot #107

Closed dhensel-rh closed 2 months ago

dhensel-rh commented 1 year ago

Issue: When a package is removed with rpm-ostree and an rpm-ostree reset (remove all mutations) is then performed before rebooting, Greenboot's behavior is affected. After the reboot, Greenboot does not perform the check, and the Greenboot boot_counter stays at its default value.

Steps to reproduce:

  1. Deploy a RHEL system with both Greenboot and rpm-ostree installed
  2. Remove a package: rpm-ostree override remove hostname
  3. Reset rpm-ostree: rpm-ostree reset
  4. Perform a system reboot

Expected Result: Greenboot should notice that a package is missing and attempt to restore the last known good state

Actual Result: Greenboot does not attempt to fix itself. The Greenboot boot_counter remains at its default value

Additional notes (if any): The ostree status: GREENBOOT_WATCHDOG_CHECK_ENABLED=true Greenboot variables: boot_counter=2

A further system reboot clears the boot flag and restores the system to a known good state

say-paul commented 1 year ago

@dhensel-rh According to the documentation, rpm-ostree reset removes any mutation, so the hostname package that was removed by rpm-ostree override remove hostname is restored when the reset is triggered. You can verify this by checking rpm-ostree status before and after step 3.

I am not sure why the boot_counter is still set, though. Please share the journald logs, taken after the reboot, for these services: greenboot-grub2-set-counter, greenboot-healthcheck, greenboot-grub2-set-success

miabbott commented 1 year ago

This looks like https://bugzilla.redhat.com/show_bug.cgi?id=2185901 ?

LorbusChris commented 1 year ago

If I'm not mistaken rpm-ostree override remove <pkg> will trigger ostree-finalize-staged.service, which will pull in greenboot-grub2-set-counter.service with ExecStart=/usr/libexec/greenboot/greenboot-grub2-set-counter.

It's also possible that rpm-ostree reset triggers ostree-finalize-staged.service again (I don't know whether it does).

Either way, there is nothing telling grub to unset the boot_counter variable again in this case. If rpm-ostree reset triggers ostree-finalize-staged.service a second time, it might suffice to make the greenboot-grub2-set-counter script smarter here (e.g. by checking rpm-ostree status and somehow determining that the last action was reset, and then unsetting the boot_counter var).
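
A rough sketch of that idea, under several assumptions: that jq is available, that `rpm-ostree status --json` reports `booted`, `staged` and `checksum` for each deployment, and that "the staged deployment is the same commit as the booted one" is an acceptable stand-in for "the last action was reset". The function names are made up for illustration; this is not what greenboot-grub2-set-counter does today.

```shell
#!/bin/bash
# Sketch only: if the staged deployment is the same commit as the booted one,
# treat the pending change as a no-op and drop the boot counter.
# All function names here are hypothetical.

# Succeeds when both checksums are non-empty and identical.
same_commit() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Print the checksum of the deployment whose boolean field $1
# ("booted" or "staged") is true. Requires jq.
deployment_checksum() {
    rpm-ostree status --json |
        jq -r --arg f "$1" '.deployments[] | select(.[$f] == true) | .checksum'
}

# Unset boot_counter in grubenv when nothing effectively changed.
maybe_clear_boot_counter() {
    local booted staged
    booted="$(deployment_checksum booted)"
    staged="$(deployment_checksum staged)"
    if same_commit "$booted" "$staged"; then
        grub2-editenv - unset boot_counter
    fi
}
```

Whether comparing checksums is a reliable way to detect a reset would need to be confirmed against real rpm-ostree behavior before anything like this goes into the script.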

dhensel-rh commented 1 year ago

sudo journalctl -o cat -u greenboot-grub2-set-success

Starting Mark boot as successful in grubenv...
Finished Mark boot as successful in grubenv.

sudo journalctl -o cat -u greenboot-grub2-set-counter

Starting Set grub2 boot counter in preparation of upgrade...
GRUB2 environment variables have been set for system upgrade. Max boot attempts is 3
Finished Set grub2 boot counter in preparation of upgrade.
greenboot-grub2-set-counter.service: Deactivated successfully.
Stopped Set grub2 boot counter in preparation of upgrade.

sudo journalctl -o cat -u greenboot-healthcheck

Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
No watchdog on the system, skipping check
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.
greenboot-healthcheck.service: Deactivated successfully.
Stopped greenboot Health Checks Runner.
greenboot-healthcheck.service: Consumed 57.582s CPU time.
Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Running greenboot Wanted Health Check Scripts
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
boot_counter=3
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.1
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
  rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0 (rollback)
    Version: 9.2
    origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.

say-paul commented 2 months ago

So greenboot needs a failure in a health check script before it takes any action (reboot/rollback). What is needed here is a script, possibly in required.d/, that checks for the packages of interest and returns a failure if there are any discrepancies.
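
A minimal sketch of such a check, assuming it is dropped into the required checks directory (for example under /etc/greenboot/check/required.d/). The list file path, the `PKG_QUERY` override and the function name are hypothetical; the only thing relied on is that `rpm -q NAME` exits non-zero when NAME is not installed, and that a non-zero exit from a required script counts as a failure.

```shell
#!/bin/bash
# Sketch of a required health check that fails when a package of interest
# is missing. File name and list location are assumptions for illustration.

# Hypothetical list file: one package name per line, '#' starts a comment.
PKG_LIST_FILE="${PKG_LIST_FILE:-/etc/greenboot/required-packages.list}"

# Command used to test for a package. `rpm -q NAME` exits non-zero when NAME
# is not installed. Overridable so the logic can be tried without rpm.
PKG_QUERY="${PKG_QUERY:-rpm -q}"

check_required_packages() {
    local pkg missing=0

    # No list file means nothing to verify; treat that as success.
    [ -r "$PKG_LIST_FILE" ] || return 0

    while read -r pkg _ || [ -n "$pkg" ]; do
        # Skip blank lines and comments.
        case "$pkg" in ''|'#'*) continue ;; esac
        if ! $PKG_QUERY "$pkg" </dev/null >/dev/null 2>&1; then
            echo "Required package '$pkg' is not installed" >&2
            missing=1
        fi
    done < "$PKG_LIST_FILE"

    return "$missing"
}

# The script's exit status is the status of this last command.
check_required_packages
```

With `hostname` in the list file, the scenario from this issue would make the script exit with 1 only if the package is actually absent on the booted deployment; after an rpm-ostree reset the package is back, so the check would pass.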