dhensel-rh closed this issue 2 months ago.
@dhensel-rh `rpm-ostree reset`, as the documentation says, removes any mutation. So the `hostname` package that was removed as part of `rpm-ostree override remove hostname` gets restored when the reset is triggered. You can verify this by checking `rpm-ostree status` before and after step 2.
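For reference, a minimal sketch of that verification (the `hostname` package is just the example used in this issue):

```bash
# Remove a base package (step 2 of the reproducer), then inspect the deployments
sudo rpm-ostree override remove hostname
rpm-ostree status   # the pending deployment should list the removed base package

# Undo all mutations; the package is part of the base image again
sudo rpm-ostree reset
rpm-ostree status   # the override should be gone from the pending deployment
```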
I am not sure, though, why the boot_counter is still set. Please share the post-reboot journald logs of these services: `greenboot-grub2-set-counter`, `greenboot-healthcheck`, and `greenboot-grub2-set-success`.
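Something along these lines should capture them (assuming the logs from the current boot are the ones of interest):

```bash
# Collect the post-reboot logs for the three greenboot units
sudo journalctl -b -o cat -u greenboot-grub2-set-counter
sudo journalctl -b -o cat -u greenboot-healthcheck
sudo journalctl -b -o cat -u greenboot-grub2-set-success
```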
This looks like https://bugzilla.redhat.com/show_bug.cgi?id=2185901 ?
If I'm not mistaken, `rpm-ostree override remove <pkg>` will trigger `ostree-finalize-staged.service`, which will pull in `greenboot-grub2-set-counter.service` with `ExecStart=/usr/libexec/greenboot/greenboot-grub2-set-counter`.
It's possible that `rpm-ostree reset` also triggers `ostree-finalize-staged.service` again (I don't know whether it does). Either way, there is nothing telling grub to unset the `boot_counter` variable again in this case.
If `rpm-ostree reset` does trigger `ostree-finalize-staged.service` a second time, it might suffice to make the `greenboot-grub2-set-counter` script smarter here (e.g. by checking `rpm-ostree status`, somehow determining that the last action was a reset, and then unsetting the `boot_counter` variable).
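As a rough illustration only (not greenboot's actual logic): the check could look something like the snippet below, which inspects the staged deployment via `rpm-ostree status --json` and drops the counter when that deployment no longer carries any requested removals or layered packages. Whether that condition reliably identifies a reset, and the exact JSON field names, are assumptions.

```bash
#!/bin/bash
# Hypothetical addition to greenboot-grub2-set-counter: clear the counter when
# the staged deployment carries no mutations (e.g. after `rpm-ostree reset`).
set -euo pipefail

status_json=$(rpm-ostree status --json)

# Count requested base removals and layered packages on the staged deployment.
# Field names are taken from rpm-ostree's JSON output; treat them as an assumption.
mutations=$(jq '[.deployments[] | select(.staged == true) |
                 (.["requested-base-removals"] // []) + (.["requested-packages"] // [])] |
                flatten | length' <<< "$status_json")

if [ "${mutations}" -eq 0 ]; then
    echo "Staged deployment has no mutations; clearing boot_counter"
    grub2-editenv - unset boot_counter
fi
```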
sudo journalctl -o cat -u greenboot-grub2-set-success
Starting Mark boot as successful in grubenv...
Finished Mark boot as successful in grubenv.
sudo journalctl -o cat -u greenboot-grub2-set-counter
Starting Set grub2 boot counter in preparation of upgrade...
GRUB2 environment variables have been set for system upgrade. Max boot attempts is 3
Finished Set grub2 boot counter in preparation of upgrade.
greenboot-grub2-set-counter.service: Deactivated successfully.
Stopped Set grub2 boot counter in preparation of upgrade.
sudo journalctl -o cat -u greenboot-healthcheck
Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
No watchdog on the system, skipping check
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0
Version: 9.2
origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.
greenboot-healthcheck.service: Deactivated successfully.
Stopped greenboot Health Checks Runner.
greenboot-healthcheck.service: Consumed 57.582s CPU time.
Starting greenboot Health Checks Runner...
Running Required Health Check Scripts...
Running greenboot Required Health Check Scripts
Script '00_required_scripts_start.sh' SUCCESS
No domain names have been found
Script '01_repository_dns_check.sh' SUCCESS
Script '02_watchdog.sh' SUCCESS
Running Wanted Health Check Scripts...
Running greenboot Wanted Health Check Scripts
Script '00_wanted_scripts_start.sh' SUCCESS
Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Running Required Health Check Scripts...
STARTED
GRUB boot variables:
boot_success=0
boot_indeterminate=0
boot_counter=3
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
* rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.1
Version: 9.2
origin refspec: edge:rhel/9/aarch64/edge
rhel fb16b8ed2ac800af1fc949a2b6be9f0ba8eb9248a0927b69ffdd30e0342d379e.0 (rollback)
Version: 9.2
origin refspec: edge:rhel/9/aarch64/edge
Waiting 300s for MicroShift service to be active and not failed
Waiting 300s for MicroShift API health endpoints to be OK
Waiting 300s for any pods to be running
Waiting 300s for pod image(s) from the 'openshift-ovn-kubernetes' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-service-ca' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-ingress' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-dns' namespace to be downloaded
Waiting 300s for pod image(s) from the 'openshift-storage' namespace to be downloaded
Waiting 300s for pod image(s) from the 'kube-system' namespace to be downloaded
Waiting 300s for 2 pod(s) from the 'openshift-ovn-kubernetes' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-service-ca' namespace to be in 'Ready' state
Waiting 300s for 1 pod(s) from the 'openshift-ingress' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-dns' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'openshift-storage' namespace to be in 'Ready' state
Waiting 300s for 2 pod(s) from the 'kube-system' namespace to be in 'Ready' state
Checking pod restart count in the 'openshift-ovn-kubernetes' namespace
Checking pod restart count in the 'openshift-service-ca' namespace
Checking pod restart count in the 'openshift-ingress' namespace
Checking pod restart count in the 'openshift-dns' namespace
Checking pod restart count in the 'openshift-storage' namespace
Checking pod restart count in the 'kube-system' namespace
FINISHED
Script '40_microshift_running_check.sh' SUCCESS
Running Wanted Health Check Scripts...
Finished greenboot Health Checks Runner.
So greenboot needs a failure in a health check script before it takes any action (reboot/rollback). A script would need to be added, possibly under required.d/, that checks for the packages of interest and returns a failure in case of any discrepancy; a sketch follows below.
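A minimal sketch of such a check, assuming the usual greenboot layout (`/etc/greenboot/check/required.d/`); the script name and the package list are placeholders:

```bash
#!/bin/bash
# /etc/greenboot/check/required.d/50_required_packages_check.sh (hypothetical name)
# Fail the health check if any package we care about is missing from the deployment.
set -euo pipefail

REQUIRED_PACKAGES=(hostname)   # example list; adjust to the packages of interest

rc=0
for pkg in "${REQUIRED_PACKAGES[@]}"; do
    if ! rpm -q "$pkg" >/dev/null 2>&1; then
        echo "Required package '$pkg' is not installed"
        rc=1
    fi
done
exit $rc
```

Since the script lives in required.d/, a non-zero exit marks the boot as unhealthy and lets greenboot trigger its reboot/rollback handling.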
Issue: When an rpm-ostree command removes a package and an rpm-ostree reset (remove all mutations) is then performed, it affects the behavior of Greenboot. After a reboot is initiated, Greenboot does not perform the check, and the Greenboot boot_counter stays at the set default value.
Steps to reproduce:
Expected Result: Greenboot should notice a package is missing and attempt to restore the last known good state.
Actual Result: Greenboot does not attempt to fix itself. The Greenboot boot_counter remains set at the default value.
Additional notes (if any):
boot_counter=2
Greenboot variables:
GREENBOOT_WATCHDOG_CHECK_ENABLED=true
The ostree status:
A system reboot will clear the boot flag and restore the system to a known good state.
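For reference, the GRUB environment flags can also be inspected (and, if needed, cleared) by hand; a small sketch, assuming the default grubenv location:

```bash
# Show the current GRUB environment block, including boot_counter/boot_success
sudo grub2-editenv list

# Manually drop the counter if it was left behind
sudo grub2-editenv - unset boot_counter
```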