fedora-iot / greenboot

Generic Health Checking Framework for systemd
GNU Lesser General Public License v2.1

boot_counter not decrementing? #135

Closed vkrizan closed 3 months ago

vkrizan commented 8 months ago

Hello,

I'm failing to understand how greenboot prevents boot loops and how it decrements boot_counter on each failure. The only place I've found where the variable is decremented is in the static grub config: https://github.com/fedora-iot/greenboot/blob/332e5e317e762c992d73b44ee52f28f55e27325f/grub2/greenboot.cfg#L13

This seems to depend on bootupd. However, in the RPM packages in the regular Fedora (dist-git) and CentOS repositories, neither the dependency on bootupd nor the greenboot.cfg grub file can be found.

https://github.com/fedora-iot/greenboot/blob/332e5e317e762c992d73b44ee52f28f55e27325f/greenboot.spec#L22 https://github.com/fedora-iot/greenboot/blob/332e5e317e762c992d73b44ee52f28f55e27325f/greenboot.spec#L69

The greenboot-grub2-set-counter script is only called once without a parameter...

How should this work outside of Fedora IoT? Or what am I missing?

Thank you.

say-paul commented 8 months ago

~what is the version of greenboot ?~

> This seems to depend on bootupd. However, in the RPM packages in the regular Fedora (dist-git) and CentOS repositories, neither the dependency on bootupd nor the greenboot.cfg grub file can be found.

Greenboot needs a new release, as version 0.15.4 does not have the bootupd changes.

vkrizan commented 8 months ago

I was checking what the 0.15.4 included and the grub config seems to have been included. https://github.com/fedora-iot/greenboot/releases/tag/v0.15.4

Even before PR #129, the project advertised and/or had the max boots limit, which confused me a bit more. What is the dependency on bootupd?

say-paul commented 8 months ago

Previously, image builder used to put the grub2/greenboot.conf in place; PR #129 moved that out of image builder into greenboot itself. So my understanding is that a new release will solve the issue. I haven't tested with the patch yet, so I can do some tests and confirm.

vkrizan commented 8 months ago

Thank you.

I've tried to manually add the grub2/greenboot.conf to the right path and installed bootupd, but with no success. I'm not sure what else needs to be configured.

Do you happen to have the link to where image builder added that? Was that only for the Fedora IoT images?

Is my understanding correct that without bootupd the boot counting is not done (and bootupd is a dependency)? Was the counting (decrementing) moved to a grub config for convenience or some other advantage, rather than doing it in the systemd targets within a shell script?

say-paul commented 8 months ago

> Do you happen to have the link to where image builder added that? Was that only for the Fedora IoT images?

https://github.com/osbuild/osbuild/blob/b29aa5e6517e017f545de54819aa845fb026fd1e/stages/org.osbuild.grub2#L321 The stage then gets hooked up in the image builder pipeline, where greenboot is added as a default package.

say-paul commented 8 months ago

> Is my understanding correct that without bootupd the boot counting is not done (and bootupd is a dependency)? Was the counting (decrementing) moved to a grub config for convenience or some other advantage, rather than doing it in the systemd targets within a shell script?

Historically, greenboot relied on grub to decrement boot_counter. bootupd came later, and I have not tested with it yet.
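
For readers following along, the grub-side countdown mentioned above can be modelled in a few lines of shell. This is only a sketch: the function name next_boot_state is hypothetical, and the rules it encodes (decrement while boot_success is 0, fall back to the second menu entry once the counter is exhausted, clear boot_success on every boot) are an assumption about what the grub snippet does, not a copy of greenboot.cfg.

```shell
#!/bin/sh
# Model of one pass through a grub-side boot counter snippet.
# ASSUMPTION: this mirrors the general idea, not the exact greenboot.cfg.
# Usage:  next_boot_state <boot_counter or ""> <boot_success>
# Prints: boot_counter=<n> boot_success=0 default=<0|1>
next_boot_state() {
    counter=$1
    success=$2
    default=0

    # The countdown is only active when a counter exists and the
    # previous boot was not marked successful.
    if [ -n "$counter" ] && [ "$success" = "0" ]; then
        if [ "$counter" = "0" ] || [ "$counter" = "-1" ]; then
            # Countdown exhausted: select the rollback entry.
            default=1
            counter=-1
        else
            counter=$((counter - 1))
        fi
    fi

    # boot_success is assumed to be cleared on every boot; userspace
    # is expected to set it back to 1 once the health checks pass.
    printf 'boot_counter=%s boot_success=0 default=%s\n' "$counter" "$default"
}

next_boot_state 2 0   # boot_counter=1 boot_success=0 default=0
next_boot_state 1 0   # boot_counter=0 boot_success=0 default=0
next_boot_state 0 0   # boot_counter=-1 boot_success=0 default=1
```

If this model is right, nothing in userspace ever decrements the counter, which would explain why the variable never changes when grub does not load the snippet.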

vkrizan commented 8 months ago

Still no luck with the boot_counter variable. I've installed the current code base (332e5e317e762c992d73b44ee52f28f55e27325f) into a container image and added a check script that runs exit 1 to /etc/greenboot/check/required.d/. I'm not sure if greenboot.cfg is even loaded and run by Grub.

Here's my Containerfile:

FROM quay.io/centos-bootc/centos-bootc-cloud:stream9

RUN rpm-ostree install \
    https://download.copr.fedorainfracloud.org/results/vkrizan/greenboot/fedora-40-x86_64/07246945-greenboot/greenboot-0.15.4-1.fc40.x86_64.rpm \
    https://download.copr.fedorainfracloud.org/results/vkrizan/greenboot/fedora-40-x86_64/07246945-greenboot/greenboot-default-health-checks-0.15.4-1.fc40.x86_64.rpm \
    && systemctl enable greenboot-grub2-set-counter \
        greenboot-grub2-set-success.service greenboot-healthcheck.service \
        greenboot-loading-message.service greenboot-rpm-ostree-grub2-check-fallback.service \
        redboot-auto-reboot.service redboot-task-runner.service redboot.target \
    && ostree container commit

# Add the bad check: grub2-editenv list && exit 1
COPY --chmod=755 bad_check.sh /etc/greenboot/check/required.d/

Note that I'm using bootc switch to switch between the good image (without bad_check.sh) and the bad image. I've also set the counter (grub2-editenv - set boot_counter=2) and reset the boot success flag (grub2-editenv - set boot_success=0) before rebooting into the bad image. After manually selecting the second boot entry and inspecting the journal of the previous failed boots, I can clearly see that the variables are not changing.
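
To make that journal inspection easier to read, the two variables can be pulled out of the grub2-editenv list output. The sketch below assumes the usual key=value line format of that command; the helper names grubenv_get and grubenv_summary are made up for illustration and are not part of greenboot.

```shell
#!/bin/sh
# grubenv_get NAME: read key=value lines (as printed by
# `grub2-editenv list`) on stdin and print the value of NAME.
grubenv_get() {
    sed -n "s/^$1=//p"
}

# Hypothetical helper: print both greenboot-related variables on one line,
# or "unset" when a variable is missing from the environment block.
grubenv_summary() {
    env_text=$(cat)
    counter=$(printf '%s\n' "$env_text" | grubenv_get boot_counter)
    success=$(printf '%s\n' "$env_text" | grubenv_get boot_success)
    printf 'boot_counter=%s boot_success=%s\n' "${counter:-unset}" "${success:-unset}"
}

# On a real system: grub2-editenv list | grubenv_summary
# Sample data so the sketch runs anywhere:
printf 'saved_entry=abc\nboot_counter=2\nboot_success=0\n' | grubenv_summary
# boot_counter=2 boot_success=0
```

Calling the summary from the failing check script before its exit 1 would leave one comparable line per boot in the journal.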

@cgwalters would you happen to know what bootupd wizardry I am missing?

For clarity:

$ rpm -q bootupd
bootupd-202401222113.0.2.17.20.gc687978-1.el9.x86_64
$ rpm -ql greenboot | grep grub2-static
/usr/lib/bootupd/grub2-static/configs.d/greenboot.cfg

EDIT: The same applies for base quay.io/centos-bootc/fedora-bootc-cloud:eln

cgwalters commented 8 months ago

> RUN rpm-ostree install

(Unrelated, but is there any reason to use this rather than RUN dnf install?)

> I'm not sure if the greenboot.cfg is even loaded and run by Grub.

Hmm, to verify look at the final configuration in /boot/grub2/grub.cfg and see if it's being pulled in.
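
A quick way to do that check from a shell is to search the generated file for the variable name. This is a rough sketch: it assumes that any greenboot logic pulled into grub.cfg mentions boot_counter by name, and the function names are invented for the example.

```shell
#!/bin/sh
# has_boot_counter_logic FILE: succeed if FILE mentions boot_counter.
# ASSUMPTION: the greenboot snippet refers to the variable by this name.
has_boot_counter_logic() {
    grep -q 'boot_counter' "$1" 2>/dev/null
}

# report_grub_cfg [FILE]: print a one-line verdict for FILE,
# defaulting to the path mentioned in this thread.
report_grub_cfg() {
    cfg=${1:-/boot/grub2/grub.cfg}
    if has_boot_counter_logic "$cfg"; then
        echo "$cfg: boot_counter handling present"
    else
        echo "$cfg: no boot_counter handling found"
    fi
}

# Example (reading /boot may need root):
#   report_grub_cfg
#   report_grub_cfg /path/to/another/grub.cfg
```

An unreadable or missing file is reported the same way as a file without the snippet, so run it with enough privileges to read /boot.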

say-paul commented 8 months ago

@vkrizan can you check the systemd status of all greenboot services and see if any error is reported there? I think #136 needs to be resolved first.

vkrizan commented 7 months ago

> Hmm, to verify look at the final configuration in /boot/grub2/grub.cfg and see if it's being pulled in.

I do not see it being included; see https://pastebin.com/JLVsy5hw. I do not know when the grub config is generated, and bootupd does not have much documentation.

> can you check systemd status of all greenboot services and see if any error is reported there, I think https://github.com/fedora-iot/greenboot/issues/136 needs to be resolved first.

The greenboot services are all green when using the good image: https://pastebin.com/7JiQSZN5

/boot is mounted read-write:

$ mount | grep /boot
/dev/vda3 on /boot type ext4 (rw,relatime,seclabel)

Unless it has different conditions for the systemd units, this should not be an issue. And regardless of that, the boot_counter modifications are done by Grub.

> (Unrelated but any reason why this versus RUN dnf install ?)

My mistake. As I saw the use of the ostree commit, I stuck to ostree commands (I guess the commit is then not needed). Anyhow, I've changed it to use dnf, but with the recommended ostree container commit it fails with error: Found content in var, even after dnf clean all. It probably needs more cleanup, or no ostree commit.

vkrizan commented 7 months ago

Is the bootupd configuration expected to be already injected by the initial image that the system is first booted from? Could that be the expectation that is broken here, so that subsequent bootc update/switch operations have no impact on it? Note that I started with fedora-boot-cloud.qcow2 using bootc-playground.

cgwalters commented 7 months ago

Yes indeed, that's the reason; right now the bootloader state is not updated by bootc upgrade/switch.

That's what bootupctl update does; however, even that currently does not update the static grub configs.

If you haven't I'd recommend trying https://gitlab.com/bootc-org/podman-bootc-cli which streamlines creating VMs directly from a container, without starting from an existing disk image.

vkrizan commented 7 months ago

Thank you. I'll try that one out.

Is there a way to force bootupctl to update the static grub configs, or is manually editing grub.cfg the only choice at the moment?

vkrizan commented 7 months ago

@cgwalters podman-bootc run <imagename> helped with the bootloader. greenboot.cfg is included. However, the greenboot systemd units were all disabled, despite having them enabled on the container image.
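
For anyone hitting the same thing, a small loop over systemctl is-enabled shows which units need attention. This is a sketch only: the unit list is copied from the Containerfile earlier in this thread, and the SYSTEMCTL override and the list_not_enabled name are assumptions added so the function can be exercised without a running systemd.

```shell
#!/bin/sh
# Unit names copied from the Containerfile earlier in this thread.
GREENBOOT_UNITS="greenboot-grub2-set-counter
greenboot-grub2-set-success.service
greenboot-healthcheck.service
greenboot-loading-message.service
greenboot-rpm-ostree-grub2-check-fallback.service
redboot-auto-reboot.service
redboot-task-runner.service
redboot.target"

# SYSTEMCTL may be pointed at a stub for testing (hypothetical knob).
: "${SYSTEMCTL:=systemctl}"

# list_not_enabled: print every unit whose is-enabled state is not "enabled".
list_not_enabled() {
    for unit in $GREENBOOT_UNITS; do
        state=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null)
        [ "$state" = "enabled" ] || echo "$unit"
    done
}

# Example on the booted system:
#   list_not_enabled
#   list_not_enabled | xargs -r systemctl enable
```

An empty result means every listed unit reports enabled; anything printed is a candidate for systemctl enable.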

Once the greenboot units were enabled, the rollback (currently using ostree) went as expected. Hence I can conclude the boot_counter was being decremented.

cc @say-paul