fedora-iot / greenboot

Generic Health Checking Framework for systemd
GNU Lesser General Public License v2.1
101 stars 29 forks source link

Greenboot does not fail when first error is found #143

Open dhensel-rh opened 4 months ago

dhensel-rh commented 4 months ago

This issue is observed in MicroShift project. When multiple start scripts are on the system, and an error is encountered in the first script, Greenboot does not immediately fail. Using default settings, Greenboot checks all of them before rebooting (in 4.16, this is 3 scripts 40_microshift_running_check.sh, 41_microshift_running_check_multus.sh, 50_microshift_running_check_olm.sh). This can add delay to the system rolling back to a known good state. This might be intended design, but it can add time before a rollback can occur.

On the first boot, each script can take up to 5 minutes to check. That is ~15 minutes. On the second boot, this increases to 10 minutes per script. That is ~30 minutes. On the third boot, each script takes up to 15 minutes.

Can this be optimized in some way ?

nullr0ute commented 4 months ago

I think the microshift scripts are part of microshift, the bug would need to be reported there as those are the scripts that are taking the time

dhensel-rh commented 4 months ago

I opened a ticket wit hMicroShift and discussed this with Gregory Giguashvili previously. He felt this was a Greenboot feature enhancement.

'Looking at https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot#L69, the greenboot check continues the loop until all scripts are run and only checks for the error condition at https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot#L78.`

nullr0ute commented 4 months ago

Can you provide a link to the microshift ticket, the details provided to date don't provide that context. Also links to where the mentioned scripts can be found.

dhensel-rh commented 4 months ago

Oh yes. i was not sure if you could view the ticket. Apologies

https://issues.redhat.com/browse/USHIFT-3165

path to MicroShift greenboot files

nullr0ute commented 4 months ago

Seems like it should be a RFE to have an option to either 1) run all tests 2) fail at the first failure. To quote the full comment:

` Douglas Hensel , this is by design of the current greenboot implementation.

Looking at https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot#L69, the greenboot check continues the loop until all scripts are run and only checks for the error condition at https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot#L78.

May I suggest closing this JIRA issue and opening an upstream feature request at https://github.com/fedora-iot/greenboot/issues? `

nullr0ute commented 4 months ago

It also looks like microshift is randomly copying/forking things, it would be useful if those could be actually sent back upstream rather than forking as it also makes it hard to know if the bug is upstream or in your fork :)

https://github.com/openshift/microshift/commit/556a91a30dabffde1fe248ba01c85940bbdac928

dhensel-rh commented 4 months ago

@nullr0ute Do you need me to do anything to start an RFE ?

ggiguash commented 4 months ago

@nullr0ute , let me comment on the issues you're raising.

Seems like it should be a RFE to have an option to either 1) run all tests 2) fail at the first failure.

The comment you're referring to in the JIRA ticket was our best guess on the current greenboot implementation. From the experience we have, I cannot think of a use-case when I would want to continue running greenboot scripts following a failure. It just delays the inevitable. Nevertheless, if we want to keep the current default behavior, we should add an option to change it. Can we consider this issue as an RFE for the requested functionality?

microshift is randomly copying/forking things: https://github.com/openshift/microshift/commit/556a91a30dabffde1fe248ba01c85940bbdac928

Note that the commit you're referring to is not a copy / fork of any greenboot upstream functionality. It's a bug fix in MicroShift internal scripts. These scripts are implementing MicroShift-specific health-check functionality and they cannot be used upstream in the generic greenboot code.