bottlerocket-os / bottlerocket-update-operator

A Kubernetes operator for automated updates to Bottlerocket

Eliminate potential race conditions while rebooting #631

Closed cbgbt closed 5 months ago

cbgbt commented 5 months ago

Issue number: Closes #630

Description of changes: This PR is best reviewed in commit order, as it also performs several maintenance chores.

    agent: eliminate potential races while rebooting

    There were a few conditions under which the agent could enter the
    MonitoringUpdate state before it successfully uncordoned the node:

    * Brupop updates its state into Rebooted before the reboot terminates
      the process
    * After rebooting, an error occurs after updating the state but before
      performing the uncordon.

    We avoid these conditions by:
    * Exiting the agent process in cases where the desired state is Rebooted
      but we are not yet running the new version.
    * Always uncordoning the node before marking that we have successfully
      transitioned into the Rebooted state.
    * Defensively uncordoning the node once again when we enter the
      Monitoring state.
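
The ordering described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the agent's actual code: the `State` and `Action` enums, the `plan` function, and the `running_new_version` flag are all hypothetical names invented here, and the real agent talks to the Kubernetes API rather than returning a list of actions.

```rust
// Hypothetical sketch: none of these names come from the brupop source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Rebooted,
    MonitoringUpdate,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    // Stop the agent so it re-evaluates after the reboot actually happens.
    ExitProcess,
    // Make the node schedulable again.
    Uncordon,
    // Record that the node has reached the given state.
    RecordState(State),
}

// Returns the ordered actions for a desired state. The uncordon always
// precedes the state update, so an error between the two steps cannot
// leave a cordoned node marked as Rebooted or MonitoringUpdate.
fn plan(desired: State, running_new_version: bool) -> Vec<Action> {
    match desired {
        // The reboot has not taken effect yet: exit instead of recording
        // a state that is not yet true.
        State::Rebooted if !running_new_version => vec![Action::ExitProcess],
        State::Rebooted => vec![
            Action::Uncordon,
            Action::RecordState(State::Rebooted),
        ],
        // Defensive second uncordon when entering monitoring.
        State::MonitoringUpdate => vec![
            Action::Uncordon,
            Action::RecordState(State::MonitoringUpdate),
        ],
    }
}
```

Under these assumptions, `plan(State::Rebooted, false)` yields only `Action::ExitProcess`, which corresponds to the case where the state would otherwise be updated before the reboot terminates the process.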

Testing done: I built a custom Bottlerocket variant which always waits for a minute before performing a reboot, then used this to test these changes. In doing this, I was able to reliably trigger the race condition, and then also confirm that this new code does not misbehave when reboots occur slowly.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.