Description of changes:
This PR is best left in commit order, as it also performs several maintenance chores.
agent: eliminate potential races while rebooting
The agent had a few conditions under which it could enter the
MonitoringUpdate state before it had successfully uncordoned the node:
* Brupop updates its state into Rebooted before the reboot terminates
the process
* After rebooting, an error occurs after updating the state but before
performing the uncordon.
We avoid these conditions by:
* Exiting the agent process in cases where the desired state is Rebooted
but we are not yet running the new version.
* Always uncordoning the node before marking that we have successfully
transitioned into the Rebooted state.
* Defensively uncordoning the node once again when we enter the
MonitoringUpdate state.
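For reviewers, the ordering described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `State`, `Step`, and `plan` names are hypothetical, and the real agent has more states and talks to the Kubernetes API rather than returning a list of steps.

```rust
/// Hypothetical, simplified stand-ins for the two agent states discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Rebooted,
    MonitoringUpdate,
}

/// Hypothetical steps the agent takes while handling a desired state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    /// Exit the agent process so that it starts over after the reboot completes.
    Exit,
    /// Uncordon the node.
    Uncordon,
    /// Record that the node has reached the given state.
    Record(State),
}

/// Returns the ordered steps for a desired state.
/// `running_new_version` is an assumed flag: true once the node has booted the update.
fn plan(desired: State, running_new_version: bool) -> Vec<Step> {
    match desired {
        // The reboot has not taken effect yet: do not record anything, just exit.
        State::Rebooted if !running_new_version => vec![Step::Exit],
        // Uncordon first, so a failure here leaves the state unrecorded and retryable.
        State::Rebooted => vec![Step::Uncordon, Step::Record(State::Rebooted)],
        // Defensive second uncordon on entering the monitoring state.
        State::MonitoringUpdate => {
            vec![Step::Uncordon, Step::Record(State::MonitoringUpdate)]
        }
    }
}
```

The point of the sketch is that `Record` never appears before `Uncordon`, so an error between the two leaves the node in a state from which the uncordon is retried.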
Testing done:
I built a custom Bottlerocket variant which always waits for a minute before performing a reboot, then used this to test these changes. In doing this, I was able to reliably trigger the race condition, and then also confirm that this new code does not misbehave when reboots occur slowly.
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.
Issue number: Closes #630