JulesClaussen opened 3 weeks ago
Hi Jules, thanks for the suggestion.
This is something that we could consider. Brupop itself currently uses the Bottlerocket host’s API to determine what version to upgrade to, and when. If we wanted to implement this, we’d probably want some way to configure a Bottlerocket node via userdata to perform some kind of offset of a given update wave schedule.
I’ve opened an issue in the Bottlerocket repo to track that here.
I also think it’s worth considering an alternate approach. Because brupop uses the host’s API to determine what to do, you can control brupop’s behavior using the value of settings.updates.version-lock. I think a strategy which pushes new values through a service’s dev environment before promoting it to prod will provide stronger guarantees that any given version of Bottlerocket is tested and compatible with the version of your service that will be promoted to prod.
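As a sketch of that approach, the lock can be supplied on a Bottlerocket host via TOML userdata; the version string below is illustrative, so check the expected format for your variant against the Bottlerocket settings documentation:

```toml
# Bottlerocket userdata fragment (sketch): pin updates to a specific
# release so brupop will not move the node past it. The version value
# shown is an example; confirm the exact format against the docs.
[settings.updates]
version-lock = "v1.26.0"
```

The same value can also be changed on a running host with `apiclient set`, which is one way a pipeline could promote a version that has already been validated in dev.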
The right way to manage this value would depend on how deployments are orchestrated in a cluster, but we’ve previously discussed implementing a k8s operator which could be used to manage Bottlerocket settings values in a cluster. Here’s a separate issue with context in the brupop repo.
Interested to hear your thoughts.
I hadn't thought about settings.updates.version-lock; that would indeed work. It requires quite a bit of setup to push the new version through environments, so I don't think it's the best fit for our infrastructure, but it's a good option at a larger scale.
The k8s operator would indeed be quite nice; I have upvoted that issue as well.
You're right. Ultimately, the approach I suggested is bad practice overall. In the case of a security vulnerability, for example, we wouldn't want to wait a week to roll out, so we'd have to change the version manually. It doesn't make sense in terms of CI/CD, security, or good practices in general. I think the best approach would be, as you suggested, to pin the version and have automation software such as Renovate manage the version upgrades. That way we could validate in the dev environment and roll back easily.
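For reference, the delayed-adoption behavior I had in mind can be expressed in Renovate with `minimumReleaseAge` (formerly `stabilityDays`). A minimal sketch, assuming Bottlerocket releases are tracked via the GitHub releases datasource (the datasource and package name are assumptions that depend on how the version is actually tracked in a repo):

```json
{
  "packageRules": [
    {
      "description": "Wait 7 days after a Bottlerocket release before proposing it",
      "matchDatasources": ["github-releases"],
      "matchPackageNames": ["bottlerocket-os/bottlerocket"],
      "minimumReleaseAge": "7 days"
    }
  ]
}
```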
Thanks for your quick feedback!
Hello team,
Following this morning's incident with Bottlerocket image 1.26.0, we are thinking of ways to mitigate such issues on our end. One idea (similar to what can be configured in Renovate) is to wait for a set duration after a Bottlerocket release before considering it "safe". That would let us wait, for example, at least one week after an image is released before treating it as a valid update for our Bottlerocket images.
What do you think?
Thanks, Jules