Thanks @perplexa for opening this issue; we will take a look and see if we can figure out what is going on! Just to clarify: have you seen this before on older versions, or only on the version deployed via the v1.3.0 Helm chart?
According to our monitoring we have had brupop agent restarts before, but they became noticeably more frequent since I migrated from 1.0.0 to 1.3.0.
I have not looked into it myself but I assume it might relate to the cron expression (which I have set fairly aggressively) and the agent trying to operate on a shadow before it is actually available on a new node.
The graph shows restarts per pod across all brupop agent pods in the past 60 days. I switched to 1.3.0 around 2023-10-12.
I have a theory on what's going on here. As some background, the overall picture of how the Shadow is updated and how the agent becomes aware of changes is something like this: Brupop's apiserver is responsible for authorizing update requests by checking that the Shadow update is coming from the associated node. The agent uses the Kubernetes WATCH APIs as an efficient alternative to polling the shadow state; we do this because the controller can also change the BottlerocketShadow object.
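For illustration, here is a minimal kube-rs sketch (not brupop's actual code) of that watch/reflector pattern: the agent keeps an in-memory cache of BottlerocketShadow objects that a watch stream keeps current, so reads never have to hit the Kubernetes API directly. The BottlerocketShadow derive, the spec/status fields, and the namespace below are simplified stand-ins for the real CRD in brupop's models crate, and it assumes a recent kube-rs where watcher takes a watcher::Config.

```rust
use futures::TryStreamExt;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client, CustomResource,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Simplified stand-in for brupop's BottlerocketShadow CRD (v2).
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "brupop.bottlerocket.aws",
    version = "v2",
    kind = "BottlerocketShadow",
    namespaced,
    status = "BottlerocketShadowStatus"
)]
pub struct BottlerocketShadowSpec {
    pub state: String,
}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct BottlerocketShadowStatus {
    pub current_state: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let shadows: Api<BottlerocketShadow> =
        Api::namespaced(client, "brupop-bottlerocket-aws");

    // The reflector's store is the agent's local cache; the watch stream
    // keeps it up to date as the controller or apiserver mutates shadows.
    let (store, writer) = reflector::store::<BottlerocketShadow>();
    let stream = reflector(writer, watcher(shadows, watcher::Config::default()))
        .applied_objects();
    tokio::pin!(stream);

    while let Some(shadow) = stream.try_next().await? {
        // Changes only show up here once the watch event has propagated,
        // which is the lag the agent has to tolerate.
        println!(
            "cached shadow {:?}, status present: {}",
            shadow.metadata.name,
            shadow.status.is_some()
        );
        let _cached = store.state(); // reads hit the local cache, not the API
    }
    Ok(())
}
```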
Based on the logs, it looks like the agent is successfully completing an UPDATE call to the brupop apiserver, but then assuming too quickly that the change will be reflected by the local watch. When you get this message:
'Unable to get Bottlerocket node 'status' because of missing 'status' value'
it's the local copy of the BottlerocketShadow that the agent has in memory which is missing a status.
If this is true, we should be able to resolve it by ensuring the agent retries reading its local state in the reflector for some time if the status is missing.
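A hedged sketch of what that retry could look like (not the actual fix in the linked PR), reusing the simplified BottlerocketShadow types from the sketch above; the function name, attempt count, and delay are made up for illustration.

```rust
use std::time::Duration;

use kube::runtime::reflector::{ObjectRef, Store};
use tokio::time::sleep;

/// After a successful UPDATE call, give the local watch cache some time to
/// catch up instead of erroring out the moment `status` is missing.
async fn wait_for_status(
    store: &Store<BottlerocketShadow>,
    shadow_name: &str,
    namespace: &str,
    attempts: u32,
) -> Option<BottlerocketShadowStatus> {
    let key = ObjectRef::<BottlerocketShadow>::new(shadow_name).within(namespace);
    for _ in 0..attempts {
        // The reflector may not have received the post-UPDATE event yet, so a
        // missing object or missing status is retried rather than fatal.
        if let Some(shadow) = store.get(&key) {
            if let Some(status) = shadow.status.clone() {
                return Some(status);
            }
        }
        sleep(Duration::from_secs(2)).await;
    }
    None
}
```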
> I assume it might relate to the cron expression (which I have set fairly aggressively)
FWIW: The cron expressions cannot make brupop more aggressive than it is by default, so I wouldn't worry about this.
Might be unrelated, but there's also quite a lot of the below error in EKS's apiserver logs (on a new cluster & fresh brupop Helm install):
E1027 14:33:11.052419 11 cacher.go:470] cacher (bottlerocketshadows.brupop.bottlerocket.aws): unexpected ListAndWatch error: failed to list brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow: conversion webhook for brupop.bottlerocket.aws/v2, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority; reinitializing...
@perplexa I believe you are hitting https://github.com/bottlerocket-os/bottlerocket-update-operator/issues/486. Sorry about that.
We are looking to include a fix for this in the next Brupop release.
Pull request https://github.com/bottlerocket-os/bottlerocket-update-operator/pull/572 resolves this issue. The fix will be available as part of the next Brupop release.
awesome, thanks!
Hi,
We are running the brupop Helm chart v1.3.0 and see regular agent errors affecting newly spawned spot instances. The operator stabilizes after restarting a few times (usually 1-3) and then works as expected. I have attached the logs for more details.
Helm Values
Example Logs