Thanks @perplexa for opening this issue; we will take a look and see if we can figure out what is going on! Just to clarify: have you seen this before on older versions, or only on the version deployed via the v1.3.0 Helm chart?
According to our monitoring we have had brupop agent restarts before, but they became noticeably more frequent since I migrated from 1.0.0 to 1.3.0.
I have not looked into it myself but I assume it might relate to the cron expression (which I have set fairly aggressively) and the agent trying to operate on a shadow before it is actually available on a new node.
The graph shows restarts per pod across all brupop agent pods in the past 60 days. I switched to 1.3.0 around 2023-10-12.
I have a theory on what's going on here. As some background, the overall picture of how the Shadow is updated and how the agent becomes aware of changes is something like this: Brupop's apiserver is responsible for authorizing update requests by checking that the Shadow update is coming from the associated node. The agent uses the Kubernetes WATCH APIs as an efficient alternative to polling the shadow state; we do this because the controller can also change the BottlerocketShadow object.
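For illustration, here is a minimal kube-rs sketch (not brupop's actual code) of that watch/reflector pattern: the agent keeps an in-memory cache of BottlerocketShadow objects that a watch stream keeps current, so reads never have to hit the Kubernetes API directly. The BottlerocketShadow derive, the spec/status fields, and the namespace below are simplified stand-ins for the real CRD in brupop's models crate, and it assumes a recent kube-rs where watcher takes a watcher::Config.

```rust
use futures::TryStreamExt;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client, CustomResource,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Simplified stand-in for brupop's BottlerocketShadow CRD (v2).
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "brupop.bottlerocket.aws",
    version = "v2",
    kind = "BottlerocketShadow",
    namespaced,
    status = "BottlerocketShadowStatus"
)]
pub struct BottlerocketShadowSpec {
    pub state: String,
}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct BottlerocketShadowStatus {
    pub current_state: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let shadows: Api<BottlerocketShadow> =
        Api::namespaced(client, "brupop-bottlerocket-aws");

    // The reflector's store is the agent's local cache; the watch stream
    // keeps it up to date as the controller or apiserver mutates shadows.
    let (store, writer) = reflector::store::<BottlerocketShadow>();
    let stream = reflector(writer, watcher(shadows, watcher::Config::default()))
        .applied_objects();
    tokio::pin!(stream);

    while let Some(shadow) = stream.try_next().await? {
        // Changes only show up here once the watch event has propagated,
        // which is the lag the agent has to tolerate.
        println!(
            "cached shadow {:?}, status present: {}",
            shadow.metadata.name,
            shadow.status.is_some()
        );
        let _cached = store.state(); // reads hit the local cache, not the API
    }
    Ok(())
}
```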
Based on the logs, it looks like the agent is successfully completing an UPDATE call to the brupop apiserver, but then assuming too quickly that the change will be reflected by the local watch. When you get this message:
'Unable to get Bottlerocket node 'status' because of missing 'status' value'
it's the local copy of the BottlerocketShadow that the agent has in memory which is missing a status.
If this is true, we should be able to resolve it by ensuring the agent retries reading its local state in the reflector for some time if the status is missing.
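A hedged sketch of what that retry could look like (not the actual fix in the linked PR), reusing the simplified BottlerocketShadow types from the sketch above; the function name, attempt count, and delay are made up for illustration.

```rust
use std::time::Duration;

use kube::runtime::reflector::{ObjectRef, Store};
use tokio::time::sleep;

/// After a successful UPDATE call, give the local watch cache some time to
/// catch up instead of erroring out the moment `status` is missing.
async fn wait_for_status(
    store: &Store<BottlerocketShadow>,
    shadow_name: &str,
    namespace: &str,
    attempts: u32,
) -> Option<BottlerocketShadowStatus> {
    let key = ObjectRef::<BottlerocketShadow>::new(shadow_name).within(namespace);
    for _ in 0..attempts {
        // The reflector may not have received the post-UPDATE event yet, so a
        // missing object or missing status is retried rather than fatal.
        if let Some(shadow) = store.get(&key) {
            if let Some(status) = shadow.status.clone() {
                return Some(status);
            }
        }
        sleep(Duration::from_secs(2)).await;
    }
    None
}
```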
> I assume it might relate to the cron expression (which I have set fairly aggressively)
FWIW: The cron expressions cannot make brupop more aggressive than it is by default, so I wouldn't worry about this.
Might be unrelated, but there's also quite a lot of the below error in EKS's apiserver logs (on a new cluster & fresh brupop Helm install):
E1027 14:33:11.052419 11 cacher.go:470] cacher (bottlerocketshadows.brupop.bottlerocket.aws): unexpected ListAndWatch error: failed to list brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow: conversion webhook for brupop.bottlerocket.aws/v2, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority; reinitializing...
@perplexa I believe you are hitting https://github.com/bottlerocket-os/bottlerocket-update-operator/issues/486. Sorry about that.
We are looking to include a fix for this in the next Brupop release.
Pull request https://github.com/bottlerocket-os/bottlerocket-update-operator/pull/572 resolves this issue. The fix will be available as part of the next Brupop release.
awesome, thanks!
Hi,
We are running the brupop Helm chart v1.3.0 and see regular agent errors affecting newly spawned spot instances. The operator stabilizes after restarting a few times (usually 1-3) and then works as expected. I have attached the logs for more details.
Helm Values
Example Logs