bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

1.21.0 Status 500 when PATCHing /settings?tx=bottlerocket-launch #4135

Closed EthanKane-FD closed 2 months ago

EthanKane-FD commented 2 months ago

Hey there, we noticed an issue today with the latest version of Bottlerocket. Any help would be greatly appreciated. Our new builds picked up the latest version and our nodes are failing to boot.

Image I'm using:

Bottlerocket OS 1.21.0

What I expected to happen:

         Starting Bottlerocket userdata configuration system...

[  OK  ] Finished Bottlerocket userdata configuration system.

What actually happened:

The Bottlerocket AMI updated last night to Bottlerocket OS 1.21.0 (aws-k8s-1.30), and the Bottlerocket userdata configuration is now failing.

Seeing the following in the system logs:

         Starting Bottlerocket userdata configuration system...

[    3.428743] early-boot-config[1329]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer
[FAILED] Failed to start Bottlerocket userdata configuration system.

See 'systemctl status early-boot-config.service' for details.

[DEPEND] Dependency failed for Bottlerocket initial configuration complete.

[DEPEND] Dependency failed for Isolates configured.target.

[DEPEND] Dependency failed for Applies settings to create config files.

[DEPEND] Dependency failed for Send signal to CloudFormation Stack.

[DEPEND] Dependency failed for Sets the hostname.

[DEPEND] Dependency failed for User-specified setting generators.

[DEPEND] Dependency failed for Generate additional settings for Kubernetes.

How to reproduce the problem:

Upgrade from 1.20.5

patkinson01 commented 2 months ago

We've just seen issues on some of our clusters trying to update to 1.21.0 too - seems a similar issue so pasting here - but if not let me know and I'll raise a separate ticket:

    Starting Generate additional settings for Kubernetes...

[ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18
[FAILED] Failed to start Generate additional settings for Kubernetes.

See 'systemctl status pluto.service' for details.

[DEPEND] Dependency failed for Applies settings to create config files.

[DEPEND] Dependency failed for Sets the hostname.

[DEPEND] Dependency failed for Send signal to CloudFormation Stack.

[DEPEND] Dependency failed for Bottlerocket initial configuration complete.

[DEPEND] Dependency failed for Isolates configured.target.

ramseymcgrathfd commented 2 months ago

Example launch template to reproduce

"image-gc-high-threshold-percent" = "${config.image_gc_high_threshold_percent}"
"image-gc-low-threshold-percent"  = "${config.image_gc_low_threshold_percent}"
"eviction-max-pod-grace-period"   = "${config.max_pod_grace_period}"

[settings.kubernetes.node-labels]
%{ for label_key, label_value in config.labels }
"${label_key}" = "${label_value}"
%{ endfor ~}

[settings.kubernetes.node-taints]
%{ for taint_key, taint_value in config.taints }
"${taint_key}" = "${taint_value}"
%{ endfor ~}

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]

[settings.kubernetes.eviction-hard]
%{ for key, value in config.eviction_hard_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.eviction-soft]
%{ for key, value in config.eviction_soft_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.eviction-soft-grace-period]
%{ for key, value in config.soft_grace_period_values }
"${key}" = "${value}"
%{ endfor ~}

[settings.kubernetes.system-reserved]
cpu = "${config.system_reserved_cpu}"
memory = "${config.system_reserved_memory}"
ephemeral-storage = "${config.system_reserved_ephemeral}"

[settings.metrics]
# whether or not health metrics will be sent. set to false to opt-out
send-metrics = false

# Use local aws time server
[settings.ntp]
time-servers = ["169.254.169.123"]

# The admin host container provides SSH access and runs with "superpowers".
# It is disabled by default, but can be disabled explicitly.
[settings.host-containers.admin]
enabled = false

# The control host container provides out-of-band access via SSM.
# It is enabled by default, and can be disabled if you do not
# expect to use SSM. This could leave you with no way to access
# the API and change settings on an existing node!
[settings.host-containers.control]
enabled = true

yeazelm commented 2 months ago

Thank you @EthanKane-FD, @ramseymcgrathfd, and @patkinson01 for reporting this! We are looking at this now and will provide an update as soon as possible.

yeazelm commented 2 months ago

For folks who have seen this issue: if you can include the userdata to reproduce, similar to @ramseymcgrathfd, that would help a ton. If you don't want to post it to GitHub, you can open an AWS Support case and provide it there; that would help too.

EthanKane-FD commented 2 months ago

Hey @yeazelm, thanks for checking. @ramseymcgrathfd and I are on the same team, so that's our user data config.

patkinson01 commented 2 months ago

Hi @yeazelm, please find below our userdata:

[settings.network]
no-proxy = ${no_proxy}
https-proxy = "${http_proxy}" # Squid Proxy with access to only specific approved domains

[[settings.container-registry.credentials]]
registry = "${repo_url}" # Internal repo where we pull all images from (except for some managed addons which need to come from AWS ECR repos)
username = "${repo_username}"
password = "${repo_api_key}"

[settings.kernel.sysctl]
"user.max_user_namespaces" = "0"
"vm.max_map_count" = "262144"
"net.ipv4.conf.all.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.default.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.all.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.default.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.all.log_martians" = "1" #cis hardening 3.2.4
"net.ipv4.conf.default.log_martians" = "1" #cis hardening 3.2.4

[settings.kubernetes.node-labels]
"bottlerocket.aws/updater-interface-version" = "2.0.0" # Configure the node-labels Bottlerocket setting to enable BruPop updates

[settings.bootstrap-containers.bottle]
source = "${repo_url}/${bottle_rocket_repo_name}/${bottle_rocket_image_name}:${bottle_rocket_image_version}"
mode = "once"
user-data = "${user_data}" #base64 encoded set of values used in our bottlerocket bootstrap image to configure Vault access and proxy
essential = true

[settings.updates]
ignore-waves = ${bottle_rocket_update_immediately}
seed = ${bottle_rocket_seed}

[settings.kubernetes]
api-server = "${cluster_endpoint}"
cluster-certificate = "${cluster_ca_base64}"
cluster-name = "${eks_cluster_id}"

ytsssun commented 2 months ago

@ramseymcgrathfd Do you by any chance have the rendered userdata? I tried applying some values to the template but could not reproduce the failure. Here is my userdata:

[settings.kubernetes]
"image-gc-high-threshold-percent" = 90
"image-gc-low-threshold-percent"  = 80
"eviction-max-pod-grace-period"   = 40

[settings.kubernetes.node-labels]
"name" = "my-node"

[settings.kubernetes.node-taints]
special = ["true:NoSchedule"]

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]

[settings.kubernetes.eviction-hard]
"memory.available" = "15%"

[settings.kubernetes.eviction-soft]
"memory.available" = "12%"

[settings.kubernetes.eviction-soft-grace-period]
"memory.available" = "30s"

[settings.kubernetes.system-reserved]
cpu = "10m"
ephemeral-storage = "1Gi"
memory = "100Mi"

[settings.metrics]
# whether or not health metrics will be sent. set to false to opt-out
send-metrics = false

# Use local aws time server
[settings.ntp]
time-servers = ["169.254.169.123"]

# The admin host container provides SSH access and runs with "superpowers".
# It is disabled by default, but can be disabled explicitly.
[settings.host-containers.admin]
enabled = false

# The control host container provides out-of-band access via SSM.
# It is enabled by default, and can be disabled if you do not
# expect to use SSM. This could leave you with no way to access
# the API and change settings on an existing node!
[settings.host-containers.control]
enabled = true

I was able to upgrade from v1.20.0 to v1.21.0, starting from variant bottlerocket-aws-k8s-1.30-x86_64-v1.20.0.

[ssm-user@control]$ apiclient get os
{
  "os": {
    "arch": "x86_64",
    "build_id": "4d43022e",
    "pretty_name": "Bottlerocket OS 1.21.0 (aws-k8s-1.30)",
    "variant_id": "aws-k8s-1.30",
    "version_id": "1.21.0"
  }
}

ytsssun commented 2 months ago

I was able to reproduce the issue mentioned in https://github.com/bottlerocket-os/bottlerocket/issues/4135#issuecomment-2278246087

My userdata:

[settings.network]
no-proxy = ["localhost", "127.0.0.1"]

[settings.kernel.sysctl]
"user.max_user_namespaces" = "0"
"vm.max_map_count" = "262144"
"net.ipv4.conf.all.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.default.send_redirects" = "0" #cis hardening 3.1.1
"net.ipv4.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.all.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv6.conf.default.accept_redirects" = "0" #cis hardening 3.2.2
"net.ipv4.conf.all.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.default.secure_redirects" = "0" #cis hardening 3.2.3
"net.ipv4.conf.all.log_martians" = "1" #cis hardening 3.2.4
"net.ipv4.conf.default.log_martians" = "1" #cis hardening 3.2.4

[settings.kubernetes.node-labels]
"bottlerocket.aws/updater-interface-version" = "2.0.0" # Configure the node-labels Bottlerocket setting to enable BruPop updates

[settings.updates]
ignore-waves = true

The failure

[    3.741549] pluto[1484]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 15 column 18
[FAILED] Failed to start Generate additional settings for Kubernetes.

bcressey commented 2 months ago

> [ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18

This is happening because pluto only expects a String for no-proxy, when it should take a list.

patkinson01 commented 2 months ago

> [ 7.882539] pluto[1498]: Unable to retrieve cluster name and AWS region from Bottlerocket API: Deserialization of configuration file failed: invalid type: sequence, expected a string at line 16 column 18
>
> This is happening because pluto only expects a String for no-proxy, when it should take a list.

Hi @bcressey, we’ve seen the error during a BRUPOP-initiated update and haven’t made any changes to our userdata or no_proxy value, which is a string. Presumably this is something that changed in this latest AMI, then?

bcressey commented 2 months ago

> Hi @bcressey, we’ve seen the error during a BRUPOP-initiated update and haven’t made any changes to our userdata or no_proxy value, which is a string. Presumably this is something that changed in this latest AMI, then?

The bug is in the newer version of pluto in 1.21.0. If you have settings.network.no-proxy defined in your settings (it's not defined by default) then it would trigger this issue on upgrade. If you don't have that setting defined then there may be another pluto bug.

bcressey commented 2 months ago

> [    3.428743] early-boot-config[1329]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer

@sam-berning tracked this down to an issue with optional fields in the CredentialProvider structure. Omitting a field marked as optional will cause it to serialize to "null" which is then rejected by the datastore serializer.

bash-5.1# cat <<EOF > /local/user-data-defaults.toml
> [settings.kubernetes.credential-providers.ecr-credential-provider]
> enabled = true
> cache-duration = "30m"
> image-patterns = [
>   "*.dkr.ecr.*.amazonaws.com"
> ]
> EOF

bash-5.1# early-boot-config
[2024-08-09T17:52:21Z INFO  early_boot_config] early-boot-config started
[2024-08-09T17:52:21Z INFO  early_boot_config] Gathering user data providers
[2024-08-09T17:52:21Z INFO  early_boot_config] Provider '10-local-defaults': [2024-08-09T17:52:21Z INFO  early_boot_config_provider::provider] '/local/user-data-defaults.toml' exists, using it
[2024-08-09T17:52:21Z INFO  early_boot_config] Found user data via user data from /local/user-data-defaults.toml, sending to API
Error PATCHing '/settings?tx=bottlerocket-launch': Status 500 when PATCHing /settings?tx=bottlerocket-launch: Error serializing Settings: 'unit' not allowed by Serializer

Fully specifying the user data for the credential provider, by passing in a no-op environment variable, would avoid the issue:

[settings.kubernetes.credential-providers.ecr-credential-provider]
enabled = true
cache-duration = "30m"
image-patterns = [
  "*.dkr.ecr.*.amazonaws.com"
]
environment.foo = "bar"

ramseymcgrathfd commented 2 months ago

@bcressey yeah good catch, it does

reckon it'll need

    #[serde(skip_serializing_if = "Option::is_none")] 

sam-berning commented 2 months ago

> reckon it'll need
>
>     #[serde(skip_serializing_if = "Option::is_none")]

Yup, that's indeed the right fix. It should be addressed as of https://github.com/bottlerocket-os/bottlerocket-settings-sdk/pull/51. We've also updated the datastore serializer to handle null values correctly in https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/80, which should protect against this sort of bug moving forward.
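
For readers who want to see the mechanism in isolation, here is a toy model in Python. It is an assumption-laden sketch, not the settings SDK or the datastore serializer: `CredentialProvider` mirrors only the fields that appear in the user data in this thread, and `to_datastore` is a hypothetical serializer that rejects null values unless told to skip them, which is the role `skip_serializing_if = "Option::is_none"` plays in the Rust code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CredentialProvider:
    """Toy stand-in for the settings structure; fields follow the user data in this thread."""
    enabled: bool
    cache_duration: str
    image_patterns: list[str]
    # Optional field that the failing user data leaves out.
    environment: Optional[dict[str, str]] = None


def to_datastore(obj, skip_none: bool) -> dict:
    """Hypothetical serializer that refuses null values.

    With skip_none=True, unset optional fields are left out of the output,
    analogous to skip_serializing_if = "Option::is_none".
    """
    out = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if value is None:
            if skip_none:
                continue
            raise ValueError(f"null not allowed by serializer (field {f.name!r})")
        out[f.name] = value
    return out


provider = CredentialProvider(True, "30m", ["*.dkr.ecr.*.amazonaws.com"])

# Skipping unset optional fields: serialization succeeds.
print(to_datastore(provider, skip_none=True))

# Emitting the unset field as null: the serializer rejects it.
try:
    to_datastore(provider, skip_none=False)
except ValueError as err:
    print(f"failed: {err}")
```

In this model, setting `environment` to any value (as in the `environment.foo = "bar"` workaround above) also makes the non-skipping path succeed, because no null is ever produced.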

yeazelm commented 2 months ago

We have released 1.21.1, which should allow a clean upgrade from 1.20.5. Please let us know whether it solves your problem!

patkinson01 commented 2 months ago

All good, thanks for a quick turnaround!!

EthanKane-FD commented 2 months ago

Hey, thanks @yeazelm, we have rolled this out on a few lab clusters and everything seems to be in order. Thanks again.