(I'm adding to this issue to keep all information in one place.)
We have a similar issue right now since one of our clusters updated to Bottlerocket v1.20.
I can confirm that adding the missing module (in our case it was the `iptable_raw` module) stops cilium from complaining about missing modules, but it does not fix the DNS issues when a pod is on the same node as coredns.
Deploying a debug pod and resolving any DNS name on a node where a coredns pod lives results in "i/o timeouts" that can be seen in the `cilium-agent` logs IF you have a `CiliumClusterwideNetworkPolicy` in your cluster. After reading this issue I debugged this a little bit more.
It turns out that an important factor is having a `CiliumClusterwideNetworkPolicy`. It does not have to deny anything; it just has to exist in order to trigger the "i/o timeout". If that policy is deleted, internal DNS resolution to services works, but external DNS does not.
```yaml
apiVersion: v1
items:
- apiVersion: cilium.io/v2
  kind: CiliumClusterwideNetworkPolicy
  metadata:
    name: default
  spec:
    egress:
    - toEntities:
      - cluster
    - toEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          k8s-app: kube-dns
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
        rules:
          dns:
          - matchPattern: '*'
    - toCIDRSet:
      - cidr: 0.0.0.0/0
```
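For reference, a minimal way to reproduce the symptom looks like this (a sketch; the pod name, image, and node name are illustrative):

```shell
# Schedule a throwaway pod onto a node that also hosts a coredns replica,
# then resolve an internal and an external name from it.
kubectl run dns-debug --image=busybox:1.36 --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<node-with-coredns>"}}' \
  -- sleep 3600
kubectl exec dns-debug -- nslookup kubernetes.default.svc.cluster.local
kubectl exec dns-debug -- nslookup example.com   # times out while the policy exists
```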
Can anyone else confirm DNS issues with Bottlerocket v1.20 and cilium (pods on the same host as coredns can't resolve DNS)?
Thanks for opening the issue with Cilium. I've added more details there as well.
Was this configuration working for you in Bottlerocket v1.19.5? There was a change made some time ago which caused DNS issues when binding to 0.0.0.0:53 (documented here), and I am thinking that they could be related.
Also, Cilium requires the modules listed here to be loaded. Can you validate that they are all loaded? I have updated the issue description with commands to load all the required kernel modules.
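As a quick way to validate loaded modules from a node or a privileged pod (a sketch; the module list below is just a sample of the documented requirements):

```shell
# Print which of the sampled modules are currently loaded.
for m in iptable_raw iptable_mangle iptable_filter xt_socket xt_TPROXY; do
  grep -q "^$m " /proc/modules && echo "$m loaded" || echo "$m MISSING"
done
```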
@vigh-m our configuration works with BR 1.19.5; only the AMI is different. I just tested a deploy with your changed user_data: coredns won't start with the new configuration, complaining about i/o timeouts when trying to read via UDP from the cluster IP to the host IP. Will try with a newer cilium version.
The required modules were loaded in BR 1.19.5; in 1.20 only `iptable_raw` was missing. I added it via autoload and the node started. Tried today with your full module-load user_data, with the results reported above.
Next up: I will roll back to BR 1.19.5 (no user_data modification) -> update cilium -> update to BR 1.20, and report back.
(I saw over at the cilium issue tracker that someone mentioned they have the same issue with 1.15.3, but I will try anyway as our configuration might be different.)
@vigh-m today I investigated a bit further and followed your path first. I think you are on the right track here regarding modules that are missing or not loaded. Since I had no luck using your `settings.kernel.modules`, I spawned one Bottlerocket v1.19.5 instance and one v1.20.0 instance and compared their `/proc/config.gz` against each other, checking the loaded modules against https://docs.cilium.io/en/stable/operations/system_requirements/.
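A sketch of that comparison (file names are illustrative; run the dump on each instance):

```shell
# Dump the running kernel config on each instance, then diff the two dumps.
zcat /proc/config.gz > config-1.19.5.txt   # on the v1.19.5 node
zcat /proc/config.gz > config-1.20.0.txt   # on the v1.20.0 node
diff config-1.19.5.txt config-1.20.0.txt
```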
The relevant requirements for cilium are as follows:
```
###
# checking base_requirements ...
###
WARN: CONFIG_NET_CLS_BPF != y. Found m
WARN: CONFIG_NET_SCH_INGRESS != y. Found m
WARN: CONFIG_CRYPTO_USER_API_HASH != y. Found m
###
# checking itpables-based masq requirements ...
###
>>> passed all checks
###
# checking L7 and FQDN Policies requirements ...
###
>>> passed all checks
###
# checking ipsec requirements = y ...
###
>>> passed all checks
###
# checking ipsec requirements = m ...
###
WARN: CONFIG_INET_XFRM_MODE_TUNNEL != m. Found undefined
WARN: CONFIG_CRYPTO_AEAD != m. Found y
WARN: CONFIG_CRYPTO_AEAD2 != m. Found y
WARN: CONFIG_CRYPTO_SEQIV != m. Found y
WARN: CONFIG_CRYPTO_HMAC != m. Found y
WARN: CONFIG_CRYPTO_SHA256 != m. Found y
WARN: CONFIG_CRYPTO_AES != m. Found y
###
# checking bandwith_manager requirements ...
###
>>> passed all checks
```
This result is identical for v1.19.5 and v1.20.0, so I would rule out missing CONFIG settings here. The `lsmod` output from the two instances gave more promising results.
To reproduce: run `lsmod` (via `kubectl exec` or another method) on a Bottlerocket v1.19.5 node and on a v1.20.0 node, saving each output as `lsmod-<version>.txt`, then:
```shell
cat lsmod-1.19.txt | awk '{ printf("%20s %s %s\n", $1, $3, $4) }' | sed 1,1d > lsmod-1.19_diff.txt
cat lsmod-1.20.txt | awk '{ printf("%20s %s %s\n", $1, $3, $4) }' | sed 1,1d > lsmod-1.20_diff.txt
diff -u lsmod-1.19_diff.txt lsmod-1.20_diff.txt > lsmod_br-diff.txt
```
```diff
--- lsmod-1.19_diff.txt 2024-05-21 16:04:32.422565616 +0200
+++ lsmod-1.20_diff.txt 2024-05-21 16:04:39.150647799 +0200
@@ -1,22 +1,3 @@
- rpcsec_gss_krb5 0
- auth_rpcgss 1 rpcsec_gss_krb5
- nfsv4 0
- dns_resolver 1 nfsv4
- nfs 1 nfsv4
- lockd 1 nfs
- grace 1 lockd
- sunrpc 5 rpcsec_gss_krb5,auth_rpcgss,nfsv4,nfs,lockd
- fscache 1 nfs
-nf_conntrack_netlink 0
- nft_chain_nat 0
- nft_counter 0
- xt_addrtype 0
- nft_compat 0
- br_netfilter 0
- bridge 1 br_netfilter
- stp 1 bridge
- llc 2 bridge,stp
- tls 0
xt_TPROXY 2
nf_tproxy_ipv6 1 xt_TPROXY
nf_tproxy_ipv4 1 xt_TPROXY
@@ -24,29 +5,25 @@
xt_nat 2
xt_MASQUERADE 1
xt_CT 5
- xt_mark 19
- cls_bpf 17
- sch_ingress 9
+ xt_mark 17
+ iptable_raw 1
+ cls_bpf 19
+ sch_ingress 10
vxlan 0
ip6_udp_tunnel 1 vxlan
udp_tunnel 1 vxlan
xfrm_user 1
xfrm_algo 1 xfrm_user
veth 0
- xt_socket 1
- nf_socket_ipv4 1 xt_socket
- nf_socket_ipv6 1 xt_socket
- ip6table_raw 0
- iptable_raw 1
- nf_tables 3 nft_chain_nat,nft_counter,nft_compat
- nfnetlink 5 nf_conntrack_netlink,nft_compat,ip_set,nf_tables
+ nf_tables 0
+ nfnetlink 3 ip_set,nf_tables
ip6table_filter 1
ip6table_nat 1
iptable_nat 1
- nf_nat 5 nft_chain_nat,xt_nat,xt_MASQUERADE,ip6table_nat,iptable_nat
+ nf_nat 4 xt_nat,xt_MASQUERADE,ip6table_nat,iptable_nat
ip6table_mangle 1
xt_conntrack 1
- xt_comment 33
+ xt_comment 32
iptable_mangle 1
iptable_filter 1
drm 0
@@ -55,21 +32,21 @@
squashfs 2
lz4_decompress 1 squashfs
loop 4
- overlay 34
+ overlay 29
crc32_pclmul 0
- crc32c_intel 5
+ crc32c_intel 4
ghash_clmulni_intel 0
aesni_intel 0
- ena 0
crypto_simd 1 aesni_intel
+ ena 0
cryptd 2 ghash_clmulni_intel,crypto_simd
ptp 1 ena
pps_core 1 ptp
button 0
- sch_fq_codel 5
- nf_conntrack 6 nf_conntrack_netlink,xt_nat,xt_MASQUERADE,xt_CT,nf_nat,xt_conntrack
- nf_defrag_ipv6 3 xt_TPROXY,xt_socket,nf_conntrack
- nf_defrag_ipv4 3 xt_TPROXY,xt_socket,nf_conntrack
+ sch_fq_codel 3
+ nf_conntrack 5 xt_nat,xt_MASQUERADE,xt_CT,nf_nat,xt_conntrack
+ nf_defrag_ipv6 2 xt_TPROXY,nf_conntrack
+ nf_defrag_ipv4 2 xt_TPROXY,nf_conntrack
fuse 1
configfs 1
dmi_sysfs 0
```
My guess would be that the following modules have to be loaded as well, or network policies will not work. They did get loaded in Bottlerocket v1.19.5:

- `nf_conntrack_netlink`
- `br_netfilter`
- `bridge`
- `xt_addrtype`
- `xt_socket`
- `nf_socket_ipv4`
- `nf_socket_ipv6`
- `nf_tables` (in v1.19.5 it had `nft_chain_nat`, `nft_counter`, and `nft_compat` loaded, which are not showing up in Bottlerocket v1.20.0)

As mentioned at https://docs.cilium.io/en/stable/operations/system_requirements/#requirements-for-l7-and-fqdn-policies, `xt_socket` is a hard requirement for our setup.
@vigh-m though we are seeing different modules not loaded, I would concur with your initial finding regarding cilium not being able to load its modules.
I believe I do not need to update to a later version of cilium right now to further strengthen your findings. It seems pretty clear to me that the underlying issue is in fact modules not being loaded, thus breaking essential features of cilium.
Regards,
Hi @obirhuppertz,
Thank you so much for your deep dive on this. It seems pretty clear that the kernel modules not being loaded are the root cause of the issue. I don't think different versions of cilium will behave differently here.
The list of modules in the initial issue comment is non-exhaustive, which is why you are seeing the other failures. As we discover more such missing modules, that list is bound to get longer.
We're investigating options to build a version of `modprobe` that can be mounted by customer containers, allowing for another workaround besides passing in the right user-data. Hope to have more details on that soon.
Upgrading to 1.20 has broken Antrea with:

```
[2024-05-30T18:50:59Z INFO install_cni_chaining]: updating CNI conf file 10-aws.conflist -> 05-antrea.conflist
[2024-05-30T18:50:59Z INFO install_cni_chaining]: CNI conf file is already up-to-date
uid=0(root) gid=0(root) groups=0(root)
modprobe: ERROR: could not insert 'openvswitch': Operation not permitted
Failed to load the OVS kernel module from the container, try running 'modprobe openvswitch' on your Nodes
```
Will investigate adding this module to user-data
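For reference, the user-data addition might look like the following (a sketch based on Bottlerocket's `kernel.modules.<name>.autoload` setting; untested with Antrea so far):

```toml
# Autoload the OVS kernel module at boot so the Antrea container
# does not have to insert it itself.
[settings.kernel.modules.openvswitch]
autoload = true
```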
Reopening, since most of these changes have been reverted because of an issue found during testing with Alpine Linux, where the injected mount overwrote the `busybox` binary and rendered the container useless.
Based on a survey of container images, it seems like `/usr/local/sbin/modprobe` might be a safer path.
The Bottlerocket 1.20.1 release contains a new version of `kmod`, built statically so that customer containers can mount it. This ensures that the OS-provided kernel modules can be loaded by the container without compatibility issues. Here is a sample Kubernetes spec which enables this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: <my-image>
    securityContext:
      capabilities:
        add: ["SYS_MODULE"]
    volumeMounts:
    - name: kmod-static
      mountPath: /usr/local/sbin/modprobe
      readOnly: true
    - name: kernel-modules
      mountPath: /lib/modules/
      readOnly: true
  volumes:
  - name: kmod-static
    hostPath:
      path: /usr/bin/kmod
      type: File
  - name: kernel-modules
    hostPath:
      path: /lib/modules/
```
This should allow containers to be launched with the correct packages and modules to work with Cilium and any other such workloads.
This is a temporary workaround to enable existing workloads. A more permanent fix is in the works.
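To sanity-check the mount from inside such a pod, something like the following should work (pod name taken from the spec above):

```shell
# The mounted modprobe is the host's statically linked kmod binary.
kubectl exec my-pod -- /usr/local/sbin/modprobe --version
```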
@vigh-m Thanks for the update! I can confirm that adding the following lines to the `values.yaml` results in a working cilium again, meaning the workaround fixes the cilium issue for us with bottlerocket-v1.20.1.
```yaml
extraVolumeMounts:
- name: kmod-static
  mountPath: /usr/local/sbin/modprobe
  readOnly: true
- name: kernel-modules
  mountPath: /lib/modules/
  readOnly: true
extraVolumes:
- name: kmod-static
  hostPath:
    path: /usr/bin/kmod
    type: File
- name: kernel-modules
  hostPath:
    path: /lib/modules/
```
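Applying it looks like the usual chart upgrade (release name and namespace are assumptions):

```shell
helm upgrade cilium cilium/cilium --namespace kube-system -f values.yaml
```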
Thank you @vigh-m. This fix resolves the issue only partially for us: we still see a 5% request failure rate with the same timeout error.
Some smaller bottlerocket-based nodes fail to load the module on Cilium startup:
```
$ kubectl logs -n cilium cilium-sf6j6 | ag module
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
time="2024-06-07T11:24:29Z" level=warning msg="iptables modules could not be initialized. It probably means that iptables is not available on this system" error="could not load module iptable_raw: exit status 1" subsys=iptables
```
So the fix for us would be to use user data to load the required modules reliably.
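In user-data form that might look like this (a sketch; `iptable_raw` taken from the log above):

```toml
[settings.kernel.modules.iptable_raw]
autoload = true
```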
Hi @project-administrator.
Can you share some more details about your setup? Things like the cilium version, how many t3.medium instances are showing this issue, the kubernetes version, and anything else you think is relevant.
On my end, I ran a test on 1000 t3.medium instances and ran the following DaemonSet on every node (BR version 1.20.1, k8s 1.29), but wasn't able to reproduce the error you are seeing. From my testing, the cilium containers are able to load modules as expected:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-cilium-test
spec:
  selector:
    matchLabels:
      app: cilium
  template:
    metadata:
      labels:
        app: cilium
    spec:
      containers:
      - name: cilium-container
        image: cilium/cilium:stable
        command: ["/bin/sh", "-c"]
        args:
        - echo 'Starting cilium/cilium:stable Test';echo '---------------------------------------';if
          test -f /usr/local/sbin/modprobe; then echo 'Mounted correctly'; else exit 1;
          fi;if ! grep -E '^lru_cache' /proc/modules; then echo 'Adding a module'; modprobe
          lru_cache; grep -E '^lru_cache' /proc/modules && echo 'Successfully added a
          module'; fi;if grep -E '^lru_cache' /proc/modules; then echo 'Removing a module';
          modprobe -r lru_cache; grep -E '^lru_cache' /proc/modules || echo 'Successfully
          removed a module'; fi;exit 0
        securityContext:
          capabilities:
            add: ["SYS_MODULE"]
        volumeMounts:
        - name: kernel-modules
          mountPath: /lib/modules/
          readOnly: true
        - name: kmod-static
          mountPath: /usr/local/sbin/modprobe
          readOnly: true
      volumes:
      - hostPath:
          path: /lib/modules/
        name: kernel-modules
      - name: kmod-static
        hostPath:
          path: /usr/bin/kmod
          type: File
```
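To run the same check, apply the DaemonSet and inspect the pod logs (file name assumed):

```shell
kubectl apply -f my-cilium-test.yaml
kubectl logs -l app=cilium --tail=20   # expect 'Successfully added a module'
```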
@vigh-m We have an issue loading a Cilium module only on these instances: t2.medium / t3.medium. We have only two of these instances running under the EKS node group; they are dedicated to running only karpenter, and both exhibited the same behavior. We tested this on only one environment.
Here is the instance configuration:
```
$ cilium -n cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: 1 warnings
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 3, Ready: 3/3, Available: 3/3
DaemonSet cilium Desired: 15, Ready: 15/15, Available: 15/15
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium-operator Running: 3
hubble-ui Running: 1
hubble-relay Running: 1
cilium Running: 15
Cluster Pods: 173/176 managed by Cilium
Helm chart version: 1.15.5
Image versions hubble-relay quay.io/cilium/hubble-relay:v1.15.5@sha256:1d24b24e3477ccf9b5ad081827db635419c136a2bd84a3e60f37b26a38dd0781: 1
cilium quay.io/cilium/cilium:v1.15.5@sha256:4ce1666a73815101ec9a4d360af6c5b7f1193ab00d89b7124f8505dee147ca40: 15
cilium-operator quay.io/cilium/operator-aws:v1.15.5@sha256:f9c0eaea023ce5a75b3ed1fc4b783f390c5a3c7dc1507a2dc4dbc667b80d1bd9: 3
hubble-ui quay.io/cilium/hubble-ui:v0.13.0@sha256:7d663dc16538dd6e29061abd1047013a645e6e69c115e008bee9ea9fef9a6666: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.13.0@sha256:1e7657d997c5a48253bb8dc91ecee75b63018d16ff5e5797e5af367336bc8803: 1
Warnings: cilium cilium-q9b62 1 endpoints are not ready
$ k version
...
Server Version: v1.29.4-eks-036c24b
```
```
$ helm get values -n cilium cilium
USER-SUPPLIED VALUES:
USER-SUPPLIED VALUES: null
egressMasqueradeInterfaces: eth+
eni:
  awsReleaseExcessIPs: true
  enabled: true
extraVolumeMounts:
- mountPath: /usr/local/sbin/modprobe
  name: kmod-static
  readOnly: true
- mountPath: /lib/modules/
  name: kernel-modules
  readOnly: true
extraVolumes:
- hostPath:
    path: /usr/bin/kmod
  name: kmod-static
  type: File
- hostPath:
    path: /lib/modules/
  name: kernel-modules
gatewayAPI:
  enabled: true
hostServices:
  enabled: true
  protocols: tcp
hubble:
  enabled: true
  listenAddress: :4244
  metrics:
    enabled:
    - dns
    - drop
    - tcp
    - flow
    - port-distribution
    - icmp
    - http
  relay:
    enabled: true
    resources:
      limits:
        cpu: 100m
        memory: 48Mi
      requests:
        cpu: 50m
        memory: 24Mi
    rollOutPods: true
  ui:
    backend:
      resources:
        limits:
          cpu: 100m
          memory: 256Mi
        requests:
          cpu: 50m
          memory: 128Mi
    enabled: true
    frontend:
      resources:
        limits:
          cpu: 100m
          memory: 256Mi
        requests:
          cpu: 50m
          memory: 128Mi
    proxy:
      resources:
        limits:
          cpu: 100m
          memory: 256Mi
        requests:
          cpu: 50m
          memory: 128Mi
ingressController:
  enabled: true
  loadbalancerMode: dedicated
ipam:
  mode: eni
kubeProxyReplacement: partial
loadBalancer:
  l7:
    backend: envoy
nodePort:
  enabled: true
nodeinit:
  enabled: false
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
  restartPods: true
operator:
  nodeSelector:
    node.k8s.org/role: core
  prometheus:
    enabled: true
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 64Mi
  rollOutPods: true
policyEnforcementMode: always
prometheus:
  enabled: true
relay:
  nodeSelector:
    node.k8s.org/role: core
  resources:
    requests:
      cpu: 100m
      memory: 192Mi
rollOutCiliumPods: true
tunnel: disabled
ui:
  nodeSelector:
    node.k8s.org/role: core
```
The nodes have these pods running:
```
$ k describe no my-node
...
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cilium cilium-wpgrt 100m (6%) 0 (0%) 192Mi (7%) 0 (0%) 4d18h
kube-system kube-proxy-hh6nk 100m (6%) 0 (0%) 0 (0%) 0 (0%) 4d18h
monitoring prometheus-prometheus-node-exporter-q27ll 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d18h
operators ebs-csi-node-jgls7 30m (1%) 0 (0%) 120Mi (4%) 768Mi (29%) 4d18h
operators efs-csi-node-ls8bz 100m (6%) 0 (0%) 128Mi (4%) 0 (0%) 4d18h
operators karpenter-657c4785d6-2d588 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d17h
```
And they have a taint as well:
```
$ k get node -o yaml my-node
...
spec:
  taints:
  - effect: NoSchedule
    key: karpenter
```
And here is the userdata that we use for these nodes:
```
$ cat bottlerocket-userdata.tpl
# https://github.com/bottlerocket-os/bottlerocket/blob/develop/README.md#description-of-settings
[settings.kubernetes]
cluster-name = "${cluster_name}"
api-server = "${cluster_endpoint}"
cluster-certificate = "${cluster_ca_data}"
# Hardening based on https://github.com/bottlerocket-os/bottlerocket/blob/develop/SECURITY_GUIDANCE.md
# Enable kernel lockdown in "integrity" mode.
# This prevents modifications to the running kernel, even by privileged users.
[settings.kernel]
lockdown = "integrity"
# The admin host container provides SSH access and runs with "superpowers".
# It is disabled by default, but can be enabled explicitly.
[settings.host-containers.admin]
enabled = ${enable_admin_container}
superpowered = true
# The control host container provides out-of-band access via SSM.
# It is enabled by default, and can be disabled if you do not expect to use SSM.
# This could leave you with no way to access the API and change settings on an existing node!
[settings.host-containers.control]
enabled = ${enable_control_container}
superpowered = true
# Eviction settings
[settings.kubernetes.eviction-hard]
"memory.available" = "200Mi"
# Reservation settings
[settings.kubernetes.system-reserved]
cpu = "200m"
memory = "500Mi"
ephemeral-storage= "1Gi"
# Kubelet reservation
[settings.kubernetes.kube-reserved]
cpu = "200m"
memory = "550Mi"
ephemeral-storage= "1Gi"
# Strip unneeded kernel modules
[settings.kernel.modules.sctp]
allowed = false
[settings.kernel.modules.udf]
allowed = false
[settings.kernel.modules.rds]
allowed = false
[settings.kernel.modules.dccp]
allowed = false
[settings.kernel.modules.jffs2]
allowed = false
[settings.kernel.modules.cramfs]
allowed = false
[settings.kernel.modules.freevxfs]
allowed = false
[settings.kernel.modules.hfs]
allowed = false
[settings.kernel.modules.hfsplus]
allowed = false
[settings.kernel.modules.squashfs]
allowed = false
[settings.kernel.modules.vfat]
allowed = false
[settings.kernel.modules.usb-storage]
allowed = false
[settings.kernel.modules.tipc]
allowed = false
# Sysctl tuning
[settings.kernel.sysctl]
"vm.swappiness" = "0"
"net.ipv4.tcp_syncookies" = "1"
"net.ipv4.ip_local_port_range" = "1024 65535"
"net.ipv4.tcp_tw_recycle" = "1"
"net.ipv4.tcp_fin_timeout" = "15"
"net.core.somaxconn" = "4096"
"net.core.netdev_max_backlog" = "4096"
"net.core.rmem_max" = "16777216"
"net.core.wmem_max" = "16777216"
"net.ipv4.tcp_max_syn_backlog" = "20480"
"net.ipv4.tcp_no_metrics_save" = "1"
"net.ipv4.tcp_rmem" = "4096 87380 16777216"
"net.ipv4.tcp_syn_retries" = "2"
"net.ipv4.tcp_synack_retries" = "2"
"net.ipv4.tcp_wmem" = "4096 65536 16777216"
"net.ipv4.tcp_max_tw_buckets" = "1440000"
"net.core.default_qdisc" = "fq_codel"
"net.ipv4.tcp_congestion_control" = "bbr"
${more_options}
# Add some labels
[settings.kubernetes.node-labels]
${node_labels}
```
Hi @project-administrator,
I've been trying to reproduce the error you see but haven't had much success, even after using the same Bottlerocket configs you shared. I am using the Cilium Quick Installation method via the `cilium-cli`.
```
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium Desired: 2, Ready: 2/2, Available: 2/2
Containers: cilium Running: 2
hubble-relay Running: 1
cilium-operator Running: 1
Cluster Pods: 7/7 managed by Cilium
Helm chart version:
Image versions cilium quay.io/cilium/cilium:v1.15.6@sha256:6aa840986a3a9722cd967ef63248d675a87add7e1704740902d5d3162f0c0def: 2
hubble-relay quay.io/cilium/hubble-relay:v1.15.6@sha256:a0863dd70d081b273b87b9b7ce7e2d3f99171c2f5e202cd57bc6691e51283e0c: 1
cilium-operator quay.io/cilium/operator-aws:v1.15.6@sha256:9656d44ee69817d156cc7d3797f92de2e534dfb991610c79c00e097b4dedd620: 1
```
Are you able to log into the cilium-agent container via `kubectl exec -it cilium-tdl85 -c cilium-agent --namespace kube-system -- bash`?
Can you validate that you can see `modprobe` mounted in your container, like below:
```
root@ip:/home/cilium# cat /proc/mounts | grep modprobe
/dev/root /usr/local/sbin/modprobe ext4 ro,seclabel,relatime,stripe=1024 0 0
```
Also, the mounted modprobe should be a static executable:
```
root@ip:/home/cilium# ldd /usr/local/sbin/modprobe
        not a dynamic executable
```
The Bottlerocket v1.20.2 release contains changes to the default files mounted into containers. The change mounts the Bottlerocket-provided, statically linked `kmod` package at the `/usr/local/sbin/modprobe` path.
This should resolve issues with Cilium, Antrea, and similar workloads being unable to modify kernel modules, without requiring any Bottlerocket-specific configuration changes.
Image I'm using: All Bottlerocket OS 1.20.0 variants
What I expected to happen: Updating to Bottlerocket v1.20.0 should not have broken any existing container workflows.
What actually happened: Launching Bottlerocket v1.20.0 can result in unexpected errors like the `modprobe` failures shown in the comments above.
Root Cause: Following changes in Bottlerocket v1.20.0, we dropped support for `xz`-compressed kernel modules in favour of `gz` compression. We're seeing failures to launch certain containers which require kernel modules and have `modprobe` built without `libz` (gzip) support.
support.Known Affected systems:
Workaround: A workaround is to use user data on new launches, or `apiclient` on running hosts, to set up `kernel.modules.<name>.autoload` to load the required modules as described here. In the case of Cilium, the commands would look like the following when using the apiclient, or via `user-data.toml`.
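A minimal sketch of both forms (assuming `iptable_raw` is one of the modules Cilium needs; repeat per module):

```shell
apiclient set kernel.modules.iptable_raw.autoload=true
```

```toml
[settings.kernel.modules.iptable_raw]
autoload = true
```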