cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Installing Cilium on GKE v1.30.4: cp: cannot create regular file '/hostbin/cilium-mount': Read-only file system #35336

Open maelvls opened 1 month ago

maelvls commented 1 month ago

Version

equal or higher than v1.16.0 and lower than v1.17.0

What happened?

Dear Cilium community,

While following the quick installation guide for GKE at https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/, I got stuck with the following error on the cilium DaemonSet. More specifically, the mount-cgroup container shows:

cp: cannot create regular file '/hostbin/cilium-mount': Read-only file system

I double-checked: /opt/cni/bin's permissions are already 0755 on the nodes:

$ ls -al /opt/cni/bin/
total 85588
drwxr-xr-x 2 root root     4096 Aug 24 17:25 .

It seems like this isn't the right cni-path; it looks like GKE uses the CNI path /home/kubernetes/bin, as detailed in https://github.com/weaveworks/weave/issues/3466#issuecomment-443987066.

Workaround

To work around this issue, I've set the CNI path to /home/kubernetes/bin instead of /opt/cni/bin/, but I don't know if that's the correct way forward.

Solution:

cilium install --version 1.16.2 \
    --set cni.binPath=/home/kubernetes/bin

How can we reproduce the issue?

I created a GKE cluster with the recommended node taint with the following:

gcloud config set project jetstack-mael-valais
gcloud container clusters create test --zone=europe-west2-b
gcloud beta container node-pools update default-pool \
  --node-taints=node.cilium.io/agent-not-ready=true:NoExecute --cluster=test --zone europe-west2-b
gcloud container clusters get-credentials test --zone=europe-west2-b

(I forgot to taint the nodes on creation; I added the taint afterwards, as seen in the commands)

I'm using the Cilium CLI v0.16.19 installed with Homebrew. I followed the official command for installing Cilium:

cilium install --version 1.16.2 --set cluster.name=test

The DaemonSet shows as "failing":

$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             1 errors, 3 warnings
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 3, Unavailable: 3/3
DaemonSet              cilium-envoy       Desired: 3, Ready: 3/3, Available: 3/3
Deployment             cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:            cilium             Pending: 3
                       cilium-envoy       Running: 3
                       cilium-operator    Running: 2
Cluster Pods:          16/19 managed by Cilium
Helm chart version:    1.16.2
Image versions         cilium             quay.io/cilium/cilium:v1.16.2@sha256:4386a8580d8d86934908eea022b0523f812e6a542f30a86a47edd8bed90d51ea: 3
                       cilium-envoy       quay.io/cilium/cilium-envoy:v1.29.9-1726784081-a90146d13b4cd7d168d573396ccf2b3db5a3b047@sha256:9762041c3760de226a8b00cc12f27dacc28b7691ea926748f9b5c18862db503f: 3
                       cilium-operator    quay.io/cilium/operator-generic:v1.16.2@sha256:cccfd3b886d52cb132c06acca8ca559f0fce91a6bd99016219b1a81fdbc4813a: 2
Errors:                cilium             cilium          3 pods of DaemonSet cilium are not ready
Warnings:              cilium             cilium-82xlx    pod is pending
                       cilium             cilium-96d5z    pod is pending
                       cilium             cilium-gqltr    pod is pending

Cilium Version

$ cilium version --client
cilium-cli: v0.16.19 compiled with go1.23.2 on darwin/arm64
cilium image (default): v1.16.2
cilium image (stable): v1.16.2

Kernel Version

The VM's disk source is https://www.googleapis.com/compute/v1/projects/gke-node-images/global/images/gke-1304-gke1348000-cos-113-18244-151-27-c-pre (Google's Container-Optimized OS)

Linux gke-test-default-pool-fcd90760-5df2 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 24 16:19:44 UTC 2024 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Kubernetes Version

$ k version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.30.4-gke.1348000

Sysdump

cilium-sysdump-20241010-104102.zip

Anything else?

Background story: I'm investigating an issue with cert-manager's ACME with Gateway API integration in cert-manager 1.16: https://github.com/cert-manager/cert-manager/issues/7337.

joestringer commented 1 month ago

Thanks for the report @maelvls. Do you have full output from the cilium install command?

It looks like for some reason the CLI is not detecting that this environment is GKE. This code is intended to automatically set the Helm option when the CLI detects GKE, but evidently that is not occurring in this case:

https://github.com/cilium/cilium/blob/cb318104f34421d1c36a6e021c92e91e2b512d3a/cilium-cli/install/helm.go#L38
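
One way to investigate the detection hypothesis is to look at the cluster name in the active kubeconfig entry. Entries generated by gcloud container clusters get-credentials are named gke_<project>_<location>_<cluster>. The sketch below assumes the CLI's detection depends on that name, which this thread does not confirm. looks_like_gke is a hypothetical helper, not a Cilium CLI function.

```shell
#!/bin/sh
# Hypothetical check: succeeds when a kubeconfig cluster name carries the
# "gke_" prefix that gcloud uses for the entries it generates.
# ASSUMPTION: GKE detection depends on this name; the thread does not confirm it.
looks_like_gke() {
    case "$1" in
        gke_*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, looks_like_gke "$(kubectl config view --minify -o jsonpath='{.clusters[0].name}')" tests the cluster entry used by the current context.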