zaphod72 opened 1 month ago
The `gke-gcsfuse-sidecar` is a native sidecar container, which should be an init container with `restartPolicy: Always`. But for some reason, the `restartPolicy` is missing from the spec. There may be other webhooks configured on this cluster that removed it.
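For reference, a minimal sketch of what the correctly injected native sidecar should look like (the image tag is the one from this issue; `your-workload` and `your-image` are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  initContainers:
  # A native sidecar is an init container with restartPolicy: Always;
  # the kubelet keeps it running alongside the regular containers
  # instead of waiting for it to exit.
  - name: gke-gcsfuse-sidecar
    image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.4.2-gke.0
    restartPolicy: Always
  containers:
  - name: your-workload
    image: your-image
```

If a webhook strips `restartPolicy: Always`, the same container becomes an ordinary init container, and the Pod waits indefinitely for it to exit.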
Can you share the cluster ID with me? You can get it by running `gcloud container clusters describe <cluster-name> --location <cluster-location> | grep id:`. Thanks!
Thanks @songjiaxun
Cluster id: 049c60badca8467abfa1901253886a0e9c543c4b71d549439fb968273a2751e4
Checked the Pod creation audit log using the following query:

```
"vllm-service-fea6900001db78d227f60296bb6cc1ab7e1110e-deplo9fd8x"
"pods.create"
logName="projects/darren2-dev-d0d0/logs/cloudaudit.googleapis.com%2Factivity"
```
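If you prefer the CLI over the Logs Explorer, a roughly equivalent lookup (same free-text terms and `logName` filter; the project and Pod name are the ones from this issue) can be done with `gcloud logging read`:

```sh
# Free-text terms narrow to the Pod creation audit entry; --limit keeps output small.
gcloud logging read '"vllm-service-fea6900001db78d227f60296bb6cc1ab7e1110e-deplo9fd8x" "pods.create" logName="projects/darren2-dev-d0d0/logs/cloudaudit.googleapis.com%2Factivity"' \
  --project darren2-dev-d0d0 --limit 5
```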
I see the sidecar `gke-gcsfuse-sidecar` was modified by the knative webhook -- the webhook did remove the `restartPolicy: Always`:

```
patch.webhook.admission.k8s.io/round_1_index_5: "{"configuration":"istio-inject.webhook.crfa.internal.knative.dev","webhook":"istio-inject.webhook.crfa.internal.knative.dev","patch":[{"op":"remove","path":"/spec/initContainers/0/restartPolicy"}],"patchType":"JSONPatch"}"
```
We are seeing similar issues from other users. The cause is that some webhooks do not recognize the native sidecar feature, so the incompatible webhook removes `restartPolicy: Always` from the init sidecar container, which makes the sidecar block regular container initialization.
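To check whether a Pod was affected, you can inspect the injected init container directly; a minimal check (`<pod-name>` is a placeholder):

```sh
# Print each init container's name and restartPolicy.
# An affected Pod shows gke-gcsfuse-sidecar with an empty restartPolicy.
kubectl get pod <pod-name> \
  -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.restartPolicy}{"\n"}{end}'
```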
Workaround 1
A quick workaround is to add a new node pool using 1.28 nodes to the cluster. You can use the smallest node size and just add one node, then redeploy your workload. The webhook will inject the `gke-gcsfuse-sidecar` container as a regular container, so you don't need to change your workload spec. Note that this new node will be charged, unfortunately.

```sh
gcloud container --project "<your-project>" node-pools create "pool-dummy" --cluster "<your-cluster-name>" --location "<your-cluster-location>" --node-version "1.28" --machine-type "e2-micro" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "10" --num-nodes "1"
```
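After redeploying, one way to confirm the fallback took effect is to check that the sidecar is now listed under `containers` rather than `initContainers` (`<pod-name>` is a placeholder):

```sh
# After the fallback, gke-gcsfuse-sidecar should appear in the first list, not the second.
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'; echo
kubectl get pod <pod-name> -o jsonpath='{.spec.initContainers[*].name}'; echo
```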
Workaround 2
You can manually inject the `gke-gcsfuse-sidecar` container into your workload as a regular container, and also add three auxiliary volumes. Meanwhile, please remove the annotation `gke-gcsfuse/volumes: "true"`. Then re-deploy your workload.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
  annotations:
    # gke-gcsfuse/volumes: "true" <- remove this annotation
spec:
  containers:
  # add the gke-gcsfuse-sidecar BEFORE your workload container
  - args:
    - --v=5
    image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.4.2-gke.0@sha256:80c2a52aaa16ee7d9956a4e4afb7442893919300af84ae445ced32ac758c55ad
    imagePullPolicy: IfNotPresent
    name: gke-gcsfuse-sidecar
    resources:
      requests:
        cpu: 250m
        ephemeral-storage: 5Gi
        memory: 256Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsGroup: 65534
      runAsNonRoot: true
      runAsUser: 65534
      seccompProfile:
        type: RuntimeDefault
    volumeMounts:
    - mountPath: /gcsfuse-tmp
      name: gke-gcsfuse-tmp
    - mountPath: /gcsfuse-buffer
      name: gke-gcsfuse-buffer
    - mountPath: /gcsfuse-cache
      name: gke-gcsfuse-cache
  - name: your-workload
    ...
  volumes:
  # add following three volumes
  - emptyDir: {}
    name: gke-gcsfuse-tmp
  - emptyDir: {}
    name: gke-gcsfuse-buffer
  - emptyDir: {}
    name: gke-gcsfuse-cache
```
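After re-deploying, a quick sanity check is to tail the sidecar's logs; since it now runs as a regular container, `kubectl logs` can target it by name (`<pod-name>` is a placeholder):

```sh
# Confirms the manually injected sidecar container started and is mounting.
kubectl logs <pod-name> -c gke-gcsfuse-sidecar --tail=50
```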
Long-term fix
We are actively working on fixing this issue.
Thank you - workaround 2, injecting the `gke-gcsfuse-sidecar` container as a regular container, is working :)
Hi, I encountered the same issue, and I was able to solve it using workaround 2 mentioned here, thanks.
I have a quick question. Could you give me the reason why workaround 1 uses v1.28? I understand that the native sidecar container feature was introduced in v1.28 (ref), so I thought we should use an even older version for workaround 1.
FYI: My service worked without issues on v1.28, but after upgrading to v1.29 I encountered this issue. The Datadog webhook removed `restartPolicy: Always`.
GKE Autopilot cluster, Rapid release channel - cluster and nodes at 1.30.2-gke.1587003.
The Deployment is a Knative service. The Pod does not start, with reason "Container istio-proxy is waiting".
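For anyone debugging the same symptom, the waiting reason per container can be pulled straight from the Pod status (`<pod-name>` is a placeholder):

```sh
# kubectl describe pod <pod-name> also works; this prints just the waiting
# reason for each container, one per line.
kubectl get pod <pod-name> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
```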
Similar issues: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/20 https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/53
As per https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#pod-annotations the Pod annotations include:
Full Pod spec:
GCSFuse container logs: