dippynark opened 2 months ago
Hi @dippynark,
The directory switch operation you are mentioning (https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144) has a lock to avoid a race condition: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/4ad4e2c4389c177a5f88a6eed9cfaee82e04ef9b/pkg/csi_mounter/csi_mounter.go#L66
Could you share more details about your Pod scheduling pattern? Specifically, how many Pods are you scheduling to the same node at the same time? Thank you!
Hi @songjiaxun, thanks for clearing that up.
There are 3 CronJobs each creating a Job every minute which each run one Pod that mounts a GCS bucket. All Pods are mounting the same GCS bucket.
Each Pod does a small amount of processing and then exits, so each Job typically takes between 30 and 40 seconds to run. We're using `concurrencyPolicy: Forbid`
on the CronJobs so we don't get more Jobs running than CronJobs, even if a Job occasionally takes longer than a minute to complete.
We are also using the `optimize-utilization` GKE autoscaling profile, which means the 3 Pods are typically all scheduled to the same node at similar times.
Also, after seeing the socket error, we then started seeing lots of errors like the following (which we weren't seeing before the socket error):
/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Aborted desc = An operation with the given volume key /var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount already exists
Thanks @dippynark for reporting this issue. I am trying to reproduce this on my dev env now.
Also, @dippynark, as we are moving forward to newer Kubernetes versions, could you consider upgrading your cluster to 1.29? As you mentioned, the kubelet has a fix for the housekeeping logic, and we will have a better chance of pushing any potential fixes much faster to newer Kubernetes versions.
Hi @songjiaxun, thanks. Yes, we are working on upgrading the cluster to the latest version in the stable channel, which should hopefully stop this issue from recurring.
Issue
We have observed the following error when using the GCS FUSE CSI Driver on GKE:
It appears the socket file could not be found after being created: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L165-L167
Perhaps there is a race condition when changing directory? https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144
Impact
This issue seemed to cause the outcome described in the known issues doc where FUSE mount operations hang. I guess this is because socket creation happens after creating the FUSE mount but before passing the file descriptor to the GCS FUSE CSI Driver sidecar.
This interacted with a known kubelet issue where Pod cleanup hangs due to an unresponsive volume mount: https://github.com/kubernetes/kubernetes/issues/101622
This then led to all Pod actions stalling on the node: https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/kubelet/kubelet.go#L148-L151
Confusingly, the node was not marked as unhealthy when this happened; however, this seems to be due to an unrelated GKE node-problem-detector misconfiguration which I won't detail here. Unfortunately, since this occurred in a production environment, we needed to delete the node manually to bring the cluster back to a healthy state, so it is no longer around to verify this theory.
This issue has happened twice now on different nodes in the same cluster over the last week.
Note that the kubelet issue seems to have been fixed now, but not in the version of Kubernetes we are using: https://github.com/kubernetes/kubernetes/pull/119968
Environment
GKE version:
v1.27.11-gke.1062004
GCS FUSE version: v1.4.1-gke.0