dippynark opened 2 months ago
Hi @dippynark,
The directory switch operation you are mentioning (https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144) has a lock to avoid a race condition: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/4ad4e2c4389c177a5f88a6eed9cfaee82e04ef9b/pkg/csi_mounter/csi_mounter.go#L66
Could you share more details about your Pod scheduling pattern? Specifically, how many Pods are you scheduling to the same node at the same time? Thank you!
Hi @songjiaxun, thanks for clearing that up.
There are 3 CronJobs each creating a Job every minute which each run one Pod that mounts a GCS bucket. All Pods are mounting the same GCS bucket.
Each Pod does a small amount of processing and then exits, so each Job typically takes between 30 and 40 seconds to run. We're using `concurrencyPolicy: Forbid`
on the CronJobs so we don't get more Jobs running than CronJobs, even if a Job occasionally takes longer than a minute to complete.
We are also using the `optimize-utilization` GKE autoscaling profile, which means the 3 Pods are typically all scheduled to the same node at similar times.
Also, after seeing the socket error, we then started seeing lots of errors like the following (which we weren't seeing before the socket error):
/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Aborted desc = An operation with the given volume key /var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount already exists
Thanks @dippynark for reporting this issue. I am trying to reproduce this on my dev env now.
Also, @dippynark, as we are moving forward to newer Kubernetes versions, could you consider upgrading your cluster to 1.29? As you mentioned, the kubelet has a fix for the housekeeping logic, and we will have a better chance of pushing any potential fixes much faster to newer Kubernetes versions.
Hi @songjiaxun, thanks. Yes, we are working on upgrading the cluster to the latest version in the stable channel, which should hopefully stop this issue from recurring.
Issue
We have observed the following error when using the GCS FUSE CSI Driver on GKE:
It appears the socket file could not be found after being created: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L165-L167
Perhaps there is a race condition when changing directory? https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144
Impact
This issue seemed to cause the outcome described in the known issues doc where FUSE mount operations hang. I guess this is because socket creation happens after creating the FUSE mount but before passing the file descriptor to the GCS FUSE CSI Driver sidecar.
This interacted with a known kubelet issue where Pod cleanup hangs due to an unresponsive volume mount: https://github.com/kubernetes/kubernetes/issues/101622
This then led to all Pod actions stalling on the node: https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/kubelet/kubelet.go#L148-L151
Confusingly, the node was not marked as unhealthy when this happened; however, this seems to be due to an unrelated GKE node-problem-detector misconfiguration which I won't detail here. Unfortunately, since this occurred in a production environment, we needed to delete the node manually to bring the cluster back to a healthy state, so it is no longer around to verify this theory.
This issue has happened twice now on different nodes in the same cluster over the last week.
Note that the kubelet issue seems to have been fixed now, but not in the version of Kubernetes we are using: https://github.com/kubernetes/kubernetes/pull/119968
Environment
GKE version:
v1.27.11-gke.1062004
GCS FUSE version: v1.4.1-gke.0