Artebomba opened 7 months ago
@Artebomba thanks for the detailed report! It seems that you have run into the same problem as #335. What we have been able to find out is that this may be caused by a change in volume attachment introduced in K8s 1.27, and we need to update csi-s3 to reflect this.
We are working on this issue and hope to have an update soon.
What happened?
I am using an s3 bucket as a volume for my app running in k8s (deployment, 1 replica, rolling update).
When I triggered the deployment of a new revision of my app, the new pod came up and the s3 bucket was attached to it. However, the old pod failed because it was terminated with exit code 137 (probably SIGKILL rather than OOM, due to a small graceful-shutdown window, as I don't see any memory-related issues right now).
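A minimal way to double-check whether the container was OOM-killed or just SIGKILLed after the grace period (the pod name is the one referenced later in this report; the namespace flag is omitted):

```sh
# Show the terminated state of the old pod's containers: reason "OOMKilled"
# vs. reason "Error" with exit code 137 distinguishes an OOM kill from a
# plain SIGKILL at the end of the graceful-shutdown window.
kubectl get pod survey-service-84cf8d9d49-xhbxq \
  -o jsonpath='{.status.containerStatuses[*].state.terminated}{"\n"}{.status.containerStatuses[*].lastState.terminated}{"\n"}'
```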
There is an issue with the old pod, which is stuck in a terminating state, likely due to a volume problem.
Datashim cannot unmount the volume from the node where the old pod was running. csi-s3 pod (csi-s3 container, daemonset) log:
```
Output: umount: can't unmount /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount: Invalid argument
```

This message repeats indefinitely. Kubelet produces similar logs on the node where the failed pod is.
Even after I forcefully deleted the failed pod, the errors did not disappear.
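For reference, the force delete was something like the following (a sketch; namespace flag omitted):

```sh
# Bypass graceful termination and remove the pod object from the API server.
# Note: this only removes the API object; the stale CSI mount on the node is
# left behind, which matches the unmount errors continuing afterwards.
kubectl delete pod survey-service-84cf8d9d49-xhbxq --force --grace-period=0
```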
Pod description: `kubectl get pod survey-service-84cf8d9d49-xhbxq -o yaml`
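Besides the full YAML, these are the fields I would expect to explain a pod held in Terminating (a sketch of extra checks, not output attached here):

```sh
# A pod stays in Terminating while deletionTimestamp is set and kubelet has
# not yet reported its volumes/containers as cleaned up; finalizers, if any,
# would also block deletion.
kubectl get pod survey-service-84cf8d9d49-xhbxq \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```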
When I go to the node where the failed pod is, there is no active fuse filesystem mounted to `/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/*`.
In the csi-s3 container, I can still find a leftover goofys process for the
`/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/`
volume.

The pod's directory
`/var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount`
actually exists on the node, but it is empty. It appears that the old volume's filesystem is no longer mounted, as it is not visible in the output of `df -hT -t fuse`.

I guess that my pod is stuck in a terminating state because Kubelet cannot finish some tasks (maybe admission controllers or the garbage collector are involved) and leaves the pod in this state. I want to fix that.
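In case it is useful, here is a rough sketch of how the leftover state on the node could be inspected (assuming shell access there; the lazy unmount at the end is an untested guess on my side, not a verified fix):

```sh
# Look for a stale FUSE mount and a leftover goofys process for the old pod.
findmnt -t fuse,fuse.goofys | grep 36180d66-5fa5-4393-a84d-df95afe5a369 || echo "no matching fuse mount"
ps aux | grep '[g]oofys'

# Untested guess: lazily unmount the stale target path so kubelet can finish
# volume teardown and stop the repeating "Invalid argument" umount errors.
umount -l /var/lib/kubelet/pods/36180d66-5fa5-4393-a84d-df95afe5a369/volumes/kubernetes.io~csi/pvc-1ec40a67-78bd-4501-9dd8-bd2c4cb58648/mount
```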
Worth mentioning that if the pod finishes without an error (i.e. it is not failed), no s3 errors or problems occur. Thanks in advance.
What did you expect to happen?
Kubelet completely terminates the failed pod, and volume management stays healthy.
Kubernetes version
Cloud provider
AWS EKS 1.27
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)