Persistent volumes in stuck state after reboot

coreos / container-linux-update-operator

A Kubernetes operator to manage updates of Container Linux by CoreOS

Apache License 2.0


Persistent volumes in stuck state after reboot #191

Open dustinmm80 opened 5 years ago

dustinmm80 commented 5 years ago

It appears there is a race condition when using persistent volumes, where the pod is deleted and the node is rebooted, but the attached volume is still in the process of detaching. Once this happens, the persistent volume is stuck and must be manually removed and recreated.

I'm seeing this on AWS with EBS volumes.

metalmatze commented 5 years ago

I'm not sure how that relates to CLUO. Are you running its Pods with volumes?

dustinmm80 commented 5 years ago

No, I'm not running CLUO with volumes. Is it possible that when the operator terminates the pods, it reboots the node before the PVs are properly detached?

embik commented 5 years ago

Hey @dustinmm80, are you by chance using the CSI implementation of EBS volumes?

I see something similar with the Cinder CSI driver and I suspect it to be related to VolumeAttachment resources or rather the fact that CSI components might not be fast enough to detach volumes before CLUO reboots the machine.
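To make the suspicion concrete, here is a minimal sketch of how one might list the volumes that still have a VolumeAttachment on a node. It assumes the input is the parsed JSON from `kubectl get volumeattachments -o json`; the function name `pending_detachments` is hypothetical and this is not something CLUO does today. It conservatively treats any VolumeAttachment object that still exists for the node as pending, which is an assumption rather than confirmed driver behaviour.

```python
from typing import Any, Dict, List


def pending_detachments(volume_attachments: Dict[str, Any], node_name: str) -> List[str]:
    """Return the volumes that still have a VolumeAttachment on node_name.

    volume_attachments is assumed to be the parsed list object returned by
    `kubectl get volumeattachments -o json`.
    """
    pending = []
    for item in volume_attachments.get("items", []):
        spec = item.get("spec", {})
        if spec.get("nodeName") != node_name:
            continue
        # Assumption: an object that still exists means the detach has not
        # completed, regardless of the current value of status.attached.
        name = spec.get("source", {}).get("persistentVolumeName")
        if name is None:
            # Fall back to the object name when no PV name is recorded.
            name = item.get("metadata", {}).get("name", "<unknown>")
        pending.append(name)
    return pending
```

An empty result for the node about to be rebooted would suggest that nothing is left to detach.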

Just wanted to check in before investigating this.

embik commented 5 years ago

Thinking about it, I'm not sure it's only related to CSI. But it's probably a major issue with StatefulSets, because Kubernetes won't create a new statefulset-example-0 on another node before the old one has finished deleting.

And only after the new StatefulSet pod has been scheduled to a new node will the CSI components start churning and update the VolumeAttachment, which unmounts the volume on the old node and mounts it on the new node. But that process takes a few seconds, and by then the old node is already being rebooted.
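If that timing is the problem, one possible mitigation would be to hold the reboot until the detach has finished or a timeout expires. The following is only a sketch of that idea: `wait_until_detached` and its parameters are hypothetical, and `count_attached` stands in for whatever check reports how many volumes are still attached to the old node.

```python
import time
from typing import Callable


def wait_until_detached(
    count_attached: Callable[[], int],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no volumes remain attached, or give up after timeout_s.

    Returns True when it is considered safe to reboot, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_attached() == 0:
            return True
        if clock() >= deadline:
            # Detach did not finish in time; the caller decides what to do.
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. Whether to reboot anyway on a timeout is a policy decision that this sketch leaves to the caller.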

yannh commented 5 years ago

We had the same issue a few weeks ago - it turned out to be an issue with the newer Nitro instances (c5, t3, ...). AWS investigated and claims to have now solved the problem. Were you seeing that problem with Nitro instances? Are you still seeing the problem?