digitalocean / csi-digitalocean

A Container Storage Interface (CSI) Driver for DigitalOcean Block Storage
Apache License 2.0
577 stars · 107 forks

Detach volume from node when node is not available. #160

Closed: feluxe closed this issue 5 years ago

feluxe commented 5 years ago

What did you do?

Multi-Attach error for volume ...

What did you expect to happen?

After the node shuts down, I expect the controller to detach the volume from the node, so that the attempt to create the new pod on the second node won't fail.

Configuration:

I just tried this with a 1.14 cluster that I created via the DO web-interface.

paintcast commented 5 years ago

Same issue on v1.11.10-do.f.1.

Multi-Attach error for volume Volume is already exclusively attached to one node and can't be attached to another

I tried deleting my PVC: the PVC was deleted, but the PV got stuck in 'Terminating' even though the underlying Volume was deleted and no longer appears in the DO web interface.

I created a new PVC and updated my Deployment to use it. The new Pod was created successfully with a new, empty Volume. I then attached the old Volume to the node where the new Pod was scheduled and manually copied the data from the old Volume to the new one with rsync.

This doesn't fix the issue, but it lets you recover the data from the old volume if a node crashes.
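The recovery steps above can be sketched roughly as follows. This is only data recovery, not a fix, and every name, ID, and path below is a placeholder to be replaced with values from your own cluster:

```shell
# 1. Delete the stuck PVC (the bound PV may hang in 'Terminating').
kubectl delete pvc old-data

# 2. Create a fresh PVC and point the Deployment at it
#    (here via a JSON patch on the first volume entry).
kubectl apply -f new-pvc.yaml
kubectl patch deployment my-app --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/volumes/0/persistentVolumeClaim/claimName",
   "value": "new-data"}]'

# 3. Attach the old DO volume to the droplet running the new Pod.
doctl compute volume-action attach <old-volume-id> <droplet-id>

# 4. From inside that node, mount the old volume read-only and
#    rsync the data onto the new volume's mount point.
ssh root@<node-ip>
mount -o ro /dev/disk/by-id/scsi-0DO_Volume_old-data /mnt/old
rsync -a /mnt/old/ /path/to/new/volume/mount/
```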

MikeMichel commented 5 years ago

And once again I end up in the CSI repo of a cloud provider while testing node/volume failover. Even if there is a manual workaround with doctl and kubectl, it defeats the purpose of running a "cluster" when the most important apps (the ones with data) cannot fail over without manual intervention.
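For reference, the manual intervention mentioned here boils down to something like the following; all IDs and names are placeholders, and doctl must be authenticated against the account owning the volume:

```shell
# Identify the stuck volume and the unreachable droplet it is attached to.
doctl compute volume list
doctl compute droplet list

# Detach the volume from the dead node.
doctl compute volume-action detach <volume-id> <droplet-id>

# Remove the dead node object so the scheduler and the
# attach/detach controller stop waiting on it.
kubectl delete node <dead-node-name>
```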

kubernetes v1.15.3

snormore commented 5 years ago

Hey there 👋

CSI external-attacher v1.2.0 includes fixes for the issues contributing to this behaviour, and it is available in our latest 1.15 and 1.14 DOKS images (1.15.3-do.2+ and 1.14.6-do.2+).

I've put together an example showing the expected behaviour during failover as described: https://gist.github.com/snormore/8e6b62dc7fd5b3f823b416bb90619081

Can you confirm which version of DOKS you're using, and if the issue described persists for you after trying with one of the latest versions?
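To check the versions in question, something like the following works; note that the `app=csi-do-controller` label selector is an assumption and may vary between driver releases:

```shell
# Kubernetes server version (looking for 1.15.3-do.2+ or 1.14.6-do.2+).
kubectl version --short

# Image tags of the CSI controller pod, including the external-attacher sidecar.
kubectl -n kube-system get pods -l app=csi-do-controller \
  -o jsonpath='{range .items[*].spec.containers[*]}{.image}{"\n"}{end}'
```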

timoreimann commented 5 years ago

Closing as this issue is presumably solved by using a recent enough release. Please post if this is not the case.

captainjapeng commented 4 years ago

I believe this is what happened to my cluster (1.16.8-do.0) last week. Upon checking, 1 of 5 nodes was unresponsive, and the Kubernetes Dashboard's Nodes view showed it with a question-mark icon. Pods were able to move to a new node except Postgres (a Deployment managed by the crunchydata postgres-operator), which was stuck waiting for the volume to detach from the unresponsive node.

timoreimann commented 4 years ago

@captainjapeng thanks for reporting. My hunch is that you are seeing a somewhat different issue.

If you're a DOKS user, could you please submit a support ticket so that we can look into your case? If you're managing your own cluster running on DigitalOcean infrastructure, could I ask you to file a new issue on this repo and share all relevant logs and information?

Thank you!

captainjapeng commented 4 years ago

I'm using DOKS, though I have already deleted the affected node and scaled down the cluster. Would that still help?

timoreimann commented 4 years ago

@captainjapeng it's hard to say ahead of time -- CSI-related issues tend to cover a wide spectrum of causes. (Stateful systems are hard.) We could take a look nevertheless if you decided to submit a support request.

captainjapeng commented 4 years ago

The issue has occurred again. I've opened Ticket #3490383 with the screenshots and logs.