IBM / power-openstack-k8s-volume-driver

Apache License 2.0

Server death does not automatically release LUN mapping #6

Open dannert opened 4 years ago

dannert commented 4 years ago

I deployed a mongodb-dev Helm chart with version 1.0.2 of the driver. After a successful deployment, I killed the server hosting MongoDB to simulate a server failure and observe the recovery behavior.

Based on my testing, the MongoDB database does not recover without significant manual intervention, because the Flex driver does not force an unmap and remap of the LUN to the new worker node.

Issue 1) Deleting the pod from the CLI hangs. Kubernetes registers the delete, puts the pod into Terminating, and starts a new pod on another worker. Creation of the new pod fails because the Flex driver does not force-unmap the LUN, even with the worker node down and the old pod in Terminating, and therefore does not map the LUN to the new worker. The question is: why does the pod deletion hang? Could that be in the Flex driver?

Issue 2) No forced unmap / remap.
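For context on the hang in Issue 1: a normal pod delete waits for the kubelet on the hosting node to confirm termination, so with the node down the pod stays in Terminating until it is force-deleted or the node comes back. The sketch below shows the kind of manual intervention this implies, using kubectl's `--grace-period=0 --force` flags. It is only an illustration under assumptions: the pod name and namespace are hypothetical, the helper functions are not part of the driver, and a forced delete only removes the pod object. Whether it also leads to the LUN being unmapped was not tested here.

```python
import shlex
import subprocess


def force_delete_command(pod_name, namespace="default"):
    """Build the kubectl command that force-deletes a pod stuck in Terminating.

    pod_name and namespace are placeholders supplied by the caller.
    """
    return [
        "kubectl", "delete", "pod", pod_name,
        "--namespace", namespace,
        "--grace-period=0",  # do not wait for graceful termination
        "--force",           # remove the pod object without kubelet confirmation
    ]


def force_delete_pod(pod_name, namespace="default", dry_run=True):
    """Print the command (dry run) or execute it against the current cluster."""
    cmd = force_delete_command(pod_name, namespace)
    if dry_run:
        # Only show what would be run; nothing is sent to the cluster.
        print(shlex.join(cmd))
        return cmd
    subprocess.run(cmd, check=True)
    return cmd
```

Calling `force_delete_pod("mongodb-dev-0")` with the default `dry_run=True` only prints the command; `mongodb-dev-0` is a made-up pod name, not one taken from the deployment above.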

The only way to get MongoDB running again cleanly, without manual intervention, is to restart the "failed" worker node. At that point the "Delete POD" command completes and the LUN is successfully remapped. I believe that in "real life" it is fairly unlikely that a server which "died" comes back within a short time frame, so any application relying on that MongoDB would hang.

Screenshot of the new MongoDB pod's log illustrating the issue: image