Orange-OpenSource / casskop

This Kubernetes operator automates Cassandra operations such as deploying a new rack-aware cluster, adding/removing nodes, configuring C* and JVM parameters, upgrading JVM and C* versions, and more.
https://orange-opensource.github.io/casskop/
Apache License 2.0

[BUG] Casskop fails to clean up PVC and refuses to handle user requests after crash and restart #370

Closed · srteam2020 closed this issue 2 years ago

srteam2020 commented 3 years ago

Bug Report

We find that scaling down a single DC rack (by reducing nodesPerRacks) can leave the cluster in a dirty state (the pod is deleted but the PVC is still there) if the operator crashes in the middle of a reconcile and restarts. This dirty state also prevents the operator from handling any future user requests.

More concretely, when scaling down the DC rack (a statefulset), casskop does the following:

  1. detect that there is a decommission task and set the in-memory CR.podLastOperation.status (previously StatusOngoing) to StatusFinalizing, without issuing an Update to Kubernetes yet
  2. update the statefulset's replicas to delete the decommissioned pod
  3. update the CR at the end of the reconcile, which persists the change to podLastOperation (to StatusFinalizing)
  4. in the next round of reconcile:
    1. if podLastOperation is StatusOngoing, try to get the decommissioned pod. If there is an error, the operator returns the error directly. Otherwise, the operator continues its reconcile.
    2. if podLastOperation is StatusFinalizing, try to get the decommissioned pod. If a NotFound error is encountered, delete the PVC and set podLastOperation.status to StatusDone

Say we set nodesPerRacks from 2 to 1. The operator runs the steps above. If the operator pod crashes right after step 2, the decommissioned pod is deleted (since the statefulset has been resized), but podLastOperation is still StatusOngoing (since step 3 has not executed yet). After the operator pod restarts, it takes branch 4.i, and since the pod is already deleted, trying to get it yields a NotFound error. The operator simply ends every round of reconcile by returning that error, and it is never able to clean up the PVC or serve further user requests.
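To make the wedge concrete, here is a minimal Go sketch of the branch logic in step 4 as we understand it; handleDecommission, getPod, and deletePodPVC are illustrative stand-ins, not casskop's actual identifiers:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Illustrative status values; the real constants live in the operator's API package.
const (
	StatusOngoing    = "Ongoing"
	StatusFinalizing = "Finalizing"
	StatusDone       = "Done"
)

// Stubs standing in for the operator's Kubernetes client calls.
var (
	getPod       func(name string) (*corev1.Pod, error)
	deletePodPVC func(name string) error
)

// handleDecommission is a hypothetical condensation of step 4 above,
// not the actual casskop function.
func handleDecommission(status *string, podName string) (reconcile.Result, error) {
	switch *status {
	case StatusOngoing:
		// Branch 4.i: after the crash, the pod is already gone (step 2 ran)
		// but StatusFinalizing was never persisted (step 3 did not run), so
		// getPod returns NotFound and every retry bails out right here.
		// The PVC is never deleted and later user requests are never served.
		if _, err := getPod(podName); err != nil {
			return reconcile.Result{}, err
		}
		// Otherwise the decommission is still in progress; keep reconciling.
	case StatusFinalizing:
		// Branch 4.ii: the only branch that deletes the PVC. It is
		// unreachable after the crash because the persisted status is
		// still StatusOngoing.
		if _, err := getPod(podName); apierrors.IsNotFound(err) {
			if err := deletePodPVC(podName); err != nil {
				return reconcile.Result{}, err
			}
			*status = StatusDone
		}
	}
	return reconcile.Result{}, nil
}
```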

What did you do? Set nodesPerRacks from 2 to 1.

What did you expect to see? The pod and the PVC both get deleted.

What did you see instead? Under which circumstances? The pod is deleted but the PVC is still there, and any further user operation is refused by the operator.

Environment

Possible Solution A potential solution is to issue the Update directly after changing CR.podLastOperation.status to StatusFinalizing in step 1, so that even if the operator crashes in the middle of a reconcile, it can still resize the statefulset, delete the PVC, and eventually move to StatusDone.
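As a rough sketch of the proposed ordering, assuming a controller-runtime style client (the function name and parameters are illustrative, not casskop's actual code):

```go
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const StatusFinalizing = "Finalizing"

// scaleDownRack condenses the proposed fix: the CR update (step 3)
// moves before the statefulset resize (step 2).
func scaleDownRack(ctx context.Context, c client.Client, cc client.Object,
	podLastOperationStatus *string, sts *appsv1.StatefulSet, newReplicas int32) (reconcile.Result, error) {

	// Step 1': set StatusFinalizing and persist it to the API server
	// immediately, instead of only mutating the in-memory CR.
	*podLastOperationStatus = StatusFinalizing
	if err := c.Update(ctx, cc); err != nil {
		return reconcile.Result{}, err
	}

	// Step 2 (unchanged): shrink the statefulset, which deletes the
	// decommissioned pod.
	sts.Spec.Replicas = &newReplicas
	if err := c.Update(ctx, sts); err != nil {
		return reconcile.Result{}, err
	}

	// A crash anywhere after the first Update is now harmless: on restart
	// the persisted status is already StatusFinalizing, so the NotFound on
	// the deleted pod routes into the PVC-deletion branch (4.ii) instead
	// of the early return in 4.i.
	return reconcile.Result{}, nil
}
```

The key point is that the status write reaches the API server before the pod-deleting write, so either crash ordering leaves the CR in a state the next reconcile can recover from.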

Additional context We are willing to send a PR to help fix this issue.

cscetbon commented 3 years ago

Can you confirm it happens when deletePVC is set to true? Because otherwise it's expected.

srteam2020 commented 3 years ago

@cscetbon Thanks for the reply!

> Can you confirm it happens when deletePVC is set to true? Because otherwise it's expected.

Yes, we set deletePVC to true, so the PVC is supposed to be deleted. It does not get deleted because the controller crashes at a particular point and cannot complete all of the reconcile updates. We have read the source code carefully to draw this conclusion. More concretely, the decommissioned pod is deleted while podLastOperation is still StatusOngoing. Although the controller restarts, it cannot make progress from this inconsistent state to delete the PVC.

We are working on a PR to fix it. A potential approach is to switch the update/delete order to avoid the inconsistent state.

srteam2020 commented 3 years ago

This bug is hard to trigger, as it only manifests when the crash happens at a particular moment. But once triggered, the controller is not able to recover. We have an open-source tool that can reliably reproduce this bug (with deletePVC set to true), which helped us diagnose the problem. Please let us know if you would also like to reliably reproduce the bug, and we can help you with that.