cscetbon closed this issue 4 years ago
A decommissioned node appears in LeavingNodes during the decommission operation before moving to UnreachableNodes status for a while. Since your check doesn't find the node in LiveNodes and returns 1, the liveness probe will fail and kubelet will restart the pod endlessly... Even a stopped node appears in Unreachable status. This is the behaviour I can see when we request the status from JMX via StorageServiceMBean.
I also need to test it within a cluster and casskop to confirm what I mentioned above.
@ahmedjami I ran some tests, and a node that is leaving or joining stays in the LiveNodes list as long as it's alive and in the ring, so it won't be a problem. See these logs https://pastebin.com/raw/8saiJXgS to see it in action.
@cscetbon, yes, the node appears in both statuses, LeavingNodes and LiveNodes, when we decommission it. So based on what you check in the liveness probe, this will achieve the purpose here :)
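The decision discussed above can be sketched as follows. This is only an illustration of the logic, not casskop's actual probe code; the function name and the IP values are hypothetical. The point is that a leaving or joining node remains in LiveNodes, so a membership check against LiveNodes alone does not fail mid-decommission:

```python
def is_alive(node_ip, live_nodes, unreachable_nodes):
    """Return True if the pod should be considered live.

    A decommissioning node stays in LiveNodes until it actually leaves
    the ring, so leaving/joining pods are not restarted by kubelet.
    """
    return node_ip in live_nodes and node_ip not in unreachable_nodes

# A node mid-decommission: listed in both LeavingNodes and LiveNodes.
live = ["10.0.0.1", "10.0.0.2"]
unreachable = []
print(is_alive("10.0.0.2", live, unreachable))  # True: the probe passes
```

With this check, a node only fails the probe once it has actually left the ring (dropped from LiveNodes) or become unreachable, which is the behaviour described in the pastebin logs above.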
| Bug fix? | yes |
What's in this PR?
This change avoids having k8s pods fail while they are decommissioning from or joining the cluster. Thanks to CASSANDRA-7069, Cassandra prevents two nodes from attempting to join the cluster in parallel.
Additional context
The goal of this PR is to avoid having pods restarted by the liveness timeout when they take too long to join or leave a cluster.