Orange-OpenSource / casskop

This Kubernetes operator automates Cassandra operations such as deploying a new rack-aware cluster, adding/removing nodes, configuring C* and JVM parameters, upgrading JVM and C* versions, and more.
https://orange-opensource.github.io/casskop/
Apache License 2.0

Use readiness script for liveness probe too #234

Closed (cscetbon closed 4 years ago)

cscetbon commented 4 years ago

Bug fix? yes

What's in this PR?

This change prevents k8s pods from failing while they are decommissioning from or joining the cluster. Thanks to CASSANDRA-7069, Cassandra prevents two nodes from attempting to join the cluster in parallel.

Additional context

The goal of this PR is to avoid pods being restarted because of the liveness timeout when they take too long to join or leave a cluster.

ahmedjami commented 4 years ago

A node being decommissioned appears in LeavingNodes during the decommission operation, and then stays in UnreachableNodes status for a while. Since your check won't find the node in LiveNodes and will return 1, the liveness probe will fail and kubelet will try to restart the pod endlessly. Even a stopped node appears with Unreachable status. This is the behaviour I see when requesting the status over JMX from StorageServiceMBean.

I also need to test it in a cluster with casskop to confirm what I mentioned above.

cscetbon commented 4 years ago

@ahmedjami I ran some tests, and a node that is leaving or joining stays in the LiveNodes list as long as it's alive and in the ring, so it won't be a problem. Look at these logs https://pastebin.com/raw/8saiJXgS to see it in action.

ahmedjami commented 4 years ago

@cscetbon, yes, the node appears in both statuses, Leaving and Live Nodes, when we decommission it. So given what you check in the liveness probe, this achieves the purpose here :)
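
As a rough illustration of the check discussed in this thread, here is a minimal Python sketch of the decision logic only. It is not the actual readiness script from the PR. The function name `liveness_exit_code`, the IP addresses, and the dictionary snapshots are all hypothetical. In practice the node lists would come from StorageServiceMBean over JMX.

```python
from typing import Iterable


def liveness_exit_code(own_address: str, live_nodes: Iterable[str]) -> int:
    """Return 0 (probe passes) if this node is listed in LiveNodes, else 1."""
    return 0 if own_address in set(live_nodes) else 1


# Hypothetical snapshot while 10.0.0.2 is decommissioning: as observed in the
# thread, the node is listed in both LeavingNodes and LiveNodes.
decommissioning = {
    "LiveNodes": ["10.0.0.1", "10.0.0.2"],
    "LeavingNodes": ["10.0.0.2"],
    "UnreachableNodes": [],
}

# Hypothetical snapshot once 10.0.0.2 has stopped: it is only Unreachable.
stopped = {
    "LiveNodes": ["10.0.0.1"],
    "LeavingNodes": [],
    "UnreachableNodes": ["10.0.0.2"],
}

print(liveness_exit_code("10.0.0.2", decommissioning["LiveNodes"]))  # 0
print(liveness_exit_code("10.0.0.2", stopped["LiveNodes"]))  # 1
```

Under these assumptions the probe keeps passing while the node is leaving, because only membership in LiveNodes is tested, and it fails only once the node has dropped out of that list.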