Orange-OpenSource / casskop

This Kubernetes operator automates Cassandra operations such as deploying a new rack-aware cluster, adding/removing nodes, configuring the C* and JVM parameters, upgrading JVM and C* versions, and many more.
https://orange-opensource.github.io/casskop/
Apache License 2.0
183 stars, 54 forks

Cassandra nodes cannot form a cluster after simultaneous restart #384

Closed toha10 closed 2 years ago

toha10 commented 2 years ago

Bug Report

What did you do? We deployed a Cassandra cluster with 3 nodes in rack1 of dc1. Each pod runs on a different k8s worker node. We rebooted all 3 k8s workers simultaneously.

What did you expect to see? All 3 Cassandra pods come up and successfully sync with each other.

What did you see instead? Under which circumstances? All 3 Cassandra pods started the bootstrap process simultaneously and could not verify connectivity to each other. Link to seed election: https://github.com/Orange-OpenSource/casskop/blob/master/docker/bootstrap/files/run.sh#L31 As a result, every Cassandra node chose only itself as a seed node. We ended up with a desynchronized cluster, because the newly spawned pods received new IP addresses. Output of "nodetool status" after the simultaneous reboot, from host a58343d0-1e3f-4d54-bcdf-9b9b949ca873:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <old ip address> ?          256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  r1
DN  <old ip address> ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <new ip address> 3.57 GiB   256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  rack1

The other two neighbours show a similar status: the node itself is in UN state with its current IP, and the other nodes are in DN state with their old IP addresses.
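To illustrate the failure mode described above, here is a minimal sketch of a "pick reachable seeds, otherwise fall back to yourself" rule. It is not the code from run.sh; the function name `choose_seeds`, the probe callback, and the 10.0.0.x addresses are hypothetical and only model the behaviour reported in this issue.

```python
from typing import Callable, Iterable, List


def choose_seeds(
    own_ip: str,
    candidates: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> List[str]:
    """Keep the peers that answer a connectivity probe.

    If no peer answers, fall back to the node's own IP, which is the
    situation the issue describes (hypothetical model, not run.sh itself).
    """
    reachable = [ip for ip in candidates if ip != own_ip and is_reachable(ip)]
    return reachable if reachable else [own_ip]


# Placeholder pod IPs, for illustration only.
pods = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Simultaneous restart: no peer is listening yet when the probe runs.
simultaneous = {ip: choose_seeds(ip, pods, lambda peer: False) for ip in pods}

# Rolling restart of 10.0.0.1: the two other pods are still up.
rolling = choose_seeds("10.0.0.1", pods, lambda peer: True)
```

Under these assumptions, `simultaneous` maps every pod to a seed list containing only itself, so three single-node rings are formed, whereas `rolling` contains the two surviving peers and the restarted node can gossip its new address to them.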

Environment

Possible Solution CassKop monitors the hostId map and compares the information from Jolokia with the actual Cassandra pod IPs. If there is a mismatch, the operator could restart one of the nodes to re-trigger bootstrap. The restarted node would then handshake with the other nodes and update the host IPs.
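The comparison proposed here could look roughly like the sketch below. It assumes the operator already holds one node's endpoint-to-host-ID view (however it is fetched) and the set of current pod IPs; the function name `stale_endpoints` and all IP addresses are hypothetical placeholders, while the host IDs are the ones from the nodetool output in this report.

```python
from typing import Dict, Set


def stale_endpoints(
    endpoint_to_host_id: Dict[str, str],
    pod_ips: Set[str],
) -> Dict[str, str]:
    """Return the ring entries whose IP no longer belongs to any running pod."""
    return {
        ip: host_id
        for ip, host_id in endpoint_to_host_id.items()
        if ip not in pod_ips
    }


# View of the ring as seen by host a58343d0-... (IPs are placeholders).
ring_view = {
    "10.0.0.1": "7324ebc4-577a-425f-b3de-96faac95a331",  # old IP, DN
    "10.0.0.2": "67f1d07c-8b13-4482-a2f1-77fa34e90d48",  # old IP, DN
    "10.0.1.7": "a58343d0-1e3f-4d54-bcdf-9b9b949ca873",  # new IP, UN
}

# IPs currently assigned to the Cassandra pods (placeholders).
current_pod_ips = {"10.0.1.5", "10.0.1.6", "10.0.1.7"}

stale = stale_endpoints(ring_view, current_pod_ips)
needs_restart = bool(stale)
```

A non-empty `stale` result would be the mismatch signal; which node to restart, and how often to retry, is left open here just as it is in the proposal.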

fdehay commented 2 years ago

Hello @toha10, thanks for using CassKop! Looking at the issue, it seems to me that a simultaneous restart of all nodes in K8s is not really a production use case. That's the idea behind racks in Cassandra: you are not supposed to restart all nodes in all racks at the same time.

So this issue will be ignored on our side. I am sorry, but we are running very low on resources for this project and we struggle to keep up with other PRs.

If you want, you can submit a PR on this, but I am not sure we will be able to include it. Regards

cscetbon commented 2 years ago

Yes, you should only do a rolling restart if you really want to restart them all. Closing this issue.