Cassandra nodes cannot form a cluster after simultaneous restart

toha10 commented 2 years ago

Bug Report

What did you do? We deployed a Cassandra cluster with 3 nodes in rack1 of dc1. Each pod runs on a different k8s worker node. We rebooted all 3 k8s workers simultaneously.

What did you expect to see? All 3 Cassandra pods come up and sync successfully.

What did you see instead? Under which circumstances? All 3 Cassandra pods started the bootstrap process simultaneously and were unable to verify connectivity to each other. Link to the seed election: https://github.com/Orange-OpenSource/casskop/blob/master/docker/bootstrap/files/run.sh#L31 As a consequence, every Cassandra node chose only itself as a seed node. Because the newly spawned pods received new IP addresses, we ended up with a desynced cluster. Output of "nodetool status" after the simultaneous reboot, from host a58343d0-1e3f-4d54-bcdf-9b9b949ca873:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <old ip address> ?          256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  r1
DN  <old ip address> ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <new ip address> 3.57 GiB   256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  rack1

The other two neighbours show a similar status: the node itself is in the UN state with its current IP, and the other nodes are in the DN state with their old IP addresses.
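To illustrate the failure mode described above, here is a minimal Python sketch of a "pick reachable peers, otherwise fall back to yourself" seed choice. It is a simplified model of the behaviour reported here, not the actual logic in run.sh; the function name `choose_seeds`, the `is_reachable` callback, and the IP addresses are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def choose_seeds(
    self_ip: str,
    peers: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> List[str]:
    """Return the reachable peers as seeds, or only self_ip if none respond.

    Hypothetical model of a seed election with a self fallback.
    """
    reachable = [p for p in peers if p != self_ip and is_reachable(p)]
    return reachable or [self_ip]


# Made-up addresses standing in for the three freshly restarted pods.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def nobody_up(ip: str) -> bool:
    # During a simultaneous restart no peer is listening yet.
    return False


# Every node ends up with a seed list containing only itself.
seeds: Dict[str, List[str]] = {ip: choose_seeds(ip, ips, nobody_up) for ip in ips}
```

With a rolling restart, at least one peer would normally answer the reachability check, so the fallback branch would not be taken on every node at once.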

Environment

casskop version: v2.0.2-release
cassandra version: 3.11.10
bootstrap version: 0.1.9
Kubernetes version: v1.18.19 Release

Possible Solution CassKop monitors the hostId map and compares the information from Jolokia with the actual Cassandra pod IPs. If there is a mismatch, the operator could restart one of the nodes to re-trigger the bootstrap. The restarted node would then handshake with the other nodes and update the host IPs.
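The mismatch check could look roughly like the Python sketch below. It assumes the hostId map is available as a plain mapping of endpoint IP to host ID and that the current pod IPs are known as a set; how either is fetched (Jolokia, the Kubernetes API) is left out. The function name `stale_endpoints` and the IP addresses are hypothetical; the host IDs are taken from the output above.

```python
from typing import Dict, Set


def stale_endpoints(host_id_map: Dict[str, str], pod_ips: Set[str]) -> Dict[str, str]:
    """Return the entries of host_id_map whose IP belongs to no current pod.

    host_id_map: endpoint IP -> Cassandra host ID, as seen by one node (assumed shape).
    pod_ips: IPs currently assigned to the Cassandra pods.
    """
    return {ip: host_id for ip, host_id in host_id_map.items() if ip not in pod_ips}


# Made-up IPs: two old addresses still known to the node, plus its own new one.
host_id_map = {
    "10.0.0.11": "7324ebc4-577a-425f-b3de-96faac95a331",
    "10.0.0.12": "67f1d07c-8b13-4482-a2f1-77fa34e90d48",
    "10.0.1.21": "a58343d0-1e3f-4d54-bcdf-9b9b949ca873",
}
pod_ips = {"10.0.1.21", "10.0.1.22", "10.0.1.23"}

stale = stale_endpoints(host_id_map, pod_ips)
needs_restart = bool(stale)  # a non-empty result would be the trigger for a restart
```

Here the two old addresses are reported as stale, which is the condition under which the operator would restart one node.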

fdehay commented 2 years ago

Hello @toha10, thanks for using CassKop! Looking at the issue, it seems to me that a simultaneous restart of all nodes in K8S is not really a production use case. That's the idea behind racks in Cassandra: you are not supposed to restart all nodes in all racks at the same time.

So this issue will be ignored on our side. I am sorry, but we are running very low on resources for this project and we struggle to keep up with other PRs.

If you want, you can submit a PR for this, but I am not sure we will be able to include it. Regards

cscetbon commented 2 years ago

Yes, you should only do a rolling restart if you really want to restart them all. Closing this issue.

Orange-OpenSource / casskop

Cassandra nodes cannot form a cluster after simultaneous restart #384
