Cassandra nodes becomes unreachable to each other

behko commented 5 years ago

I have 3 nodes of elassandra running in docker containers.

Containers created like:

Host 10.0.0.1 : docker run --name elassandra-node-1 --net=host -e CASSANDRA_SEEDS="10.0.0.1" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest

Host 10.0.0.2 : docker run --name elassandra-node-2 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest

Host 10.0.0.3 : docker run --name elassandra-node-3 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2,10.0.0.3" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest

The cluster worked fine for a couple of days after it was created; Elasticsearch and Cassandra both behaved perfectly.

Now, however, all Cassandra nodes have become unreachable to each other. nodetool status on every node looks like this:

Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
DN  10.0.0.3  11.95 GiB  8       100.0%            7652f66e-194e-4886-ac10-0fc21ac8afeb  r1
DN  10.0.0.2  11.92 GiB  8       100.0%            b91fa129-1dd0-4cf8-be96-9c06b23daac6  r1
UN  10.0.0.1  11.9 GiB   8       100.0%            5c1afcff-b0aa-4985-a3cc-7f932056c08f  r1

The UN entry is the current host, 10.0.0.1. The other nodes show the same picture.

nodetool describecluster on 10.0.0.1 shows:
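For anyone comparing this output across hosts, a minimal sketch of pulling out just the addresses marked DN might help. It assumes the column layout that nodetool status prints (status/state code first, address second); the helper name list_down_nodes is hypothetical, and the sample input is copied from this report.

```shell
#!/bin/sh
# list_down_nodes (hypothetical helper): read `nodetool status` output on
# stdin and print the address of every node whose first column is DN
# (Down / Normal).
list_down_nodes() {
    awk '$1 == "DN" { print $2 }'
}

# Sample input taken from this report. On a live node one would instead run:
#   nodetool status | list_down_nodes
list_down_nodes <<'EOF'
Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
DN  10.0.0.3  11.95 GiB  8       100.0%            7652f66e-194e-4886-ac10-0fc21ac8afeb  r1
DN  10.0.0.2  11.92 GiB  8       100.0%            b91fa129-1dd0-4cf8-be96-9c06b23daac6  r1
UN  10.0.0.1  11.9 GiB   8       100.0%            5c1afcff-b0aa-4985-a3cc-7f932056c08f  r1
EOF
```

Running the same filter on each host makes it easy to confirm that every node sees only itself as up.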

Cluster Information:
    Name: BD Storage
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        24fa5e55-3935-3c0e-9808-99ce502fe98d: [10.0.0.1]
        UNREACHABLE: [10.0.0.2,10.0.0.3]

When I attach to the first node, it only repeats these messages:

2018-12-09 07:47:32,927 WARN [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager.setupDefaultRole(CassandraRoleManager.java:361) CassandraRoleManager skipped default role setup: some nodes were not ready
2018-12-09 07:47:32,927 INFO [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager$4.run(CassandraRoleManager.java:400) Setup task failed with error, rescheduling
2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.2] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.2
2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.3] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.3

After a while, when a node is restarted:

2018-12-09 07:52:21,972 WARN [MigrationStage:1] org.apache.cassandra.service.MigrationTask.runMayThrow(MigrationTask.java:67) Can't send schema pull request: node /10.0.0.2 is down.

Tried so far:
- Restarting all containers at the same time
- Restarting all containers one after another
- Restarting Cassandra inside each container with: service cassandra restart
- Running nodetool disablegossip, then nodetool enablegossip
- Running nodetool repair, which fails with: Repair command #1 failed with error Endpoint not alive: /10.0.0.2

It seems that all node schemas are different, but I still don't understand why the nodes are marked as down to each other.

wglambert commented 5 years ago

Could you post your docker-compose.yml?

This looks like your issue if that 10.0.0.* address is from the overlay network: https://github.com/docker-library/cassandra/issues/168 and also https://github.com/docker-library/cassandra/issues/169

wglambert commented 5 years ago

I think you need a -e CASSANDRA_BROADCAST_ADDRESS=10.0.0.*

Under the section "For separate machines (ie, two VMs ..." https://github.com/docker-library/docs/tree/master/cassandra#make-a-cluster
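Applied to the three hosts in this report, that suggestion could look roughly like the sketch below. It is a dry run that only prints the commands so they can be reviewed before being run on each host. It assumes strapdata/elassandra honours CASSANDRA_BROADCAST_ADDRESS the same way the official cassandra image does, which is not confirmed in this thread. The helper name print_run_cmd is hypothetical; the seed lists and other values are copied unchanged from the original commands.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one docker run command per host.
# Assumption: the image reads CASSANDRA_BROADCAST_ADDRESS like the official
# cassandra image; this is unverified for strapdata/elassandra.

print_run_cmd() {
    # $1 = node number, $2 = this host's own routable IP, $3 = seed list
    printf 'docker run --name elassandra-node-%s --net=host' "$1"
    printf ' -e CASSANDRA_BROADCAST_ADDRESS="%s"' "$2"
    printf ' -e CASSANDRA_SEEDS="%s"' "$3"
    printf ' -e CASSANDRA_CLUSTER_NAME="BD Storage"'
    printf ' -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1"'
    printf ' -d strapdata/elassandra:latest\n'
}

# Each node broadcasts its own address, never a wildcard such as 10.0.0.*
print_run_cmd 1 10.0.0.1 "10.0.0.1"
print_run_cmd 2 10.0.0.2 "10.0.0.1,10.0.0.2"
print_run_cmd 3 10.0.0.3 "10.0.0.1,10.0.0.2,10.0.0.3"
```

The only change from the original commands is the extra -e CASSANDRA_BROADCAST_ADDRESS option, set per host to that host's own IP.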

tianon commented 5 years ago

I don't really see anything we can change in the image to make this easier, unfortunately. The best I can recommend from here is to try the Docker Community Forums, the Docker Community Slack, or Stack Overflow for further help setting up and configuring a cluster.

tianon commented 5 years ago

(Additionally, strapdata/elassandra:latest is not this image.)

docker-library / cassandra

Cassandra nodes becomes unreachable to each other #171
