Ericsson / ecchronos

Ericsson distributed repair scheduler for Apache Cassandra
Apache License 2.0

First node requires restart after Cassandra cluster is created, when using TLS #185

Closed (etedpet closed this issue 3 years ago)

etedpet commented 3 years ago

We are running Cassandra in a Kubernetes cluster where the Cassandra process and the ecChronos process run in the same Pod, but in two different containers. When creating the cluster, one Pod (a Cassandra node plus the ecChronos instance scheduling repairs on that node) is started at a time, so the first ecChronos instance is started "together with" the first C* node.

When deploying with TLS enabled on the CQL interface (from security.yml):

cql:
  credentials:
    enabled: false
  tls:
    enabled: true
    algorithm: SunX509
    cipher_suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    keystore: /usr/share/ecchronos/tls/keystore.jks
    keystore_password: hTw3cpryY28Ihy7yNXQxVpqpww1PhpfQ50LI8nawzIQivIxRUCc4I2VbmqukIDjfOmIWtHMArFTwlW5HBmoZN3YmBcAywQ5di5BC
    protocol: TLSv1.2
    require_endpoint_verification: true
    store_type: JKS
    truststore: /usr/share/ecchronos/tls/truststore.jks
    truststore_password: naEAKG9onzZcIBhmAYkI5g5mGLP0OnkqjdQrjaQUjQwUhSonTVQcqesOgWXTJV43CGefg1rGQeVbtn7wGLj2xEmIVLs2LHR4F7sN
jmx:
  credentials:
    enabled: false
  tls:
    enabled: false

The first ecChronos instance, connected to the first C* node, does not work properly: it never picks up any keyspaces/tables created after ecChronos has started (in practice, all tables of interest).

But the second node that comes up (C* and ecChronos) works as expected!

The issue seems to be that the Cassandra driver, for some reason, does not recognize schema changes. The logs below show how the driver reacts to a keyspace/table (ks.tb1) being created after the C* cluster is up (all nodes have joined).

Startup logs from the first (non-working) ecChronos node:

[com.datastax.driver.core.Cluster] Received event EVENT CREATED KEYSPACE ks, scheduling delivery
[com.datastax.driver.core.Cluster] Received event EVENT CREATED TABLE ks.tb1, scheduling delivery
...

Startup logs from the second (working) ecChronos node:

[com.datastax.driver.core.Cluster] Received event EVENT CREATED KEYSPACE ks, scheduling delivery
[com.datastax.driver.core.Cluster] Received event EVENT CREATED TABLE ks.tb1, scheduling delivery
[com.datastax.driver.core.ControlConnection] [Control connection] Refreshing schema for ks
...

On the second node this triggers the ecChronos DefaultRepairConfigurationProvider#onTableAdded callback, which handles the new table(s).

This does not happen on the first node for some reason, and since ecChronos on the first node never learns about the "new" tables, it cannot manage them.
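
For reference, below is a minimal sketch of the driver 3.x schema listener mechanism that this callback is based on. The class name and registration code are illustrative only (this is not the actual DefaultRepairConfigurationProvider); the point is that the callback is driven by the driver's schema metadata refresh, which matches the logs above: only the node that logs "Refreshing schema for ks" gets its listener invoked.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SchemaChangeListenerBase;
import com.datastax.driver.core.TableMetadata;

// Illustration only: a listener that logs newly created tables.
public class LoggingSchemaListener extends SchemaChangeListenerBase
{
    @Override
    public void onTableAdded(TableMetadata table)
    {
        System.out.println("Table added: " + table.getKeyspace().getName() + "." + table.getName());
    }

    public static void main(String[] args)
    {
        // Assumed local contact point, matching the "ecChronos talks to its own
        // Cassandra container" setup described in this issue.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        cluster.register(new LoggingSchemaListener());
        cluster.connect();
        // onTableAdded fires when the driver's schema metadata is refreshed after
        // a CREATED TABLE event; receiving the event alone is not enough.
    }
}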

The only way to work around this issue (on the first node) is to restart it (restart the container or Pod in the k8s case).

etedpet commented 3 years ago

A workaround for this issue is to disable endpoint verification:

cql:
  tls:
    require_endpoint_verification: false
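
Note that this only skips hostname checking; the server certificate is still validated against the truststore. As far as I understand (an assumption, I have not checked how ecChronos wires this internally), require_endpoint_verification corresponds to standard JSSE endpoint identification, roughly like this:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Sketch of JSSE endpoint identification, the mechanism that
// require_endpoint_verification presumably controls (assumption, not taken
// from the ecChronos source).
public final class EndpointVerificationSketch
{
    public static SSLEngine newClientEngine(SSLContext context, String peerHost, int peerPort, boolean verifyEndpoint)
    {
        SSLEngine engine = context.createSSLEngine(peerHost, peerPort);
        engine.setUseClientMode(true);

        if (verifyEndpoint)
        {
            SSLParameters parameters = engine.getSSLParameters();
            // "HTTPS" makes the handshake match the peer host/IP against the
            // certificate's SubjectAlternativeName entries.
            parameters.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(parameters);
        }

        return engine;
    }
}
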
itskarlsson commented 3 years ago

The Java driver log showed SSL errors (hostname verification failed), so each node could only connect to its local Cassandra node. I suspect the issue lies there, which is further supported by the fact that this cannot be reproduced without TLS or without require_endpoint_verification.
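
If the certificates are the root cause, each Cassandra node's certificate needs a SubjectAlternativeName entry matching the address ecChronos uses to reach that node, not just the local one. Below is a small standalone helper (name and code are illustrative, plain JDK APIs, not part of ecChronos) to check what a keystore actually contains:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.util.Enumeration;

// Prints the SubjectAlternativeName entries of every certificate in a JKS
// keystore, e.g. the keystore referenced in security.yml.
public final class PrintSubjectAltNames
{
    public static void main(String[] args) throws Exception
    {
        String path = args[0];           // e.g. /usr/share/ecchronos/tls/keystore.jks
        char[] password = args[1].toCharArray();

        KeyStore keyStore = KeyStore.getInstance("JKS");
        try (InputStream in = new FileInputStream(path))
        {
            keyStore.load(in, password);
        }

        Enumeration<String> aliases = keyStore.aliases();
        while (aliases.hasMoreElements())
        {
            String alias = aliases.nextElement();
            Certificate certificate = keyStore.getCertificate(alias);
            if (certificate instanceof X509Certificate)
            {
                // getSubjectAlternativeNames() returns null if the extension is absent
                System.out.println(alias + " -> " + ((X509Certificate) certificate).getSubjectAlternativeNames());
            }
        }
    }
}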

I also could not reproduce this issue on my local CCM cluster with certificates enabled.

If you see this issue again without these SSL issues, please do not hesitate to reopen the ticket.