During a rolling upgrade from 24.8.5.115 to 24.9.x (or higher) in a cluster of 3 ClickHouse Keeper instances, the first node upgraded to the new version (24.9.x) fails to rejoin the cluster and logs errors about Raft CRC mismatches and "wrong version number (SSL routines)". See the error logs at the end of this issue.
We were able to reproduce this with all combinations of these versions:
New version (the node that gets upgraded): v24.9.1.3278-stable, v24.9.2.42-stable, v24.9.3.128-stable, v24.10.3.21-stable, 24.11.1.2557
Old version (the rest of the cluster): v24.8.5.115-lts, v24.8.7.41-lts
Describe what's wrong
The issue can be reproduced with this simple Docker setup.
Raft config:
The full configuration can be found in the demo project.
Does it reproduce on the most recent release?
Yes, we also tested this for version 24.11.1.2557 (for the upgrading node, against a 24.8 cluster).
Expected behavior
We expect the upgraded node to rejoin the Keeper cluster without any of the Raft errors being logged.
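One way to check whether a node has actually rejoined is to ask each Keeper instance for its role with the `mntr` four-letter-word command. The following is a minimal sketch, not part of the original reproduction. It assumes the client port is 9181 and speaks plain TCP (not TLS), that `mntr` is permitted by the Keeper four-letter-word allow list, and that the hostnames match the test project. The helper names `parse_mntr`, `four_letter_word`, and `cluster_roles` are hypothetical.

```python
import socket


def parse_mntr(text: str) -> dict:
    """Parse whitespace-separated `key value` lines from an `mntr` reply."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of tabs/spaces
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def four_letter_word(host: str, port: int, command: str = "mntr",
                     timeout: float = 5.0) -> str:
    """Send a four-letter-word command over plain TCP and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def cluster_roles(hosts, port: int = 9181) -> dict:
    """Map each host to its reported zk_server_state, or to an error note."""
    roles = {}
    for host in hosts:
        try:
            stats = parse_mntr(four_letter_word(host, port))
            # A reply without zk_server_state is reported as "unknown".
            roles[host] = stats.get("zk_server_state", "unknown")
        except OSError as exc:  # covers DNS failures, refusals, and timeouts
            roles[host] = f"unreachable ({exc})"
    return roles


# Example call (hostnames taken from the test project, port assumed):
# print(cluster_roles(["clickhouse-keeper1", "clickhouse-keeper2", "clickhouse-keeper3"]))
```

In a healthy three-node cluster one would expect one host to report `leader` and the other two `follower`; a node that reports anything else after the upgrade has likely not rejoined the quorum.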
Error message and/or stacktrace
On the node that has the new version (clickhouse-keeper3 in the test project):
On the other nodes with the old versions (clickhouse-keeper1 and clickhouse-keeper2 in the test project):