apache / kvrocks

Apache Kvrocks is a distributed key value NoSQL database that uses RocksDB as storage engine and is compatible with Redis protocol.
https://kvrocks.apache.org/
Apache License 2.0
3.47k stars 449 forks

(m)TLS replication is broken in 2.9.0 #2490

Open kinoute opened 1 month ago

kinoute commented 1 month ago

Version

2.9.0

Minimal reproduce step

After upgrading Kvrocks from 2.8.0 to 2.9.0, we started getting SSL/TLS errors when connecting a slave to the master. Replication without TLS works fine.

Both master and slave are on 2.9.0. Rolling back the master to 2.8.0 while keeping the replica on 2.9.0 works, so the problem is definitely on the server (master) side.

While both were on 2.9.0, connecting from the slave instance to the master with redis-cli and the same certificates worked, so the certificates are fine:

No errors (on replica instance):

redis-cli -h kvrocks-master \
      -p 6379 \
      --tls \
      --cacert /ca/kvrocks/ca.crt \
      --cert /tls/kvrocks/tls.crt \
      --key /tls/kvrocks/tls.key

Errors (replica instance, see below):

kvrocks -c kvrocks.conf \
      --dir /var/lib/kvrocks \
      --pidfile /var/run/kvrocks/kvrocks.pid \
      --masterauth "xxx" \
      --slaveof "kvrocks-master 6379" \
      --tls-ca-cert-file /ca/kvrocks/ca.crt \
      --tls-key-file /tls/kvrocks/tls.key \
      --tls-cert-file /tls/kvrocks/tls.crt \
      --tls-replication yes \
      --bind 0.0.0.0
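
For reference, the same replica settings could presumably be written in kvrocks.conf instead of being passed as flags. This is only a sketch, assuming each command-line flag above maps to the configuration key of the same name. The values are copied from the command above, and nothing here has been tested against 2.9.0:

```conf
# kvrocks.conf (replica). Assumed equivalent of the command-line flags above.
bind 0.0.0.0
dir /var/lib/kvrocks
pidfile /var/run/kvrocks/kvrocks.pid

# Replication target and credentials
slaveof kvrocks-master 6379
masterauth xxx

# TLS material, used for the replication link when tls-replication is yes
tls-ca-cert-file /ca/kvrocks/ca.crt
tls-cert-file /tls/kvrocks/tls.crt
tls-key-file /tls/kvrocks/tls.key
tls-replication yes
```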

What did you expect to see?

A working (m)TLS replication that does either a psync or a full synchronization.

What did you see instead?

Server (MASTER) :

kvrocks I20240813 08:37:26.913834 121 cmd_replication.cc:60] Slave 100.65.46.20:45098, listening port: 6379, announce ip: 100.65.46.20 asks for synchronization with next sequence: 1 replication id: not supported, and local sequence: 344837857
kvrocks E20240813 08:37:26.918999 121 redis_connection.cc:109] [connection] Going to remove the client: 100.65.46.20:45098, while encounter error: Success, SSL Error: error:0A000126:SSL routines::unexpected eof while reading
kvrocks I20240813 08:37:26.986922 193 cmd_replication.cc:242] [replication] Succeed sending full data file info to 100.65.46.20
kvrocks W20240813 08:37:27.038514 194 cmd_replication.cc:299] [replication] Fail to send file CURRENT to 100.65.46.20, error: Success
kvrocks I20240813 08:37:37.086854 195 cmd_replication.cc:242] [replication] Succeed sending full data file info to 100.65.46.20
kvrocks W20240813 08:37:37.127951 196 cmd_replication.cc:299] [replication] Fail to send file CURRENT to 100.65.46.20, error: Success

Client (REPLICA) :

W20240813 08:00:12.653694 50 replication.cc:935] [fetch] Fail to fetch file 005813.sst, err: fetch file err: read sst file: failed to read from SSL connection: error:00000000:lib(0)::reason(0)
W20240813 08:00:12.655525 49 replication.cc:935] [fetch] Fail to fetch file 009665.sst, err: fetch file err: read sst file: failed to read from SSL connection: error:00000000:lib(0)::reason(0)
W20240813 08:00:12.660212 51 replication.cc:935] [fetch] Fail to fetch file 008792.sst, err: fetch file err: read sst file: failed to read from SSL connection: error:00000000:lib(0)::reason(0)
W20240813 08:00:12.661721 52 replication.cc:935] [fetch] Fail to fetch file 005736.sst, err: fetch file err: read sst file: failed to read from SSL connection: error:00000000:lib(0)::reason(0)

Anything Else?

Is it safe to downgrade to 2.8.0 on instances where I need (m)TLS replication? Could it be due to the switch to Debian?

PragmaTwice commented 1 month ago

Could it be due to the switch to Debian?

Have you tried to build kvrocks in your own environment and see if TLS replication works well?

kinoute commented 1 month ago

Could it be due to the switch to Debian?

Have you tried to build kvrocks in your own environment and see if TLS replication works well?

We use Kvrocks in Kubernetes with the official Docker images.

kinoute commented 1 month ago

I tried the unstable/nightly Docker tag for the master instance and did not get the error. I then rolled back to 2.9.0; the error was gone and did not reappear. Really weird.

I still have some running instances (on other clusters) that I didn't upgrade to nightly, so the problem is still there. Do you want me to run some checks or commands to help figure out why this is happening?

Edit: The problem is still here, never mind.

kinoute commented 1 month ago

I built Kvrocks 2.9.0 with the Docker Alpine image from 2.8: no SSL/TLS replication errors. The image is here: https://hub.docker.com/r/hivacruz/kvrocks-alpine/tags