Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License
11.02k stars 564 forks source link

Keydb restarts after 52 days exactly every time for months #793

Open agosalvez opened 4 months ago

agosalvez commented 4 months ago

Describe the bug

I have a keydb cluster with 3 replicas deployed in kubernetes for more than 1 year. Every once in a while it would restart without knowing why until I discovered a pattern, every 52 days exactly and I don't know why.

To reproduce

Run the cluster and wait 52 days.

Expected behavior

restart keydb

Additional information

I don't have any cron task or anything, I've only added configuration at runtime and no parameters have to do with restart.

@JohnSully could you tell me if keydb has any variable or anything for restarting after 52 days? I thought can be a memory insufficient or something like that so the cluster is deployed in GKE. Since july 2023, I realized that every 52 days restarts, keydb works good after that but I would like to know why.

Thank you!

my server.sh file:

#!/bin/bash
set -euxo pipefail

host="$(hostname)"
port="6379"
replicas=()
for node in {0..2}; do
  if [ "${host}" != "keydb-${node}" ]; then
      replicas+=("--replicaof keydb-${node}.keydb-headless ${port}")
  fi
done
exec keydb-server /etc/keydb/redis.conf \
    --active-replica "yes" \
    --multi-master "yes" \
    --appendonly "no" \
    --bind "0.0.0.0" \
    --port "${port}" \
    --protected-mode "no" \
    --requirepass ${REDIS_PASSWORD} \
    --masterauth ${REDIS_PASSWORD} \
    --server-threads "2" \
    --client-output-buffer-limit replica 1024mb 1024mb 0 \
    --client-output-buffer-limit pubsub 1024mb 1024mb 0 \
    --loadmodule /usr/lib/redis/modules/librejson.so \
    --save "" \
    --save 900 1 \
    --repl-backlog-size 50mb \
    "${replicas[@]}"
JohnSully commented 3 months ago

Hey this is pretty interesting, I think we do deploys much less often than this so I haven't seen it.

Do you have logs from the time just before it restarts?

agosalvez commented 3 months ago

No, because the log contains a lot of text, but it was just a clean signal signal."

I have a custom Docker image on Ubuntu 20.04, with KeyDB 6.3.1 and the Redis reJSON module at version 2.0.11.

Is it possible for the OS to produce this SIGTERM? I've been searching for something related to it on the Internet for the last few days, but haven't found anything. Could it be the reJSON module attached to KeyDB?

Thanks you in advance John!

air3ijai commented 3 months ago

Can you check GKE logs and Pod termination reason?

Which liveness, readiness probes do you use?

agosalvez commented 3 months ago

Yes, of course!

I have a keydb cluster with 3 pods in statefulset.

This is my pod0, I have keydb-0, keydb-1 and keydb-2.

The pods it was restarted similar mode in cascade, first keydb-0, keydb-1 and keydb-2 in this order.

I think maybe it could be cause by SO from container.... I do not know exactly why, I have not read that unix system can be restart by itself. I do not know that to think.

My log from keydb-0:

INFO 2024-03-04T17:55:58.116242977Z 1:27:S 04 Mar 2024 17:55:58.058 1 changes in 900 seconds. Saving... INFO 2024-03-04T17:55:58.116277976Z 1:27:S 04 Mar 2024 17:55:58.094 Background saving started INFO 2024-03-04T17:56:03.682936222Z 1:3903722:S 04 Mar 2024 17:56:03.682 DB saved on disk INFO 2024-03-04T17:56:03.701375413Z 1:3903722:S 04 Mar 2024 17:56:03.701 RDB: 823 MB of memory used by copy-on-write INFO 2024-03-04T17:56:03.771585783Z 1:27:S 04 Mar 2024 17:56:03.771 Background saving terminated with success INFO 2024-03-04T17:56:27.911141759Z 1:signal-handler (1709574987) Received SIGTERM scheduling shutdown... INFO 2024-03-04T17:56:27.978116158Z NOTICE: Detuning locks due to high load per core: 120.28% INFO 2024-03-04T17:56:27.978166246Z 1:27:S 04 Mar 2024 17:56:27.977 # User requested shutdown... INFO 2024-03-04T17:56:27.978175634Z 1:27:S 04 Mar 2024 17:56:27.977 Saving the final RDB snapshot before exiting. ERROR 2024-03-04T17:56:33.112669289Z [resource.labels.containerName: redis-exporter] time="2024-03-04T17:56:33Z" level=error msg="Couldn't connect to redis instance" INFO 2024-03-04T17:56:33.548772770Z 1:27:S 04 Mar 2024 17:56:33.548 DB saved on disk INFO 2024-03-04T17:56:33.575917652Z 1:27:S 04 Mar 2024 17:56:33.575 Removing the pid file. INFO 2024-03-04T17:56:33.730320281Z 1:27:S 04 Mar 2024 17:56:33.730 # KeyDB is now ready to exit, bye bye... ERROR 2024-03-04T17:56:35.263807646Z ++ hostname ERROR 2024-03-04T17:56:35.272187215Z + host=keydb-0 ERROR 2024-03-04T17:56:35.480011077Z + port=6379 ERROR 2024-03-04T17:56:35.480030991Z + replicas=() ERROR 2024-03-04T17:56:35.480037601Z + for node in {0..2} ERROR 2024-03-04T17:56:35.480085599Z + '[' keydb-0 '!=' keydb-0 ']' ERROR 2024-03-04T17:56:35.480093943Z + for node in {0..2} ERROR 2024-03-04T17:56:35.480115716Z + '[' keydb-0 '!=' keydb-1 ']' ERROR 2024-03-04T17:56:35.480122399Z + replicas+=("--replicaof keydb-${node}.keydb-headless ${port}") ERROR 2024-03-04T17:56:35.480127517Z + for node in {0..2} ERROR 2024-03-04T17:56:35.480132612Z + '[' keydb-0 '!=' keydb-2 ']' ERROR 2024-03-04T17:56:35.480137826Z + replicas+=("--replicaof keydb-${node}.keydb-headless ${port}") ERROR 2024-03-04T17:56:35.480145280Z + exec keydb-server /etc/keydb/redis.conf --active-replica yes --multi-master yes --appendonly no --bind 0.0.0.0 --port 6379 --protected-mode no --requirepass bots --masterauth bots --server-threads 2 --client-output-buffer-limit replica 1024mb 1024mb 0 --client-output-buffer-limit pubsub 1024mb 1024mb 0 --loadmodule /usr/lib/redis/modules/librejson.so --save '' --save 900 1 --repl-backlog-size 50mb '--replicaof keydb-1.keydb-headless 6379' '--replicaof keydb-2.keydb-headless 6379' INFO 2024-03-04T17:56:35.722760595Z 1:1:C 04 Mar 2024 17:56:35.722 Notice: "active-replica yes" implies "replica-read-only no" INFO 2024-03-04T17:56:35.722793703Z 1:1:C 04 Mar 2024 17:56:35.722 Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. INFO 2024-03-04T17:56:35.722825503Z 1:1:C 04 Mar 2024 17:56:35.722 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo INFO 2024-03-04T17:56:35.722836370Z 1:1:C 04 Mar 2024 17:56:35.722 # KeyDB version=6.3.1, bits=64, commit=ee16abf0, modified=1, pid=1, just started INFO 2024-03-04T17:56:35.722840820Z 1:1:C 04 Mar 2024 17:56:35.722 # Configuration loaded INFO 2024-03-04T17:56:35.724486510Z 1:1:S 04 Mar 2024 17:56:35.724 monotonic clock: POSIX clockgettime INFO 2024-03-04T17:56:36.255330155Z INFO 2024-03-04T17:56:36.255407292Z INFO 2024-03-04T17:56:36.255415081Z -(+)- INFO 2024-03-04T17:56:36.255420484Z -- / \ -- INFO 2024-03-04T17:56:36.255425835Z -- / \ -- KeyDB 6.3.1 (ee16abf0/1) 64 bit INFO 2024-03-04T17:56:36.255431248Z -- / \ -- INFO 2024-03-04T17:56:36.255437953Z (+) / \ (+) Running in standalone mode INFO 2024-03-04T17:56:36.255444862Z | -- / \ -- | Port: 6379 INFO 2024-03-04T17:56:36.255451976Z | /-- --\ | PID: 1 INFO 2024-03-04T17:56:36.255458258Z | / -(+)- \ | INFO 2024-03-04T17:56:36.255464148Z | / | \ | https://docs.keydb.dev INFO 2024-03-04T17:56:36.255470927Z | / | \ | INFO 2024-03-04T17:56:36.255476851Z | / | \ | INFO 2024-03-04T17:56:36.255482357Z (+) -- -- -- | -- -- -- (+) INFO 2024-03-04T17:56:36.255488019Z -- | -- INFO 2024-03-04T17:56:36.255493364Z -- | _-- INFO 2024-03-04T17:56:36.255499645Z -(+)- KeyDB has now joined Snap! See the announcement at: https://docs.keydb.dev/news INFO 2024-03-04T17:56:36.255509902Z {} INFO 2024-03-04T17:56:36.255532142Z INFO 2024-03-04T17:56:36.255537903Z 1:1:S 04 Mar 2024 17:56:36.255 # Server initialized INFO 2024-03-04T17:56:36.379600093Z 1:1:S 04 Mar 2024 17:56:36.379 version: 20011 git sha: 25ff494 branch: HEAD INFO 2024-03-04T17:56:36.379645662Z 1:1:S 04 Mar 2024 17:56:36.379 Exported RedisJSON_V1 API INFO 2024-03-04T17:56:36.379661280Z 1:1:S 04 Mar 2024 17:56:36.379 Enabled diskless replication INFO 2024-03-04T17:56:36.379670737Z 1:1:S 04 Mar 2024 17:56:36.379 Created new data type 'ReJSON-RL' INFO 2024-03-04T17:56:36.379713853Z 1:1:S 04 Mar 2024 17:56:36.379 Module 'ReJSON' loaded from /usr/lib/redis/modules/librejson.so INFO 2024-03-04T17:56:36.381189135Z 1:1:S 04 Mar 2024 17:56:36.381 Loading RDB produced by version 6.3.1 INFO 2024-03-04T17:56:36.381221014Z 1:1:S 04 Mar 2024 17:56:36.381 RDB age 9 seconds INFO 2024-03-04T17:56:36.381230619Z 1:1:S 04 Mar 2024 17:56:36.381 RDB memory usage when created 575.37 Mb INFO 2024-03-04T17:56:41.462007022Z 1:1:S 04 Mar 2024 17:56:41.461 # Done loading RDB, keys loaded: 0, keys expired: 0. INFO 2024-03-04T17:56:41.462041701Z 1:1:S 04 Mar 2024 17:56:41.461 DB loaded from disk: 5.081 seconds INFO 2024-03-04T17:56:41.462049865Z 1:1:S 04 Mar 2024 17:56:41.461 RDB contains information on 2 masters INFO 2024-03-04T17:56:41.462057316Z 1:1:S 04 Mar 2024 17:56:41.461 Cached master recovered from RDB for keydb-1.keydb-headless:6379 INFO 2024-03-04T17:56:41.462063966Z 1:1:S 04 Mar 2024 17:56:41.461 Cached master recovered from RDB for keydb-2.keydb-headless:6379 INFO 2024-03-04T17:56:41.462245904Z 1:26:S 04 Mar 2024 17:56:41.462 Thread 0 alive. INFO 2024-03-04T17:56:41.462364984Z 1:27:S 04 Mar 2024 17:56:41.462 Thread 1 alive. INFO 2024-03-04T17:56:41.462889891Z 1:26:S 04 Mar 2024 17:56:41.462 Connecting to MASTER keydb-1.keydb-headless:6379 INFO 2024-03-04T17:56:41.464720716Z 1:26:S 04 Mar 2024 17:56:41.464 MASTER <-> REPLICA sync started INFO 2024-03-04T17:56:41.465356779Z NOTICE: Detuning locks due to high load per core: 99.47% INFO 2024-03-04T17:56:41.465380488Z 1:26:S 04 Mar 2024 17:56:41.465 Non blocking connect for SYNC fired the event. INFO 2024-03-04T17:56:41.493757181Z 1:26:S 04 Mar 2024 17:56:41.493 Master replied to PING, replication can continue... INFO 2024-03-04T17:56:41.494408927Z 1:26:S 04 Mar 2024 17:56:41.494 Trying a partial resynchronization (request eb7fb82c11e3eea7aca334890d62e53ceedf554c:1181006434860). INFO 2024-03-04T17:56:41.494846951Z 1:26:S 04 Mar 2024 17:56:41.494 Successful partial resynchronization with master. INFO 2024-03-04T17:56:41.495056690Z 1:26:S 04 Mar 2024 17:56:41.494 MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization. INFO 2024-03-04T17:56:42.474629340Z 1:26:S 04 Mar 2024 17:56:42.474 Connecting to MASTER keydb-2.keydb-headless:6379 INFO 2024-03-04T17:56:42.475481177Z 1:26:S 04 Mar 2024 17:56:42.475 MASTER <-> REPLICA sync started INFO 2024-03-04T17:56:42.476296060Z 1:26:S 04 Mar 2024 17:56:42.476 Non blocking connect for SYNC fired the event. INFO 2024-03-04T17:56:42.476791940Z 1:26:S 04 Mar 2024 17:56:42.476 Master replied to PING, replication can continue... INFO 2024-03-04T17:56:42.477394190Z 1:26:S 04 Mar 2024 17:56:42.477 Trying a partial resynchronization (request 37999384b4c92a7faeb0363896c28e45ff4542f7:1156146762838). INFO 2024-03-04T17:56:42.477686513Z 1:26:S 04 Mar 2024 17:56:42.477 Successful partial resynchronization with master. INFO 2024-03-04T17:56:42.477701352Z 1:26:S 04 Mar 2024 17:56:42.477 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.

air3ijai commented 3 months ago
...
INFO 2024-03-04T17:56:03.771585783Z 1:27:S 04 Mar 2024 17:56:03.771 * Background saving terminated with success
INFO 2024-03-04T17:56:27.911141759Z 1:signal-handler (1709574987) Received SIGTERM scheduling shutdown...
...

Received SIGTERM - so, Pod received a signal to stop and that was probably was done externally by Kubernetes. And just in 24 seconds after backup was finished.

I do remember that we had some issues with the Liveness/Readiness probes when Pod was not able to respond in time and was killed and restarted.

Did you see anything specific to KeyDB Pods in your Kubernetes events?

kubectl get events -n keydb

And they are usually pruned after 1h and need to grabbed to be analyzed later. Probably GKE should provide some logs for such type of events.

agosalvez commented 3 months ago

I didn't see the events when the reset happened. Anyway, the reset occurs every 52 days. I've already had 4 resets every 52 days, that's what surprises me.

I can check the events next time the reset happens. It will be in 30 days.