Open agosalvez opened 9 months ago
Hey, this is pretty interesting. I think we deploy much less often than this, so I haven't seen it.
Do you have logs from the time just before it restarts?
No, because the log contains a lot of text, but it was just a clean shutdown signal.
I have a custom Docker image based on Ubuntu 20.04, with KeyDB 6.3.1 and the RedisJSON (ReJSON) module at version 2.0.11.
Is it possible for the OS to produce this SIGTERM? I've been searching the Internet for something related for the last few days, but haven't found anything. Could it be the ReJSON module loaded into KeyDB?
Thank you in advance, John!
Can you check GKE logs and Pod termination reason?
Which liveness and readiness probes do you use?
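To answer the first question, something like the following can surface why Kubernetes last killed the container. This is a hedged sketch: it assumes kubectl is configured against the GKE cluster and uses the pod and namespace names mentioned in this thread (keydb-0 in namespace keydb).

```shell
# lastState.terminated records why the previous container instance died
# (reason such as Completed/Error/OOMKilled, plus the exit code).
kubectl -n keydb get pod keydb-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'

# The same information in human-readable form:
kubectl -n keydb describe pod keydb-0 | grep -A 6 'Last State'
```

An exit code of 137 with reason OOMKilled would point at memory limits; reason Completed with exit code 0 matches the clean SIGTERM shutdown seen in the log.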
Yes, of course!
I have a KeyDB cluster with 3 pods in a StatefulSet: keydb-0, keydb-1 and keydb-2.
The log below is from pod keydb-0.
The pods restarted in the same way, in a cascade: first keydb-0, then keydb-1, then keydb-2, in that order.
I wonder if it could be caused by the OS inside the container, but I don't know exactly why; I have never read of a Unix system restarting by itself. I don't know what to think.
My log from keydb-0:
INFO 2024-03-04T17:55:58.116242977Z 1:27:S 04 Mar 2024 17:55:58.058 1 changes in 900 seconds. Saving...
INFO 2024-03-04T17:55:58.116277976Z 1:27:S 04 Mar 2024 17:55:58.094 Background saving started
INFO 2024-03-04T17:56:03.682936222Z 1:3903722:S 04 Mar 2024 17:56:03.682 DB saved on disk
INFO 2024-03-04T17:56:03.701375413Z 1:3903722:S 04 Mar 2024 17:56:03.701 RDB: 823 MB of memory used by copy-on-write
INFO 2024-03-04T17:56:03.771585783Z 1:27:S 04 Mar 2024 17:56:03.771 Background saving terminated with success
INFO 2024-03-04T17:56:27.911141759Z 1:signal-handler (1709574987) Received SIGTERM scheduling shutdown...
INFO 2024-03-04T17:56:27.978116158Z NOTICE: Detuning locks due to high load per core: 120.28%
INFO 2024-03-04T17:56:27.978166246Z 1:27:S 04 Mar 2024 17:56:27.977 # User requested shutdown...
INFO 2024-03-04T17:56:27.978175634Z 1:27:S 04 Mar 2024 17:56:27.977 Saving the final RDB snapshot before exiting.
ERROR 2024-03-04T17:56:33.112669289Z [resource.labels.containerName: redis-exporter] time="2024-03-04T17:56:33Z" level=error msg="Couldn't connect to redis instance"
INFO 2024-03-04T17:56:33.548772770Z 1:27:S 04 Mar 2024 17:56:33.548 DB saved on disk
INFO 2024-03-04T17:56:33.575917652Z 1:27:S 04 Mar 2024 17:56:33.575 Removing the pid file.
INFO 2024-03-04T17:56:33.730320281Z 1:27:S 04 Mar 2024 17:56:33.730 # KeyDB is now ready to exit, bye bye...
ERROR 2024-03-04T17:56:35.263807646Z ++ hostname
ERROR 2024-03-04T17:56:35.272187215Z + host=keydb-0
ERROR 2024-03-04T17:56:35.480011077Z + port=6379
ERROR 2024-03-04T17:56:35.480030991Z + replicas=()
ERROR 2024-03-04T17:56:35.480037601Z + for node in {0..2}
ERROR 2024-03-04T17:56:35.480085599Z + '[' keydb-0 '!=' keydb-0 ']'
ERROR 2024-03-04T17:56:35.480093943Z + for node in {0..2}
ERROR 2024-03-04T17:56:35.480115716Z + '[' keydb-0 '!=' keydb-1 ']'
ERROR 2024-03-04T17:56:35.480122399Z + replicas+=("--replicaof keydb-${node}.keydb-headless ${port}")
ERROR 2024-03-04T17:56:35.480127517Z + for node in {0..2}
ERROR 2024-03-04T17:56:35.480132612Z + '[' keydb-0 '!=' keydb-2 ']'
ERROR 2024-03-04T17:56:35.480137826Z + replicas+=("--replicaof keydb-${node}.keydb-headless ${port}")
ERROR 2024-03-04T17:56:35.480145280Z + exec keydb-server /etc/keydb/redis.conf --active-replica yes --multi-master yes --appendonly no --bind 0.0.0.0 --port 6379 --protected-mode no --requirepass bots --masterauth bots --server-threads 2 --client-output-buffer-limit replica 1024mb 1024mb 0 --client-output-buffer-limit pubsub 1024mb 1024mb 0 --loadmodule /usr/lib/redis/modules/librejson.so --save '' --save 900 1 --repl-backlog-size 50mb '--replicaof keydb-1.keydb-headless 6379' '--replicaof keydb-2.keydb-headless 6379'
INFO 2024-03-04T17:56:35.722760595Z 1:1:C 04 Mar 2024 17:56:35.722 Notice: "active-replica yes" implies "replica-read-only no"
INFO 2024-03-04T17:56:35.722793703Z 1:1:C 04 Mar 2024 17:56:35.722 Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
INFO 2024-03-04T17:56:35.722825503Z 1:1:C 04 Mar 2024 17:56:35.722 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo
INFO 2024-03-04T17:56:35.722836370Z 1:1:C 04 Mar 2024 17:56:35.722 # KeyDB version=6.3.1, bits=64, commit=ee16abf0, modified=1, pid=1, just started
INFO 2024-03-04T17:56:35.722840820Z 1:1:C 04 Mar 2024 17:56:35.722 # Configuration loaded
INFO 2024-03-04T17:56:35.724486510Z 1:1:S 04 Mar 2024 17:56:35.724 monotonic clock: POSIX clockgettime
INFO 2024-03-04T17:56:36.255330155Z
INFO 2024-03-04T17:56:36.255407292Z
INFO 2024-03-04T17:56:36.255415081Z -(+)-
INFO 2024-03-04T17:56:36.255420484Z -- / \ --
INFO 2024-03-04T17:56:36.255425835Z -- / \ -- KeyDB 6.3.1 (ee16abf0/1) 64 bit
INFO 2024-03-04T17:56:36.255431248Z -- / \ --
INFO 2024-03-04T17:56:36.255437953Z (+) / \ (+) Running in standalone mode
INFO 2024-03-04T17:56:36.255444862Z | -- / \ -- | Port: 6379
INFO 2024-03-04T17:56:36.255451976Z | /-- --\ | PID: 1
INFO 2024-03-04T17:56:36.255458258Z | / -(+)- \ |
INFO 2024-03-04T17:56:36.255464148Z | / | \ | https://docs.keydb.dev
INFO 2024-03-04T17:56:36.255470927Z | / | \ |
INFO 2024-03-04T17:56:36.255476851Z | / | \ |
INFO 2024-03-04T17:56:36.255482357Z (+) -- -- -- | -- -- -- (+)
INFO 2024-03-04T17:56:36.255488019Z -- | --
INFO 2024-03-04T17:56:36.255493364Z -- | _--
INFO 2024-03-04T17:56:36.255499645Z -(+)- KeyDB has now joined Snap! See the announcement at: https://docs.keydb.dev/news
INFO 2024-03-04T17:56:36.255509902Z {}
INFO 2024-03-04T17:56:36.255532142Z
INFO 2024-03-04T17:56:36.255537903Z 1:1:S 04 Mar 2024 17:56:36.255 # Server initialized
INFO 2024-03-04T17:56:36.379600093Z 1:1:S 04 Mar 2024 17:56:36.379
...
INFO 2024-03-04T17:56:03.771585783Z 1:27:S 04 Mar 2024 17:56:03.771 * Background saving terminated with success
INFO 2024-03-04T17:56:27.911141759Z 1:signal-handler (1709574987) Received SIGTERM scheduling shutdown...
...
Received SIGTERM
- so the Pod received a signal to stop, which was probably sent externally by Kubernetes, and only 24 seconds after the backup finished.
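The 24-second gap can be confirmed directly from the two timestamps quoted above (this assumes GNU date for the `-d` flag):

```shell
# "Background saving terminated with success" vs "Received SIGTERM"
t_save=$(date -u -d '2024-03-04T17:56:03Z' +%s)
t_term=$(date -u -d '2024-03-04T17:56:27Z' +%s)
echo "$((t_term - t_save))s between backup completion and SIGTERM"
# → 24s between backup completion and SIGTERM
```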
I do remember that we had some issues with the liveness/readiness probes, where a Pod was not able to respond in time and was killed and restarted.
Did you see anything specific to KeyDB Pods in your Kubernetes events?
kubectl get events -n keydb
And they are usually pruned after about 1 hour, so they need to be grabbed promptly to be analyzed later. GKE should probably provide some logs for this type of event.
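Since events are pruned after roughly an hour (the API server's default event TTL), one workaround is to stream them into a file as they happen. A sketch, assuming kubectl access to the cluster and the "keydb" namespace used in this thread:

```shell
# Append new Pod events to a log file as they occur, so they survive
# the ~1h event TTL; leave this running in the background until the
# next restart window.
kubectl -n keydb get events --watch-only \
  --field-selector involvedObject.kind=Pod >> keydb-events.log &
```

In GKE specifically, Cloud Logging may also retain these events longer than the cluster itself does.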
I didn't see the events when the restart happened. Anyway, the restart occurs exactly every 52 days; I've already had 4 restarts, each 52 days apart, and that's what surprises me.
I can check the events the next time the restart happens. That will be in 30 days.
Describe the bug
I have had a KeyDB cluster with 3 replicas deployed in Kubernetes for more than 1 year. Every once in a while it would restart without me knowing why, until I discovered a pattern: it restarts exactly every 52 days, and I don't know why.
To reproduce
Run the cluster and wait 52 days.
Expected behavior
KeyDB should keep running; instead, it restarts every 52 days.
Additional information
I don't have any cron task or anything like that; I've only added configuration at runtime, and none of the parameters have anything to do with restarting.
@JohnSully, could you tell me if KeyDB has any variable or setting that could cause a restart after 52 days? I thought it could be insufficient memory or something like that, since the cluster is deployed in GKE. Since July 2023 I have noticed that it restarts every 52 days; KeyDB works fine afterwards, but I would like to know why.
Thank you!
My server.sh file: