mykhailosmolynets opened this issue 4 years ago
@JohnSully I don't see the rreplay command in the Redis docs; it seems to be a KeyDB-only feature. What is it used for in KeyDB? I assume it's replicaReplayCommand. I just want to find the root cause of this issue, because it has already affected our production twice. It seems you have already had some issues with it and added special errors for this command: https://github.com/JohnSully/KeyDB/issues/67
From the issue that we had yesterday:
7:13:S 08 Oct 2020 19:35:02.782 # == CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
7:13:S 08 Oct 2020 19:35:02.782 # Latest backlog is: '"2769\r\n\r\n$1\r\n0\r\n$19\r\n1680013474836512770\r\n*5\r\n$7\r\nRREPLAY\r\n$36\r\na1f375d2-b3c9-4801-aaf6-12c99a436c17\r\n$114\r\n*5\r\n$7\r\nRREPLAY\r\n$36\r\nc829a0bd-e794-4185-aad2-0395dfb91d70\r\n$14\r\n*1\r\n$4\r\nPING\r\n\r\n$1\r\n0\r\n$19\r\n1680013475292643328\r\n\r\n$1\r\n0\r\n$19\r\n1680013475292643329\r\n"'
@mykhailosmolynets did you happen to sort this out, and if so, how? We are experiencing a similar failure sporadically.
Same here... would love to see some movement on resolving this issue.
Log Files:
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
Keydb image: eqalpha/keydb:x86_64_v6.2.1
Experiencing the same issue. I'm wondering whether the commands throwing this error are important to replay or not.
Currently, to avoid spamming my logs with useless entries, I'm running KeyDB piped into grep:
keydb-server keydb.conf | grep --line-buffered --invert-match "This server is sending an error to its AOF-loading-client: 'Command must be sent from a master' after processing the command 'rreplay'"
(--line-buffered flushes the buffer at each EOL, otherwise the output may not be immediate when there is low log traffic)
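A minimal sketch of the same workaround, generalized to both rreplay-related messages quoted in this thread (the replica-to-master variant and the AOF-loading-client variant) by matching on their shared suffix; this only filters the log output and does nothing about the underlying issue:

# Drop every log line that mentions processing the 'rreplay' command,
# whichever of the two messages produced it; everything else passes through.
keydb-server keydb.conf 2>&1 | grep --line-buffered --invert-match "after processing the command 'rreplay'"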
Can reproduce this error as well on an empty keydb with multi-master and AOF:
version: "3.1"
services:
  keydb1:
    image: eqalpha/keydb
    command: keydb-server /etc/keydb/keydb.conf --replicaof keydb2 6379
    ports:
      - "6301:6379"
    volumes:
      - ./keydb.conf:/etc/keydb/keydb.conf
  keydb2:
    image: eqalpha/keydb
    command: keydb-server /etc/keydb/keydb.conf --replicaof keydb1 6379
    ports:
      - "6302:6379"
    volumes:
      - ./keydb.conf:/etc/keydb/keydb.conf
keydb.conf:
protected-mode no
daemonize no
appendonly yes
replica-read-only no
multi-master yes
active-replica yes
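A quick sketch, assuming the two files above are saved as docker-compose.yml and keydb.conf in the same directory, for bringing the pair up and watching for the error:

# Start both nodes and follow their logs, keeping only rreplay-related lines.
docker compose up -d
docker compose logs -f keydb1 keydb2 | grep --line-buffered "rreplay"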
On some startups, this error is observed on the first node:
# Server initialized
* Reading RDB preamble from AOF file...
* Loading RDB produced by version 6.3.0
* RDB age 22 seconds
* RDB memory usage when created 2.46 Mb
* RDB has an AOF tail
# Done loading RDB, keys loaded: 0, keys expired: 0.
* Reading the remaining AOF tail...
# == CRITICAL == This server is sending an error to its AOF-loading-client: 'Command must be sent from a master' after processing the command 'rreplay'
* DB loaded from append only file: 0.000 seconds
Is it safe to ignore these "critical" errors?
Hello, this issue also seems to exist on version 6.3.1. I have a multi-master config with AOF "on", and on a restart of the k8s pods it writes a lot of core.xxx files. I think the problem depends on the AOF, because after setting appendonly to "no" it's gone.
I tested around a bit; it looks like an issue with the cached data itself (the table). We are using KeyDB to cache some temporary access keys. After deleting the data (access keys) from the table, the issue is gone and appendonly "yes" seems to be working.
@silverwind: you mentioned that you tested with an empty db? Would you mind testing with appendonly "no"?
Is anyone still experiencing this issue? It would be helpful to understand how to prioritize this.
Based on a short investigation we deduced that the multi-master, active-replica configuration caused it. Rebooting one of these masters caused some sort of sync burst between the nodes. If I remember correctly, for us it was seen in setups with 3 or more nodes, but not in 2-node setups.
We decided that we could live without the multi-master, active-replica setup and moved to a master-slave configuration for stability. In the larger-db use case we reverted back to Redis.
If I remember correctly, for us it was seen in setups with 3 or more nodes, but not in 2-node setups.
I agree with your analysis with one exception: we've only used 2-node clusters until now and we are seeing the same behavior, also with multi-master and active replication. As we've also seen some other weird behavior like data appearing in wrong databases after restore, we might be switching back to Redis as well...
Hi, is there any updates?
I got the same error with 3 nodes running as multi-master and without an append-only file. I restarted one of them and got this:
138:199:S 03 May 2023 10:28:39.573 * MASTER <-> REPLICA sync: Loading DB in memory
138:199:S 03 May 2023 10:28:39.581 * Loading RDB produced by version 6.2.2
138:199:S 03 May 2023 10:28:39.581 * RDB age 12 seconds
138:199:S 03 May 2023 10:28:39.581 * RDB memory usage when created 314.15 Mb
138:199:S 03 May 2023 10:28:39.652 # == WARNING == This replica is rejecting a command from its master: '-LOADING KeyDB is loading the dataset in memory' after processing the command 'KEYDB.MVCCRESTORE'
138:199:S 03 May 2023 10:28:39.652 # Latest backlog is: '"J9.eyJsYX....."'
138:199:S 03 May 2023 10:28:39.652 # Command didn't execute:
138:199:S 03 May 2023 10:28:39.653 # == CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
and in the end
138:199:S 03 May 2023 10:28:46.430 # Done loading RDB, keys loaded: 614882, keys expired: 0.
138:199:S 03 May 2023 10:28:46.449 * MASTER <-> REPLICA sync: Finished with success
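The -LOADING rejection suggests the replayed command arrived while the restarted node was still loading its dataset. A small diagnostic sketch, assuming keydb-cli access on the restarted node, to confirm that state while the errors are being logged (loading is a standard INFO persistence field):

# "loading:1" while the dataset is still being loaded, "loading:0" once it has finished
keydb-cli info persistence | grep ^loading:
# replication state as seen by the restarted node
keydb-cli info replication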
keydb.conf
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 8192
timeout 0
tcp-keepalive 60
daemonize no
supervised no
pidfile /var/run/redis_6379.pid
loglevel notice
logfile "/var/log/container/keydb.log"
crash-log-enabled no
crash-memcheck-enabled no
databases 16
always-show-logo no
set-proc-title no
enable-motd no
proc-title-template "{title} {listen-addr} {server-mode}"
save ""
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes
dbfilename dump_nc.rdb
rdb-del-sync-files no
dir /keydb
replicaof 10.22.0.203 6379
replicaof 10.22.1.234 6379
replica-serve-stale-data yes
replica-read-only no
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-diskless-load disabled
repl-timeout 600
repl-disable-tcp-nodelay no
repl-backlog-size 100mb
repl-backlog-ttl 3600
replica-priority 100
acllog-max-len 128
maxclients 10000
maxmemory 0
maxmemory-policy noeviction
lazyfree-lazy-eviction no
lazyfree-lazy-expire yes
lazyfree-lazy-server-del no
replica-lazy-flush no
lazyfree-lazy-user-del yes
lazyfree-lazy-user-flush no
oom-score-adj no
oom-score-adj-values 0 200 800
disable-thp yes
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 50
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 1024mb 512mb 120
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
jemalloc-bg-thread yes
server-threads 2
server-thread-affinity false
active-replica yes
multi-master yes
As far as I can see, something strange happens when KeyDB starts with multiple replicas, e.g. in my config I have 2 replicaof entries:
replicaof 10.22.0.203 6379
replicaof 10.22.1.234 6379
but KeyDB is using only one:
keydb-cli config get replicaof
1) "replicaof"
2) "10.22.1.234 6379"
and INFO also gives me 2 replicas:
keydb-cli info replication | grep slave
slave_read_repl_offset:3374103742
slave_repl_offset:3374103742
slave_priority:100
slave_read_only:0
connected_slaves:2
slave0:ip=10.22.1.234,port=6379,state=online,offset=2615368929,lag=1
slave1:ip=10.22.0.128,port=6379,state=online,offset=2615368929,lag=0
And I see a problem with the missing replica 10.22.0.203; if I manually add it, the sync completes without problems:
/# keydb-cli replicaof 10.22.0.128 6379
OK
/# keydb-cli config get replicaof
1) "replicaof"
2) "10.22.0.128 6379"
3) "replicaof"
4) "10.22.1.234 6379"
Maybe this helps.
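Based on that observation, a hedged workaround sketch: after startup, compare the masters KeyDB actually applied against the ones you expect and re-add any that are missing. The two addresses below are just the ones from the config above, not anything universal, and the loop relies on REPLICAOF adding an extra master in active-replica/multi-master mode, as the output above shows:

# Re-attach any expected master that did not survive startup.
for master in "10.22.0.203 6379" "10.22.1.234 6379"; do
  # $master is intentionally unquoted so that host and port become separate arguments
  keydb-cli config get replicaof | grep -q "$master" || keydb-cli replicaof $master
done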
Any ideas? Something new: if I configure the replicas via keydb-server command-line parameters instead of the keydb.conf file, they do see their replicas, e.g.:
exec keydb-server /etc/keydb.conf --multi-master yes --active-replica yes --replicaof 10.7.10.74 6380 --replicaof 10.22.0.128 6380 >> /var/log/keydb.log 2>&1
..
:/#keydb-cli role
1) 1) "active-replica"
2) "10.22.0.128"
3) (integer) 6380
4) "connected"
2) 1) "active-replica"
2) "10.7.10.74"
3) (integer) 6380
4) "connected"
:/#keydb-cli config get replicaof
1) "replicaof"
2) "10.22.0.128 6380"
3) "replicaof"
4) "10.7.10.74 6380"
Ran into this same issue on 6.3.4 using the Helm chart with multi-master and 3 nodes. Trying to find a solution now. Reducing to 2 nodes for now to see if that fixes it.
UPDATE: Everything is stable in production with 2 nodes, logs are all clean now.
@Jayd603 were you able to find a root cause for this? Still happy with the 2-node setup?
Nope, the same version with 2 nodes is still in production and has been solid. I did read somewhere on here that the KeyDB/Snapchat team isn't interested in working on further multi-master support because they don't use it internally, which is a bummer. I hope that changes or that wherever I read that is incorrect; maybe someone else can chime in.
Thanks for the feedback @Jayd603 . Have you considered alternatives like https://github.com/dragonflydb/dragonfly?
Describe the bug
We have 2 KeyDB pods in active-replica mode. For some reason one of the pods started sending errors:
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
At that moment this pod was near its maxmemory limit (in our case 2 GB) and started to return OOM errors for any new write. However, we have the volatile-lru eviction policy and it didn't clean up memory. As a result it stayed near maxmemory for around 4 hours, until we noticed the issue. The most interesting thing is that this pod had just 4 keys and there was nothing to clean, while the second, healthy pod had almost 300 keys and used only around 700 MB of memory. On the other pod there were no errors in the logs at all. After a pod restart it went back to normal and started using around 150 MB after syncing with the second pod.
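For anyone hitting the same state, a small diagnostic sketch (assuming keydb-cli access inside the pod) to confirm whether the node is really pinned at maxmemory and whether eviction is doing anything; the field names are the standard INFO fields:

# memory usage vs. the configured limit and policy
keydb-cli info memory | grep -E "^(used_memory_human|maxmemory_human|maxmemory_policy):"
# evictions and key count
keydb-cli info stats | grep ^evicted_keys:
keydb-cli dbsize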
Replication config seems to be ok as well:
Log Files
== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
Keydb image: eqalpha/keydb:x86_64_v6.0.8