Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

Keydb replica is sending an error to its master #227

Open mykhailosmolynets opened 4 years ago

mykhailosmolynets commented 4 years ago

Describe the bug: We have 2 KeyDB pods in active-replica mode. For some reason, one of the pods started sending errors:

== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'

At that moment the pod was near its maxmemory limit (2 GB in our case) and started returning OOM errors for every new write. However, we have the volatile-lru eviction policy and it did not free up any memory. As a result the pod stayed near maxmemory for around 4 hours, until we noticed the issue. The strangest part is that this pod had just 4 keys, so there was nothing to evict, while the second pod held almost 300 keys and used only around 700 MB of memory. There were no errors at all in the other pod's logs. After a restart the pod went back to normal and used around 150 MB after syncing with the second pod.

# Memory
used_memory:2319649624
used_memory_human:2.16G
used_memory_rss:2144624640
used_memory_rss_human:2.00G
used_memory_peak:2323309320
used_memory_peak_human:2.16G
used_memory_peak_perc:99.84%
used_memory_overhead:10226950
used_memory_startup:9109968
used_memory_dataset:2309422674
used_memory_dataset_perc:99.95%
allocator_allocated:2320237592
allocator_active:2333794304
allocator_resident:2410905600
total_system_memory:67557654528
total_system_memory_human:62.92G
used_memory_lua:44032
used_memory_lua_human:43.00K
used_memory_scripts:200
used_memory_scripts_human:200B
number_of_cached_scripts:1
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.01
allocator_frag_bytes:13556712
allocator_rss_ratio:1.03
allocator_rss_bytes:77111296
rss_overhead_ratio:0.89
rss_overhead_bytes:-266280960
mem_fragmentation_ratio:0.92
mem_fragmentation_bytes:-174982480
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:17186
mem_clients_normal:34372
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
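
For anyone hitting the same symptom: volatile-lru can only evict keys that have a TTL set, so if none of the 4 keys carried an expire there was nothing for the eviction policy to reclaim. A minimal, hedged sketch for checking that against the affected pod (standard Redis/KeyDB commands; host and port assumed to be the defaults):

# volatile-lru only considers keys with an expire; check how many keys actually have one
keydb-cli info keyspace
# check whether eviction or expiry ever ran on this node
keydb-cli info stats | grep -E 'evicted_keys|expired_keys'
# compare the key count against used_memory_dataset above (~2.3 GB for only 4 keys)
keydb-cli dbsize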

Replication config seems to be ok as well:

role:active-replica
master_global_link_status:up
Master 0:
master_host:keydb-1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:6
master_sync_in_progress:0
slave_repl_offset:3301405475576
slave_priority:100
slave_read_only:0
connected_slaves:1
slave0:ip=keydb-1-ip,port=6379,state=online,offset=3927683809290,lag=0
master_replid:b8f99c843eba332f8f4da5aa6979083df56ea236
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:3927683809290
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:3927682760715
repl_backlog_histlen:1048576

Log files:

== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'

Keydb image: eqalpha/keydb:x86_64_v6.0.8

mykhailosmolynets commented 4 years ago

@JohnSully I don't see the rreplay command in the Redis docs, so it seems to be a KeyDB-only feature. What is it used for in KeyDB? I assume it's replicaReplayCommand. I just want to find the root cause of this issue, because it has already affected our production twice. It seems you have already had some issues with this command and added special errors for it: https://github.com/JohnSully/KeyDB/issues/67

mykhailosmolynets commented 4 years ago

From the issue we had yesterday:

7:13:S 08 Oct 2020 19:35:02.782 # == CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'
7:13:S 08 Oct 2020 19:35:02.782 # Latest backlog is: '"2769\r\n\r\n$1\r\n0\r\n$19\r\n1680013474836512770\r\n*5\r\n$7\r\nRREPLAY\r\n$36\r\na1f375d2-b3c9-4801-aaf6-12c99a436c17\r\n$114\r\n*5\r\n$7\r\nRREPLAY\r\n$36\r\nc829a0bd-e794-4185-aad2-0395dfb91d70\r\n$14\r\n*1\r\n$4\r\nPING\r\n\r\n$1\r\n0\r\n$19\r\n1680013475292643328\r\n\r\n$1\r\n0\r\n$19\r\n1680013475292643329\r\n"'
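
Decoded, that backlog fragment is a five-element RESP array: RREPLAY, a 36-byte UUID, a 114-byte payload that is itself another RREPLAY frame wrapping a plain PING, and two trailing numeric fields. The annotations below are read off the lengths in the dump; the fields marked "assumed" are interpretations, not confirmed KeyDB internals:

*5                                         # RREPLAY frame as relayed over the replication link
$7   RREPLAY
$36  a1f375d2-b3c9-4801-aaf6-12c99a436c17  # UUID of the replica that originated the command (assumed)
$114 *5 RREPLAY c829a0bd-e794-4185-aad2-0395dfb91d70 ... *1 $4 PING ...
                                           # payload: a nested RREPLAY from a second active replica, itself wrapping a PING
$1   0                                     # presumably the target database index (assumed)
$19  1680013475292643329                   # presumably a monotonic/MVCC timestamp (assumed)
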
d4gg4d commented 3 years ago

@mykhailosmolynets did you happen to sort this out, and if so, how? We are experiencing a similar failure sporadically.

VagyokC4 commented 2 years ago

Same here... would love to see some movement on resolving this issue.

Log files:

== CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'

Keydb image: eqalpha/keydb:x86_64_v6.2.1

Namoshek commented 2 years ago

Experiencing the same issue. I'm wondering whether the commands that throw this error are actually important to replay or not.

Currently, to avoid spamming my logs with useless entries, I'm running KeyDB piped through grep:

keydb-server keydb.conf | grep --line-buffered --invert-match "This server is sending an error to its AOF-loading-client: 'Command must be sent from a master' after processing the command 'rreplay'"

(--line-buffered flushes the buffer at each EOL, otherwise the output may not be immediate when there is low log traffic)
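
If the replica-to-master variant from the top of this thread shows up as well, the same filter takes multiple patterns. This is only a sketch of the log filter, not a fix for the underlying error:

keydb-server keydb.conf | grep --line-buffered --invert-match \
  -e "This server is sending an error to its AOF-loading-client" \
  -e "This replica is sending an error to its master"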

silverwind commented 2 years ago

I can reproduce this error as well on an empty KeyDB with multi-master and AOF:

docker-compose.yaml

version: "3.1"

services:
  keydb1:
    image: eqalpha/keydb
    command: keydb-server /etc/keydb/keydb.conf --replicaof keydb2 6379
    ports:
      - "6301:6379"
    volumes:
      - ./keydb.conf:/etc/keydb/keydb.conf

  keydb2:
    image: eqalpha/keydb
    command: keydb-server /etc/keydb/keydb.conf --replicaof keydb1 6379
    ports:
      - "6302:6379"
    volumes:
      - ./keydb.conf:/etc/keydb/keydb.conf

keydb.conf

protected-mode no
daemonize no
appendonly yes
replica-read-only no
multi-master yes
active-replica yes

On some startups, this error is observed on the first node:

# Server initialized
* Reading RDB preamble from AOF file...
* Loading RDB produced by version 6.3.0
* RDB age 22 seconds
* RDB memory usage when created 2.46 Mb
* RDB has an AOF tail
# Done loading RDB, keys loaded: 0, keys expired: 0.
* Reading the remaining AOF tail...
# == CRITICAL == This server is sending an error to its AOF-loading-client: 'Command must be sent from a master' after processing the command 'rreplay'
* DB loaded from append only file: 0.000 seconds

Is it safe to ignore these "critical" errors?
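
The message suggests that RREPLAY frames were persisted into the AOF and are then rejected during load, because the AOF-loading client is not treated as a master connection. A hedged way to check whether that happened here (the /data path is an assumption about the image's data directory; appendonly.aof is the default append file name in 6.x):

# count RREPLAY entries persisted in the append-only file
docker-compose exec keydb1 sh -c 'grep -a -c RREPLAY /data/appendonly.aof'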

ticteam commented 2 years ago

Hello, this issue also seems to exist on version 6.3.1. I have a multi-master config with AOF on, and on a restart of the k8s pods it writes a lot of core.xxx files. I think the problem depends on the AOF, because when appendonly is set to "no" it's gone.

ticteam commented 2 years ago

I tested around a bit. It looks like an issue with the cached data itself (the table contents). We are using KeyDB to cache some temporary access keys. After deleting the data (access keys) from the table, the issue is gone and appendonly "yes" seems to be working.

@silverwind: you mentioned that you tested with an empty db? Would you mind testing with appendonly "no"?

msotheeswaran-sc commented 1 year ago

Is anyone still experiencing this issue? It would be helpful to understand how to prioritize this.

d4gg4d commented 1 year ago

Based on a short investigation, we deduced that the multi-master, active-replica configuration was what caused it. Rebooting one of these masters caused some sort of sync burst between the nodes. If I remember correctly, we saw it in 3+ node setups, but not in 2-node setups.

We decided that we could live without the multi-master, active-replica setup and moved to a master-slave configuration for stability. For the larger db use case we reverted back to Redis.

Namoshek commented 1 year ago

If I remember correctly, we saw it in 3+ node setups, but not in 2-node setups.

I agree with your analysis with one exception: we've only used 2-node clusters until now and we are seeing the same behavior, also with multi-master and active replication. As we've also seen some other weird behavior like data appearing in wrong databases after restore, we might be switching back to Redis as well...

maxidrom-admin commented 1 year ago

Hi, are there any updates?

artarik commented 1 year ago

I got the same error with 3 nodes running as multi-master and without an appendonly file. I restarted one of them and got this:

138:199:S 03 May 2023 10:28:39.573 * MASTER <-> REPLICA sync: Loading DB in memory
138:199:S 03 May 2023 10:28:39.581 * Loading RDB produced by version 6.2.2
138:199:S 03 May 2023 10:28:39.581 * RDB age 12 seconds
138:199:S 03 May 2023 10:28:39.581 * RDB memory usage when created 314.15 Mb
138:199:S 03 May 2023 10:28:39.652 # == WARNING == This replica is rejecting a command from its master: '-LOADING KeyDB is loading the dataset in memory' after processing the command 'KEYDB.MVCCRESTORE'
138:199:S 03 May 2023 10:28:39.652 # Latest backlog is: '"J9.eyJsYX....."'
138:199:S 03 May 2023 10:28:39.652 # Command didn't execute:
138:199:S 03 May 2023 10:28:39.653 # == CRITICAL == This replica is sending an error to its master: 'command did not execute' after processing the command 'rreplay'

And at the end:

138:199:S 03 May 2023 10:28:46.430 # Done loading RDB, keys loaded: 614882, keys expired: 0.
138:199:S 03 May 2023 10:28:46.449 * MASTER <-> REPLICA sync: Finished with success

keydb.conf

bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 8192
timeout 0
tcp-keepalive 60
daemonize no
supervised no
pidfile /var/run/redis_6379.pid
loglevel notice
logfile "/var/log/container/keydb.log"
crash-log-enabled no
crash-memcheck-enabled no
databases 16
always-show-logo no
set-proc-title no
enable-motd no
proc-title-template "{title} {listen-addr} {server-mode}"
save ""
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes
dbfilename dump_nc.rdb
rdb-del-sync-files no
dir /keydb
replicaof 10.22.0.203 6379
replicaof 10.22.1.234 6379
replica-serve-stale-data yes
replica-read-only no
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-diskless-load disabled
repl-timeout 600
repl-disable-tcp-nodelay no
repl-backlog-size 100mb
repl-backlog-ttl 3600
replica-priority 100
acllog-max-len 128
maxclients 10000
maxmemory 0
maxmemory-policy noeviction
lazyfree-lazy-eviction no
lazyfree-lazy-expire yes
lazyfree-lazy-server-del no
replica-lazy-flush no
lazyfree-lazy-user-del yes
lazyfree-lazy-user-flush no
oom-score-adj no
oom-score-adj-values 0 200 800
disable-thp yes
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 50
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 1024mb 512mb 120
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
jemalloc-bg-thread yes
server-threads 2
server-thread-affinity false
active-replica yes
multi-master yes
artarik commented 1 year ago

As far as I can see, something strange happens when KeyDB starts with multiple replicas. E.g. in my config I have 2 replicaof entries:

replicaof 10.22.0.203 6379
replicaof 10.22.1.234 6379

but KeyDB is using only one:

keydb-cli config get replicaof
1) "replicaof"
2) "10.22.1.234 6379"

and INFO also gives me 2 replicas:

 keydb-cli info replication | grep slave
slave_read_repl_offset:3374103742
slave_repl_offset:3374103742
slave_priority:100
slave_read_only:0
connected_slaves:2
slave0:ip=10.22.1.234,port=6379,state=online,offset=2615368929,lag=1
slave1:ip=10.22.0.128,port=6379,state=online,offset=2615368929,lag=0

And I see the problem with the missing replica 10.22.0.203; if I manually add it, the sync completes without any problem:

/# keydb-cli replicaof 10.22.0.128 6379
OK
/# keydb-cli config get replicaof
1) "replicaof"
2) "10.22.0.128 6379"
3) "replicaof"
4) "10.22.1.234 6379"

Maybe it will help.
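
As a stopgap, the dropped link can be re-added at runtime and then verified; the IP below is the missing master from the output above, and this is only a workaround sketch, not a fix for the config-file handling:

# re-attach the master that was dropped when only one replicaof line from keydb.conf took effect
keydb-cli replicaof 10.22.0.203 6379
# confirm both links are present and connected
keydb-cli config get replicaof
keydb-cli info replication | grep -E 'master_host|master_link_status'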

artarik commented 1 year ago

Any ideas? Something new: if I configure the replicas via keydb-server parameters instead of the keydb.conf file, they do see their replicas, e.g.:

exec keydb-server /etc/keydb.conf --multi-master yes --active-replica yes --replicaof 10.7.10.74 6380  --replicaof 10.22.0.128 6380 >> /var/log/keydb.log 2>&1
..
:/#keydb-cli role
1) 1) "active-replica"
   2) "10.22.0.128"
   3) (integer) 6380
   4) "connected"
2) 1) "active-replica"
   2) "10.7.10.74"
   3) (integer) 6380
   4) "connected"

:/#keydb-cli config get replicaof
1) "replicaof"
2) "10.22.0.128 6380"
3) "replicaof"
4) "10.7.10.74 6380"
Jayd603 commented 9 months ago

Ran into this same issue on 6.3.4 using the Helm chart with multi-master and 3 nodes. Trying to find a solution now. Reducing to 2 nodes for now to see if that fixes it.

UPDATE: Everything is stable in production with 2 nodes, logs are all clean now.

franklouwers commented 1 week ago

@Jayd603 were you able to find a root cause for this? Still happy with the 2-node setup?

Jayd603 commented 1 week ago

@Jayd603 were you able to find a root cause for this? Still happy with the 2-node setup?

Nope, the same version with 2 nodes is still in production and has been solid. I did read somewhere on here that the KeyDB/Snapchat team isn't interested in working on further multi-master support because they don't use it internally, which is a bummer. I hope that changes, or that wherever I read that is incorrect; maybe someone else can chime in.

franklouwers commented 1 week ago

Thanks for the feedback @Jayd603 . Have you considered alternatives like https://github.com/dragonflydb/dragonfly?