Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] 2-node active-replica busy loop with XREADGROUP #778

Closed hessu closed 5 months ago

hessu commented 5 months ago

Describe the bug

Right after I started using the Streams feature (single stream, low traffic) with a consumer group, the 2-node active-active replica pair went into a busy loop in which the two nodes exchange a lot of traffic.

To reproduce

Reproduced on KeyDB 6.0.16 and 6.3.4 as follows:

Set up a two-node KeyDB active-active service, for example with two KeyDB processes on the same node. Example configs are included at the end.

Attach a client, and run the following commands:

$ keydb-cli -p 3518
127.0.0.1:3519> XGROUP CREATE streams.pushnotify maps-notifier $ MKSTREAM
OK
127.0.0.1:3519> XADD streams.pushnotify MAXLEN ~ 1000 * m messagebody
"1706468335614-0"
127.0.0.1:3519> XREADGROUP GROUP maps-notifier c1 BLOCK 30000 COUNT 1 STREAMS streams.pushnotify >
1) 1) "streams.pushnotify"
   2) 1) 1) "1706468335614-0"
         2) 1) "m"
            2) "messagebody"

The stream works, but the two KeyDB processes are now using a lot of CPU (50% each when running on the same 1-CPU node, 200%/100% in my production environment) and communicating over the network at a high rate. This starts when the XREADGROUP command is executed; I'm guessing that the consumer group update starts looping between the nodes.
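One way to put a number on the traffic between the nodes is to sample `master_repl_offset` from `INFO replication` twice and divide the difference by the sampling interval. This is only a minimal sketch and not part of the original report: the helper names are hypothetical, and it assumes KeyDB exposes `master_repl_offset` the way Redis does and that the looping updates advance it.

```python
import re


def repl_offset(info_text):
    """Extract master_repl_offset from raw `INFO replication` output.

    Assumption: the field is reported as `master_repl_offset:<int>`,
    as in Redis.
    """
    match = re.search(r"^master_repl_offset:(\d+)\s*$", info_text, re.MULTILINE)
    if match is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return int(match.group(1))


def repl_bytes_per_second(info_before, info_after, interval_seconds):
    """Average replication stream growth between two INFO samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = repl_offset(info_after) - repl_offset(info_before)
    return delta / interval_seconds
```

The two text samples could come from running `keydb-cli -p 3518 INFO replication` a few seconds apart. On an idle pair the rate should stay near zero, so a sustained high value would be consistent with the loop described above.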

Restarting the second keydb (where the commands were not run) fixes it until XADD + XREADGROUP is repeated.

Expected behavior

CPU use does not go through the roof.

Additional information

Example config files for two keydb processes:

bind 127.0.0.1
port 3518

active-replica yes
replicaof 127.0.0.1 3519

# Database settings
dbfilename keydb-1.rdb
maxmemory 200m
maxmemory-policy volatile-lru
dir tmp/keydb
save 900 1

server-threads 4

## Daemon
daemonize no
supervised no
loglevel notice
pidfile tmp/keydb/keydb-1.pid

# log to stdout for systemd
logfile ""

and

bind 127.0.0.1
port 3519

active-replica yes
replicaof 127.0.0.1 3518

# Database settings
dbfilename keydb-2.rdb
maxmemory 200m
maxmemory-policy volatile-lru
dir tmp/keydb
save 900 1

server-threads 4

## Daemon
daemonize no
supervised no
loglevel notice
pidfile tmp/keydb/keydb-2.pid

# log to stdout for systemd
logfile ""
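
The two configs above differ only in the port, the peer port in `replicaof`, and the numbered file names. As a hedged illustration of that symmetry, the sketch below renders just those lines for both nodes. The function names are hypothetical, and the shared settings (`maxmemory`, `save`, `server-threads` and so on) are left out for brevity.

```python
def active_replica_config(index, port, peer_port, host="127.0.0.1"):
    """Render the node-specific lines for one node of an active-replica pair."""
    lines = [
        f"bind {host}",
        f"port {port}",
        "active-replica yes",
        # Each node replicates from its peer, which makes the pair active-active.
        f"replicaof {host} {peer_port}",
        f"dbfilename keydb-{index}.rdb",
        f"pidfile tmp/keydb/keydb-{index}.pid",
    ]
    return "\n".join(lines) + "\n"


def active_replica_pair(port_a=3518, port_b=3519):
    """Return the two mirrored configs as (node 1, node 2)."""
    return (
        active_replica_config(1, port_a, port_b),
        active_replica_config(2, port_b, port_a),
    )
```

Calling `active_replica_pair()` returns two strings that can be written to separate files and passed to the two server processes.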

hessu commented 5 months ago

Oops, this is fixed in v6.3.4. Earlier, when I thought I was testing with 6.3.4, I was accidentally still running the 6.0.16 binary. Closing as invalid.
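
To avoid this kind of mix-up, the version of the running server can be checked directly instead of relying on which binary was supposedly started. The sketch below parses `INFO server` output and compares the reported version. The helper names are hypothetical, and it assumes KeyDB reports its version in the `redis_version` field, as Redis does.

```python
def parse_info(info_text):
    """Parse raw INFO output into a dict, skipping section headers."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def running_version_matches(info_text, expected_version):
    """True if the server that produced info_text reports expected_version.

    Assumption: the version is found under `redis_version`.
    """
    return parse_info(info_text).get("redis_version") == expected_version
```

The input could be the output of `keydb-cli -p 3518 INFO server`, checked against each node before running the reproduction steps.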