dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Replication fails #3726

Open · evgenykuzyakov opened this issue 3 days ago

evgenykuzyakov commented 3 days ago

Describe the bug
Having a master node and a few replica nodes at version df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c, replica sync doesn't finish, with the following error on the master: Replication error: Operation canceled: Context cancelled.

Tonight the sync stopped on all replicas at the same time. The replicas held 221823184 keys in total in the DB. Uptime was about 8 weeks. The master node continued to work and serve read/write requests.
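For reference, the stall is also visible from the master side: connected_slaves drops back to 0 once the replicas are disconnected. Below is a minimal monitoring sketch, assuming a redis-py client; the "master-host" placeholder and the polling interval are assumptions, not values from this issue.

# Monitoring sketch (assumptions: redis-py client; "master-host" is a placeholder).
# Watches connected_slaves on the master, which drops back to 0 when replicas
# are disconnected with "Replication error: ... Context cancelled".
import time
import redis

master = redis.Redis(host="master-host", port=3007)

while True:
    n = master.info("replication").get("connected_slaves", 0)
    print(f"connected_slaves={n}")
    time.sleep(10)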

Master logs:

I20240917 18:07:17.605152    15 server_family.cc:1048] Load finished, num keys read: 222160007
I20240917 18:08:29.967986    26 dflycmd.cc:667] Registered replica *IP:PORT*
I20240917 18:08:30.166843    26 dflycmd.cc:349] Started sync with replica *IP:PORT*
I20240917 18:10:09.608582    25 dflycmd.cc:110] Disconnecting from replica *IP:PORT*
I20240917 18:10:09.608635    25 dflycmd.cc:657] Replication error: Operation canceled: Context cancelled
...

Replica(s) logs:

I20240917 18:08:29.952622    14 protocol_client.cc:182] Resolved endpoint HOST:PORT to IP:PORT
I20240917 18:08:30.171046    14 replica.cc:566] Started full sync with HOST:PORT
W20240917 18:10:10.876755    14 replica.cc:243] Error syncing with HOST:PORT generic:104 Connection reset by peer
...

Master info:

# Server
redis_version:6.2.11
dragonfly_version:df-v1.22.2
redis_mode:standalone
arch_bits:64
os:Linux 5.15.0-100-generic x86_64
multiplexing_api:epoll
tcp_port:3007
thread_count:16
uptime_in_seconds:1435
uptime_in_days:0

# Clients
connected_clients:122
max_clients:64000
client_read_buffer_bytes:31232
blocked_clients:1
pipeline_queue_length:0

# Memory
used_memory:52858339344
used_memory_human:49.23GiB
used_memory_peak:52858339344
used_memory_peak_human:49.23GiB
fibers_stack_vms:2783920
fibers_count:85
used_memory_rss:54798372864
used_memory_rss_human:51.03GiB
used_memory_peak_rss:70501720064
maxmemory:118111600640
maxmemory_human:110.00GiB
used_memory_lua:0
object_used_memory:36343856408
type_used_memory_string:9192048392
type_used_memory_zset:12129904
type_used_memory_hash:27139678112
table_used_memory:16272298496
num_buckets:30229920
num_entries:222160007
inline_keys:14090521
listpack_blobs:0
listpack_bytes:0
small_string_bytes:9192048392
pipeline_cache_bytes:0
dispatch_queue_bytes:0
dispatch_queue_subscriber_bytes:0
dispatch_queue_peak_bytes:0
client_read_buffer_peak_bytes:35328
tls_bytes:5664
snapshot_serialization_bytes:0
cache_mode:store
maxmemory_policy:noeviction

# Stats
total_connections_received:215
total_commands_processed:360
instantaneous_ops_per_sec:0
total_pipelined_commands:62
total_pipelined_squashed_commands:0
pipelined_latency_usec:47598
total_net_input_bytes:15065
connection_migrations:0
total_net_output_bytes:40409
rdb_save_usec:11154447278
rdb_save_count:18680815
instantaneous_input_kbps:-1
instantaneous_output_kbps:-1
rejected_connections:-1
expired_keys:0
evicted_keys:0
hard_evictions:0
garbage_checked:0
garbage_collected:0
bump_ups:0
stash_unloaded:0
oom_rejections:0
traverse_ttl_sec:0
delete_ttl_sec:0
keyspace_hits:0
keyspace_misses:0
keyspace_mutations:222160008
total_reads_processed:250
total_writes_processed:3179
defrag_attempt_total:0
defrag_realloc_total:0
defrag_task_invocation_total:0
reply_count:3179
reply_latency_usec:0
blocked_on_interpreter:0
lua_interpreter_cnt:0
lua_blocked:0

# Replication
role:master
connected_slaves:0
master_replid:5acda8918168aeb4780d45d08fb97825d9ac88ac

# Modules
module:name=ReJSON,ver=20000,api=1,filters=0,usedby=[search],using=[],options=[handle-io-errors]
module:name=search,ver=20000,api=1,filters=0,usedby=[],using=[ReJSON],options=[handle-io-errors]

# Keyspace
db0:keys=222160007,expires=0,avg_ttl=-1

# Cpu
used_cpu_sys:421.802633
used_cpu_user:1571.433115
used_cpu_sys_children:0.870
used_cpu_user_children:0.1163
used_cpu_sys_main_thread:26.262023
used_cpu_user_main_thread:78.715462

# Cluster
cluster_enabled:0

To Reproduce
Steps to reproduce the behavior (a minimal verification sketch follows the list):

  1. Start the master node
  2. Start Dragonfly in Docker on the replica node
  3. See the error
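A sketch of how steps 2-3 can be verified from a script, assuming redis-py, that the replica accepts the standard REPLICAOF command as in Redis, and that it exposes a Redis-style master_link_status field; the host names, port, and timeout are placeholders, not values from this issue.

# Repro/verification sketch (assumptions: redis-py; REPLICAOF behaves as in
# Redis; "replica-host"/"master-host" are placeholders for the real endpoints).
import time
import redis

replica = redis.Redis(host="replica-host", port=3007)
replica.execute_command("REPLICAOF", "master-host", "3007")

deadline = time.time() + 600
while time.time() < deadline:
    status = replica.info("replication").get("master_link_status")
    if status == "up":
        print("full sync completed")
        break
    time.sleep(5)
else:
    # Matches the reported behavior: the full sync never finishes.
    print("sync did not complete within 10 minutes")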

Expected behavior
Sync completes.

Environment (please complete the following information):

Additional context
The master node is hosted behind nginx with the stream module (TCP proxying) enabled.

The data in the DB is not private, so I might be able to provide a snapshot, but it's around 50 GB.

chakaz commented 3 days ago

Hi @evgenykuzyakov Is the error reproducible, or did it happen just once? Could there have been some network issue (or nginx issue) at the time of the connection errors?

evgenykuzyakov commented 2 days ago

Is the error reproducible, or did it happen just once?

It's reproducible every time from every replica.

I'll try replicating without nginx to see if the error still happens.

evgenykuzyakov commented 1 day ago

Tried replicating directly, without nginx, over TLS (SSL), and got the following logs on the replica:

dragonfly-1  | I20240919 16:18:03.284178     1 init.cc:78] dragonfly running in opt mode.
dragonfly-1  | I20240919 16:18:03.284281     1 dfly_main.cc:640] Starting dragonfly df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c
dragonfly-1  | * Logs will be written to the first available of the following paths:
dragonfly-1  | /tmp/dragonfly.*
dragonfly-1  | ./dragonfly.*
dragonfly-1  | * For the available flags type dragonfly [--help | --helpfull]
dragonfly-1  | * Documentation can be found at: https://www.dragonflydb.io/docs
dragonfly-1  | I20240919 16:18:03.284484     1 dfly_main.cc:703] Max memory limit is: 110.00GiB
dragonfly-1  | W20240919 16:18:03.284507     1 dfly_main.cc:367] Weird error 1 switching to epoll
dragonfly-1  | I20240919 16:18:03.362095     1 proactor_pool.cc:147] Running 12 io threads
dragonfly-1  | I20240919 16:18:03.365094     1 server_family.cc:783] Host OS: Linux 5.15.0-117-generic x86_64 with 12 threads
dragonfly-1  | I20240919 16:18:03.366552    14 server_family.cc:2598] Replicating *IP:HOST*
dragonfly-1  | I20240919 16:18:03.372794    12 listener_interface.cc:101] sock[27] AcceptServer - listening on port 3007
dragonfly-1  | I20240919 16:18:03.887799    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | W20240919 16:18:25.319118    14 replica.cc:243] Error syncing with *IP:HOST* system:103 Software caused connection abort
dragonfly-1  | E20240919 16:18:28.325552    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:28.325608    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | E20240919 16:18:28.325634    11 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:7WqEQjeUTEDarpAEGt3tP8xZMmZ1rnaHMfAzuVJTxiXX' in DB 0
dragonfly-1  | E20240919 16:18:28.327425    15 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:5varyXrCnPxQdGe6SoErjMpnS64SgE8PsG41MpCwLBrd' in DB 0
dragonfly-1  | E20240919 16:18:28.327435    11 rdb_load.cc:2485] Could not load value for key 'ft:jp6isgc0wqve.users.kaiching' in DB 0
dragonfly-1  | E20240919 16:18:28.327427    19 rdb_load.cc:2485] Could not load value for key 'ft:7407068177.tg' in DB 0
dragonfly-1  | E20240919 16:18:28.328070    14 protocol_client.cc:291] Socket error generic:103
dragonfly-1  | W20240919 16:18:28.328837    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:40.986608    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:41.130453    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:41.130522    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | W20240919 16:18:41.135857    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:51.785612    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:52.069803    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:52.069862    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | W20240919 16:18:52.071615    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:18:57.775456    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:18:57.917601    11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1  | E20240919 16:18:57.917649    11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1  | E20240919 16:18:57.918133    19 rdb_load.cc:2485] Could not load value for key 'ft:n38nmj8mfpa6.users.kaiching' in DB 0
dragonfly-1  | W20240919 16:18:57.921504    14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1  | I20240919 16:19:03.427546    14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1  | E20240919 16:19:03.555168    11 rdb_load.cc:716] Ziplist integrity check failed.
...

The master logs the same errors as before: Replication error: Operation canceled: Context cancelled

For context: ft_updates is a list populated and consumed using RPUSH and BLMOVE. The pk:* and ft:* keys are hashes.
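For illustration, a minimal sketch of that access pattern with redis-py; the destination list name, hash field names, and values are hypothetical, only ft_updates and the pk:*/ft:* key patterns come from this issue.

# Workload sketch (assumptions: redis-py; "ft_updates:processing" and the hash
# field names are hypothetical, only the key patterns appear in this issue).
import redis

r = redis.Redis(host="master-host", port=3007)

# Producer: append update ids to the ft_updates list.
r.rpush("ft_updates", "update:123")

# pk:* and ft:* keys are hashes.
r.hset("ft:7407068177.tg", mapping={"owner": "alice", "balance": "42"})

# Consumer: atomically move the oldest update to a processing list,
# blocking for up to 5 seconds if the queue is empty.
item = r.blmove("ft_updates", "ft_updates:processing", timeout=5)
if item is not None:
    print("processing", item)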

romange commented 1 day ago

@evgenykuzyakov Please ping me directly on Discord.