evgenykuzyakov opened this issue 3 days ago
Hi @evgenykuzyakov, is the error reproducible, or did it happen just once? Could there have been some network issue (or nginx issue) at the time of the connection errors?
> Is the error reproducible, or did it happen just once?
It's reproducible every time from every replica.
Will try to replicate without nginx to see if the error happens
Tried replicating directly (without nginx) with TLS/SSL enabled, and got the following logs on the replica:
dragonfly-1 | I20240919 16:18:03.284178 1 init.cc:78] dragonfly running in opt mode.
dragonfly-1 | I20240919 16:18:03.284281 1 dfly_main.cc:640] Starting dragonfly df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c
dragonfly-1 | * Logs will be written to the first available of the following paths:
dragonfly-1 | /tmp/dragonfly.*
dragonfly-1 | ./dragonfly.*
dragonfly-1 | * For the available flags type dragonfly [--help | --helpfull]
dragonfly-1 | * Documentation can be found at: https://www.dragonflydb.io/docs
dragonfly-1 | I20240919 16:18:03.284484 1 dfly_main.cc:703] Max memory limit is: 110.00GiB
dragonfly-1 | W20240919 16:18:03.284507 1 dfly_main.cc:367] Weird error 1 switching to epoll
dragonfly-1 | I20240919 16:18:03.362095 1 proactor_pool.cc:147] Running 12 io threads
dragonfly-1 | I20240919 16:18:03.365094 1 server_family.cc:783] Host OS: Linux 5.15.0-117-generic x86_64 with 12 threads
dragonfly-1 | I20240919 16:18:03.366552 14 server_family.cc:2598] Replicating *IP:HOST*
dragonfly-1 | I20240919 16:18:03.372794 12 listener_interface.cc:101] sock[27] AcceptServer - listening on port 3007
dragonfly-1 | I20240919 16:18:03.887799 14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1 | W20240919 16:18:25.319118 14 replica.cc:243] Error syncing with *IP:HOST* system:103 Software caused connection abort
dragonfly-1 | E20240919 16:18:28.325552 11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1 | E20240919 16:18:28.325608 11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1 | E20240919 16:18:28.325634 11 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:7WqEQjeUTEDarpAEGt3tP8xZMmZ1rnaHMfAzuVJTxiXX' in DB 0
dragonfly-1 | E20240919 16:18:28.327425 15 rdb_load.cc:2485] Could not load value for key 'pk:ed25519:5varyXrCnPxQdGe6SoErjMpnS64SgE8PsG41MpCwLBrd' in DB 0
dragonfly-1 | E20240919 16:18:28.327435 11 rdb_load.cc:2485] Could not load value for key 'ft:jp6isgc0wqve.users.kaiching' in DB 0
dragonfly-1 | E20240919 16:18:28.327427 19 rdb_load.cc:2485] Could not load value for key 'ft:7407068177.tg' in DB 0
dragonfly-1 | E20240919 16:18:28.328070 14 protocol_client.cc:291] Socket error generic:103
dragonfly-1 | W20240919 16:18:28.328837 14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1 | I20240919 16:18:40.986608 14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1 | E20240919 16:18:41.130453 11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1 | E20240919 16:18:41.130522 11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1 | W20240919 16:18:41.135857 14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1 | I20240919 16:18:51.785612 14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1 | E20240919 16:18:52.069803 11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1 | E20240919 16:18:52.069862 11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1 | W20240919 16:18:52.071615 14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1 | I20240919 16:18:57.775456 14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1 | E20240919 16:18:57.917601 11 rdb_load.cc:716] Ziplist integrity check failed.
dragonfly-1 | E20240919 16:18:57.917649 11 rdb_load.cc:2485] Could not load value for key 'ft_updates' in DB 0
dragonfly-1 | E20240919 16:18:57.918133 19 rdb_load.cc:2485] Could not load value for key 'ft:n38nmj8mfpa6.users.kaiching' in DB 0
dragonfly-1 | W20240919 16:18:57.921504 14 replica.cc:243] Error syncing with *IP:HOST* dragonfly.rdbload:5 Internal error when loading RDB file 5
dragonfly-1 | I20240919 16:19:03.427546 14 replica.cc:566] Started full sync with *IP:HOST*
dragonfly-1 | E20240919 16:19:03.555168 11 rdb_load.cc:716] Ziplist integrity check failed.
...
Master logs the same errors as before: Replication error: Operation canceled: Context cancelled
For context: ft_updates is populated and consumed using RPUSH, BLMOVE. Keys pk:* and ft:* are hashmaps.
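To make that access pattern concrete, here is a minimal sketch using redis-py (Dragonfly speaks the Redis protocol); the host, the processing-list name, and the field values are illustrative placeholders rather than the application's real code:

```python
import redis

# Placeholder connection details; the real deployment talks to the master over TLS.
r = redis.Redis(host="dragonfly-master", port=6379)

# ft_updates is a list: producers RPUSH updates onto it...
r.rpush("ft_updates", '{"token": "ft:example.token", "delta": "1"}')

# ...and a consumer drains it with a blocking BLMOVE, moving each item
# onto a processing list (the destination list name here is hypothetical).
item = r.blmove("ft_updates", "ft_updates:processing", timeout=5,
                src="LEFT", dest="RIGHT")

# pk:* and ft:* keys are hashes (hashmaps); the fields shown are made up.
r.hset("ft:example.token", mapping={"balance": "100", "owner": "example.near"})
print(item, r.hgetall("ft:example.token"))
```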
@evgenykuzyakov Please ping me directly on discord
Describe the bug
With a master node and a few replica nodes at version df-v1.22.2-f223457cf6815a37ae5a9e4cec6972f19a04c50c, replica sync doesn't finish, with the following error from the master:
Replication error: Operation canceled: Context cancelled
Tonight the sync stopped on all replicas at the same time. There are 221823184 keys total in the DB on the replicas. Uptime was about 8 weeks. The master node continued to work and serve read/write requests.
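For reference, a minimal sketch of how a replica can be pointed at the master over the Redis protocol; the hosts and ports are placeholders, and the actual deployment may instead configure replication via a startup flag:

```python
import redis

# Placeholder addresses; the real deployment uses TLS and (originally)
# went through nginx in stream mode.
MASTER_HOST, MASTER_PORT = "master.example.internal", 6379

replica = redis.Redis(host="replica.example.internal", port=6379)

# Point the replica at the master; Dragonfly accepts the standard Redis
# REPLICAOF command, after which it begins a full sync.
replica.execute_command("REPLICAOF", MASTER_HOST, MASTER_PORT)

# Replication status can then be inspected via INFO replication.
print(replica.info("replication"))
```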
Master logs:
Replica(s) logs:
Master info:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Sync completes.
Environment (please complete the following information):
Linux dragon-1 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Additional context
The master node is hosted behind nginx with streams enabled.
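Presumably this refers to nginx's TCP stream module proxying the Redis protocol through to the Dragonfly master, roughly along the lines of the sketch below; the addresses and ports are placeholders, not the real configuration:

```nginx
# Illustrative stream-module config; the actual nginx.conf may differ.
stream {
    upstream dragonfly_master {
        server 10.0.0.10:6379;   # placeholder Dragonfly master address
    }

    server {
        listen 6380;             # placeholder port that replicas/clients connect to
        proxy_pass dragonfly_master;
    }
}
```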
The data in the DB is not private, so I might be able to provide a snapshot, but it's around 50 GB.