dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Updating master on a replica causes errors #1032

Closed Pothulapati closed 1 year ago

Pothulapati commented 1 year ago

Describe the bug

When the master instance of a replica is terminated and a command is issued to update the replica's master, the command sometimes causes timeouts on the client, along with the following errors on the replica:

➜  dragonfly-operator git:(tar/resource-update) ✗ kubectl logs health-test-3
I20230403 07:30:32.091861     1 init.cc:56] dragonfly running in opt mode.
I20230403 07:30:32.091902     1 dfly_main.cc:476] Starting dragonfly df-v1.1.1-9b03b2cfd1499b987975698edb004acaf573ac9a
I20230403 07:30:32.091917     1 dfly_main.cc:499] maxmemory has not been specified. Deciding myself....
I20230403 07:30:32.091960     1 dfly_main.cc:504] Found 23.84GiB available memory. Setting maxmemory to 19.08GiB
I20230403 07:30:32.115703    14 proactor.cc:426] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20230403 07:30:32.131925     1 proactor_pool.cc:66] Running 12 io threads
I20230403 07:30:32.145551     1 server_family.cc:419] Data directory is "/data"
I20230403 07:30:32.145905    16 listener_interface.cc:96] sock[39] AcceptServer - listening on port 6379
I20230403 07:30:52.944465    16 replica.cc:563] Started full sync
I20230403 07:30:52.959640    16 replica.cc:582] Full sync finished in 58 ms
I20230403 07:30:52.959893    16 replica.cc:659] Transitioned into stable sync
I20230403 07:30:53.157919    16 replica.cc:678] Exit stable sync
W20230403 07:30:53.157953    16 replica.cc:249] Error full sync system:103 Software caused connection abort
E20230403 07:33:03.635744    16 replica.cc:209] Error connecting system:110
E20230403 07:33:03.756654    21 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.756819    25 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.756908    24 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.757125    17 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.757162    14 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.757261    19 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.757357    18 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.757875    22 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.758107    15 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.758194    16 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.759543    23 replica.cc:754] Bad FLOW response -ERR syncid not found
E20230403 07:33:03.759578    20 replica.cc:754] Bad FLOW response -ERR syncid not found
W20230403 07:33:03.759912    17 replica.cc:234] Error syncing generic:125 Operation canceled
W20230403 07:33:03.845757    19 replica.cc:234] Error syncing generic:125 Operation canceled
W20230403 07:33:03.933781    24 replica.cc:234] Error syncing generic:125 Operation canceled
I20230403 07:33:06.562381    20 dflycmd.cc:474] Registered replica 10.244.0.10:6379
I20230403 07:33:06.695430    21 dflycmd.cc:474] Registered replica 10.244.0.7:6379
I20230403 07:33:06.715792    20 dflycmd.cc:291] Started full sync with replica 10.244.0.10:6379
I20230403 07:33:06.733628    20 dflycmd.cc:330] Transitioned into stable sync with replica 10.244.0.10:6379
I20230403 07:33:06.796116    21 dflycmd.cc:291] Started full sync with replica 10.244.0.7:6379
I20230403 07:33:06.799985    21 dflycmd.cc:330] Transitioned into stable sync with replica 10.244.0.7:6379
I20230403 07:33:09.830744    20 dflycmd.cc:513] Disconnecting from replica 10.244.0.10:6379

To Reproduce Steps to reproduce the behavior:

  1. Set up a Dragonfly master with one or more replicas
  2. Terminate the master instance
  3. Issue a REPLICAOF command pointing a replica at the new master (a rough sketch of this step in Go follows the list)
  4. Observe client timeouts and the replica errors shown above
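A minimal sketch of step 3 with a go-redis client. This is not the operator's actual code, and the addresses are hypothetical (borrowed from the logs in this thread):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Connect to a replica whose master has just been terminated.
	// The address is hypothetical, taken from the logs above.
	replica := redis.NewClient(&redis.Options{Addr: "10.244.0.16:6379"})

	// Point the replica at the new master (step 3). In the failure case
	// this call hangs on the client and eventually returns an i/o timeout,
	// while the replica logs "Bad FLOW response -ERR syncid not found".
	if err := replica.SlaveOf(ctx, "10.244.0.13", "6379").Err(); err != nil {
		fmt.Println("SLAVE OF failed:", err)
	}
}
```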

Expected behavior REPLICAOF commands should work as expected.


royjacobson commented 1 year ago

I could reproduce a large timeout when running replicaof IP PORT with an unreachable IP address. Our connect() is async but uses the default timeout, which is quite large.

However, the replica seems to still be reachable during this period and replies with ERR Can not execute during LOADING to requests from other clients.

@Pothulapati, would that explain the problem you see, or is this not the issue? Could you help me reproduce the issue if it's not the case?
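As a side note on the timeout mentioned above: since REPLICAOF blocks while the server dials the new master, a client issuing it can at least bound its own wait. A minimal sketch with go-redis, with illustrative addresses and timeout values:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Bound the TCP connect and command reads so a REPLICAOF pointing at
	// an unreachable master fails fast on the client instead of waiting
	// out a large default timeout. Values here are illustrative.
	client := redis.NewClient(&redis.Options{
		Addr:        "10.244.0.16:6379", // hypothetical replica address
		DialTimeout: 3 * time.Second,
		ReadTimeout: 5 * time.Second,
	})

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// "10.0.0.99" stands in for an unreachable master IP.
	if err := client.SlaveOf(ctx, "10.0.0.99", "6379").Err(); err != nil {
		fmt.Println("replicaof failed:", err)
	}
}
```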

Pothulapati commented 1 year ago

@royjacobson and I jumped on a call to reproduce this case, and we were successful.

Here are the logs of the replica whose master was being changed:

➜  dragonfly-operator git:(main) ✗ kubectl logs dragonfly-sample-3
I20230404 07:47:08.100098     1 init.cc:56] dragonfly running in opt mode.
I20230404 07:47:08.100158     1 dfly_main.cc:476] Starting dragonfly df-v1.1.1-9b03b2cfd1499b987975698edb004acaf573ac9a
I20230404 07:47:08.100178     1 dfly_main.cc:499] maxmemory has not been specified. Deciding myself....
I20230404 07:47:08.100240     1 dfly_main.cc:504] Found 21.55GiB available memory. Setting maxmemory to 17.24GiB
I20230404 07:47:08.131357    26 proactor.cc:426] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20230404 07:47:08.147665     1 proactor_pool.cc:66] Running 12 io threads
I20230404 07:47:08.179448     1 server_family.cc:419] Data directory is "/data"
I20230404 07:47:08.183507    17 listener_interface.cc:96] sock[39] AcceptServer - listening on port 6379
I20230404 07:47:08.337898    15 replica.cc:563] Started full sync
I20230404 07:47:08.341495    15 replica.cc:582] Full sync finished in 53 ms
I20230404 07:47:08.341969    15 replica.cc:659] Transitioned into stable sync
I20230404 07:48:05.791030    15 replica.cc:678] Exit stable sync
W20230404 07:48:05.791108    15 replica.cc:249] Error full sync system:103 Software caused connection abort
E20230404 07:50:15.759387    15 replica.cc:209] Error connecting system:110
I20230404 07:50:15.884474    15 replica.cc:563] Started full sync
I20230404 07:50:15.893606    15 replica.cc:582] Full sync finished in 69 ms
I20230404 07:50:15.893929    15 replica.cc:659] Transitioned into stable sync

and the operator logs from when it ran the command:

2023-04-04T07:48:05Z    INFO    Configuring replication {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8"}
2023-04-04T07:48:05Z    INFO    Updating status {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "phase": "configuring-replication"}
2023-04-04T07:48:05Z    INFO    getting all pods relevant to the instance       {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8"}
2023-04-04T07:48:05Z    INFO    Marking pod as master   {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-1", "ip": "10.244.0.13"}
2023-04-04T07:48:05Z    INFO    Running SLAVE OF NO ONE command {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "pod": "dragonfly-sample-1"}
2023-04-04T07:48:05Z    INFO    Marking pod role as master      {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "pod": "dragonfly-sample-1"}
2023-04-04T07:48:05Z    INFO    Checking pod    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-0", "ip": "10.244.0.12", "status": "Running", "deletiontimestamp": "2023-04-04 07:48:35 +0000 UTC"}
2023-04-04T07:48:05Z    INFO    Checking pod    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-1", "ip": "10.244.0.13", "status": "Running", "deletiontimestamp": "<nil>"}
2023-04-04T07:48:05Z    INFO    Checking pod    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-2", "ip": "10.244.0.14", "status": "Running", "deletiontimestamp": "<nil>"}
2023-04-04T07:48:05Z    INFO    Marking pod as replica  {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-2", "ip": "10.244.0.14", "status": "Running"}
2023-04-04T07:48:05Z    INFO    Trying to invoke SLAVE OF command       {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "pod": "dragonfly-sample-2", "master": "10.244.0.13"}
2023-04-04T07:48:05Z    INFO    Marking pod role as replica     {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "pod": "dragonfly-sample-2"}
2023-04-04T07:48:05Z    INFO    Checking pod    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-3", "ip": "10.244.0.16", "status": "Running", "deletiontimestamp": "<nil>"}
2023-04-04T07:48:05Z    INFO    Marking pod as replica  {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-3", "ip": "10.244.0.16", "status": "Running"}
2023-04-04T07:48:05Z    INFO    Trying to invoke SLAVE OF command       {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "pod": "dragonfly-sample-3", "master": "10.244.0.13"}
2023-04-04T07:48:08Z    ERROR   Failed to mark pod as replica   {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-sample-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-sample-0", "reconcileID": "d208b253-d879-4865-af8c-bf45096f19e8", "podName": "dragonfly-sample-3", "error": "Error running SLAVE OF command: read tcp 10.244.0.5:55776->10.244.0.16:6379: i/o timeout"}
github.com/dragonflydb/dragonfly-operator/internal/controller.(*DragonflyInstance).configureReplication
        /workspace/internal/controller/dragonfly_instance.go:113
github.com/dragonflydb/dragonfly-operator/internal/controller.(*DfPodLifeCycleReconciler).Reconcile
        /workspace/internal/controller/dragonfly_pod_lifecycle_controller.go:142
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.1/pkg/internal/controller/controller.go:235
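The failing call in the trace above is the operator's SLAVE OF against the pod. For illustration, a hedged sketch of a generic retry-with-timeout pattern around such a call (not the operator's actual implementation; attempt counts and timeouts are made up):

```go
package replication

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// slaveOfWithRetry issues SLAVE OF with a per-attempt timeout and a short
// backoff between attempts, so a single unresponsive pod does not block
// reconciliation indefinitely.
func slaveOfWithRetry(ctx context.Context, c *redis.Client, host, port string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
		err = c.SlaveOf(attemptCtx, host, port).Err()
		cancel()
		if err == nil {
			return nil
		}
		time.Sleep(time.Second) // back off before retrying
	}
	return fmt.Errorf("SLAVE OF %s %s failed after %d attempts: %w", host, port, attempts, err)
}
```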
Pothulapati commented 1 year ago

@royjacobson, I was able to replicate the error with more logs. As you can see, we changed the master on a replica (the operator retries the command multiple times if it does not get a response), but the replica is still stuck.

➜  dragonfly-operator git:(tar/requeue-no-wait) ✗ kubectl logs dragonfly-sample-0
I20230405 08:00:34.000164     1 init.cc:56] dragonfly running in opt mode.
I20230405 08:00:34.000218     1 dfly_main.cc:476] Starting dragonfly df-ee7705a-ee7705a84d70d4c9ccbfb1cec93304dfed02663d
I20230405 08:00:34.000232     1 dfly_main.cc:499] maxmemory has not been specified. Deciding myself....
I20230405 08:00:34.000283     1 dfly_main.cc:504] Found 20.79GiB available memory. Setting maxmemory to 16.63GiB
I20230405 08:00:34.023273    17 proactor.cc:426] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20230405 08:00:34.043428     1 proactor_pool.cc:66] Running 12 io threads
I20230405 08:00:34.061728     1 server_family.cc:419] Data directory is "/data"
I20230405 08:00:34.062249    17 listener_interface.cc:96] sock[39] AcceptServer - listening on port 6379
I20230405 08:00:34.095641    15 server_family.cc:1771] Replicating 10.244.0.8:6379
I20230405 08:00:34.261664    15 replica.cc:577] Started full sync with 10.244.0.8:6379
I20230405 08:00:34.271500    15 replica.cc:596] Full sync finished in 102 ms
I20230405 08:00:34.273396    15 replica.cc:673] Transitioned into stable sync
I20230405 08:00:48.581399    26 server_family.cc:1771] Replicating 10.244.0.10:6379
I20230405 08:00:48.583335    15 replica.cc:692] Exit stable sync
W20230405 08:00:48.583370    15 replica.cc:262] Error full sync with 10.244.0.8:6379 generic:125 Operation canceled
I20230405 08:00:48.682265    26 replica.cc:577] Started full sync with 10.244.0.10:6379
I20230405 08:00:48.684026    26 replica.cc:596] Full sync finished in 32 ms
I20230405 08:00:48.684783    26 replica.cc:673] Transitioned into stable sync
I20230405 08:00:49.348903    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:00:52.520097    20 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:00:55.523461    24 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:00:58.527462    15 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:01.530014    18 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:04.535784    23 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:07.545243    26 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:10.548717    17 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:13.552071    22 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:16.556663    25 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:19.560676    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:22.563323    21 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:25.566026    24 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:28.568840    15 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:31.571931    18 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:34.590735    23 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:37.594620    26 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:40.597136    17 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:43.599417    22 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:46.602995    25 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:49.615132    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:52.652987    21 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:55.668295    24 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:01:58.674300    15 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:01.679100    18 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:04.691690    23 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:07.694458    26 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:10.698467    17 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:13.702162    22 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:16.705062    25 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:19.708786    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:22.712658    21 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:25.715885    24 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:28.717672    15 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:31.719825    18 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:34.722682    23 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:37.725445    26 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:40.727339    17 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:43.730517    22 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:51.853925    26 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:55.207504    20 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:02:58.210179    22 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:04.210144    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:07.215842    19 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:30.699613    21 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:39.170729    16 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:42.176081    19 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:48.172338    25 server_family.cc:1771] Replicating 10.244.0.7:6379
I20230405 08:03:59.817157    20 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:04.506585    24 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:07.513571    26 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:10.526170    16 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:13.548205    20 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:16.549757    22 server_family.cc:1771] Replicating NO:ONE
I20230405 08:04:19.644433    24 server_family.cc:1771] Replicating 10.244.0.12:6379
I20230405 08:04:22.853026    17 server_family.cc:1771] Replicating 10.244.0.12:6379
I20230405 08:04:25.852200    20 server_family.cc:1771] Replicating 10.244.0.12:6379
I20230405 08:04:29.015492    23 server_family.cc:1771] Replicating 10.244.0.12:6379
I20230405 08:04:32.338214    26 server_family.cc:1771] Replicating 10.244.0.12:6379

Running the replication INFO command also shows that the replica is stuck on the initial master:

dragonfly-sample-0, 10.244.0.11: # Replication
role:replica
master_host:10.244.0.10
master_port:6379
master_link_status:down
master_last_io_seconds_ago:358
master_sync_in_progress:0
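
For completeness, a stuck replica like this can also be detected programmatically by checking master_link_status in the INFO replication output. A small go-redis sketch (the address is hypothetical):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// Address is hypothetical; this would point at the stuck replica.
	c := redis.NewClient(&redis.Options{Addr: "10.244.0.11:6379"})

	info, err := c.Info(ctx, "replication").Result()
	if err != nil {
		panic(err)
	}
	// INFO returns "key:value" lines; master_link_status:down indicates
	// the replica has not (re)connected to its master.
	for _, line := range strings.Split(info, "\n") {
		if strings.HasPrefix(line, "master_link_status:") {
			fmt.Println(strings.TrimSpace(line))
		}
	}
}
```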