dragonflydb / dragonfly-operator

A Kubernetes operator to install and manage Dragonfly instances.
https://www.dragonflydb.io/docs/managing-dragonfly/operator/installation
Apache License 2.0

Operator not moving master role from a terminating pod #230

Closed: sisrael-dn closed this issue 3 months ago

sisrael-dn commented 3 months ago

Hi,

I have a cluster with 3 nodes (VMs) and deployed the operator with 3 replicas using the Helm chart. I also created a Dragonfly resource with 3 replicas and an affinity rule so that each replica is scheduled on a different node. All resources were created successfully.

To test HA, I completely stop the node where the Dragonfly pod with role=master is running, making it not ready. I want to test a scenario where a node can be unavailable for a long time (like a power outage on one of the racks). Once the node is down, the Dragonfly master pod on it moves into the Terminating state and stays there as long as the node is unresponsive (AFAIK, expected K8s behavior). I don't want to remove this node from the cluster, since it is expected to come back online after a while. But I expect the operator to detect this state and move the master role to one of the Dragonfly pods on the other nodes, without my having to manually or forcefully remove the terminating pod. Unfortunately, that doesn't happen. I can see this message repeating in the operator logs:

2024-08-18T11:31:51Z    INFO    Received    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-2","namespace":"app"}, "namespace": "app", "name": "dragonfly-2", "reconcileID": "69ba4cc7-69ee-497c-8a6b-079673acc3da", "pod": {"name":"dragonfly-2","namespace":"app"}}
2024-08-18T11:31:51Z    INFO    Master pod is not ready yet, will requeue   {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-2","namespace":"app"}, "namespace": "app", "name": "dragonfly-2", "reconcileID": "69ba4cc7-69ee-497c-8a6b-079673acc3da", "pod": {"name":"dragonfly-2","namespace":"app"}, "restarts": 0}

So the operator is fully aware that the pod is not ready, but it just waits and does not move the role in this case. Is this expected behavior, a misconfiguration on my side, or a bug?
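To make the expected behavior concrete, here is a minimal sketch of the decision I had in mind. It is not the operator's actual code. It works on plain dictionaries shaped like Kubernetes Pod objects (`metadata.deletionTimestamp`, `metadata.labels`, and the `Ready` entry in `status.conditions`). The `role` label comes from my setup above; the function names, the `make_pod` helper, the pod names `dragonfly-0` and `dragonfly-1`, the `replica` label value, and the timestamp are all assumptions for illustration.

```python
from typing import Dict, List, Optional


def is_ready(pod: Dict) -> bool:
    """True if the pod has a Ready condition whose status is "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def is_terminating(pod: Dict) -> bool:
    """A pod marked for deletion carries metadata.deletionTimestamp."""
    return pod.get("metadata", {}).get("deletionTimestamp") is not None


def pick_new_master(pods: List[Dict]) -> Optional[str]:
    """Return the name of a pod to promote, or None.

    None means either the current master is healthy or no healthy
    candidate exists. This is a hypothetical policy, not the operator's.
    """
    master = next(
        (p for p in pods
         if p["metadata"].get("labels", {}).get("role") == "master"),
        None,
    )
    if master is not None and is_ready(master) and not is_terminating(master):
        return None  # master is fine, nothing to do
    for pod in pods:
        if pod is master:
            continue
        if is_ready(pod) and not is_terminating(pod):
            return pod["metadata"]["name"]
    return None  # no healthy pod to promote


def make_pod(name: str, role: str, ready: bool,
             terminating: bool = False) -> Dict:
    """Hypothetical helper that builds a minimal Pod-like dictionary."""
    metadata = {"name": name, "labels": {"role": role}}
    if terminating:
        # Illustrative timestamp only; any non-null value marks deletion.
        metadata["deletionTimestamp"] = "2024-08-18T11:30:00Z"
    status = {"conditions": [
        {"type": "Ready", "status": "True" if ready else "False"},
    ]}
    return {"metadata": metadata, "status": status}


# The scenario from this issue: the master is stuck Terminating on a dead node.
pods = [
    make_pod("dragonfly-2", "master", ready=False, terminating=True),
    make_pod("dragonfly-0", "replica", ready=True),
    make_pod("dragonfly-1", "replica", ready=True),
]
print(pick_new_master(pods))  # prints dragonfly-0
```

With a check like this, a master that is both not ready and marked for deletion would trigger promotion of a healthy pod instead of an endless requeue. A real implementation would also need to guard against promoting a replica that is out of sync, which this sketch ignores.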

Thanks, Shay

Abhra303 commented 3 months ago

Hi @sisrael-dn, it's a bug, and there is already an issue filed for it: #227.

sisrael-dn commented 3 months ago

Thank you @Abhra303. Closing this as a duplicate; I will follow #227.