Hi,
I have a cluster with 3 nodes (VMs) and deployed the operator with 3 replicas using the Helm chart.
In addition, I created a Dragonfly resource, also with 3 replicas, with an affinity rule so that each replica is scheduled on a different node.
All resources were created successfully.
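For reference, the Dragonfly resource looks roughly like this (a minimal sketch: the resource name, the pod label in the selector, and the assumption that the CRD exposes a pod-style affinity field are illustrative, and unrelated fields are trimmed):

```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly
  namespace: app
spec:
  replicas: 3
  # Spread the 3 replicas across the 3 nodes; adjust the label selector to
  # whatever labels the operator actually puts on the pods.
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: dragonfly
          topologyKey: kubernetes.io/hostname
```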
To test HA, I completely stop the node that the dragonfly pod with role=master is running on, making it NotReady.
I want to test a scenario where a node can be unavailable for a long time (like a power outage on one of the racks).
Once the node is down, the dragonfly master pod on it moves into the Terminating state and stays there as long as the node is unresponsive (AFAIK, an expected K8s behavior).
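(For context, I believe this is driven by the default NoExecute tolerations Kubernetes injects into every pod, sketched below with the usual default values: once tolerationSeconds expires the pod is marked for deletion, but because the kubelet on the dead node can never confirm the deletion, the pod stays in Terminating.)

```yaml
# Default tolerations Kubernetes adds to pods (values shown are the common
# defaults); they control when eviction from the unreachable node starts.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```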
I don't wish to remove this node from the cluster, as it is expected to come back online after a while.
But I would expect the operator to detect this status and move the master role to one of the other dragonfly pods on the remaining nodes (without me needing to manually/forcefully delete the terminating pod).
However, that unfortunately doesn't happen. I can see in the operator logs that this message repeats itself:
2024-08-18T11:31:51Z INFO Received {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-2","namespace":"app"}, "namespace": "app", "name": "dragonfly-2", "reconcileID": "69ba4cc7-69ee-497c-8a6b-079673acc3da", "pod": {"name":"dragonfly-2","namespace":"app"}}
2024-08-18T11:31:51Z INFO Master pod is not ready yet, will requeue {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-2","namespace":"app"}, "namespace": "app", "name": "dragonfly-2", "reconcileID": "69ba4cc7-69ee-497c-8a6b-079673acc3da", "pod": {"name":"dragonfly-2","namespace":"app"}, "restarts": 0}
So the operator is fully aware that the pod is not ready, but it just waits and decides not to move the role in this case.
Is this expected behavior? My misconfiguration? Or a bug?
Thanks, Shay