PayU / redis-operator

Kubernetes Redis operator that creates and manages a clustered Redis database.
Apache License 2.0

master and its corresponding replica are running on the same AZ #152

Open doronl opened 2 years ago

doronl commented 2 years ago

Hi @NataliAharoniPayu,

There is a cluster with 3 master and 3 replica nodes.

This is the output of the CLUSTER NODES command:

7a00b8761583afefffd2767a0dfc56ff17ba6da0 20.3.36.217:6379@16379 master - 0 1659388528000 3 connected 10923-16383
7829e21e0f8386b6d3a7cb6fd5ff9c46c6d50df7 20.3.37.165:6379@16379 myself,master - 0 1659388527000 1 connected 0-5460
6171e36f1228fc75b625c83fef9579010ab3e80b 20.3.35.205:6379@16379 master - 0 1659388529508 2 connected 5461-10922
60ea4fc49f2d8c4f339452a767ad9e99f087323c 20.3.35.117:6379@16379 slave 7a00b8761583afefffd2767a0dfc56ff17ba6da0 0 1659388528000 3 connected
4d5c9085df50208baf20d8c50c79ef7cbcd911c6 20.3.37.19:6379@16379 slave 7829e21e0f8386b6d3a7cb6fd5ff9c46c6d50df7 0 1659388528501 1 connected
09e3f8b5b3706912ffa9c6b87270692ceca24e36 20.3.36.87:6379@16379 slave 6171e36f1228fc75b625c83fef9579010ab3e80b 0 1659388528000 2 connected

This is the result of a script I ran to find out where (in Azure, in this case) each pod is running (format: pod:podIP node node-AZ); a rough sketch of such a script is included after the listing below.

redis-node-0:20.3.35.19 aks-redcac-27410863-vmss000005 eastus-3
redis-node-0-1:20.3.34.39 aks-redcac-27410863-vmss000003 eastus-1
redis-node-1:20.3.33.59 aks-redcac-27410863-vmss000000 eastus-1
redis-node-1-1:20.3.34.142 aks-redcac-27410863-vmss000004 eastus-2
redis-node-2:20.3.33.72 aks-redcac-27410863-vmss000001 eastus-2
redis-node-2-1:20.3.33.173 aks-redcac-27410863-vmss000002 eastus-3
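
A minimal sketch of such a mapping script, assuming client-go access via the local kubeconfig; the namespace "default" and the "app=redis-cluster-node" label selector are hypothetical and would need to match the actual deployment:

```go
// Print pod:podIP, worker node and the node's AZ for the Redis pods.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// build a client from the local kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// map each worker node to its zone via the well-known topology label
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	zoneByNode := map[string]string{}
	for _, n := range nodes.Items {
		zoneByNode[n.Name] = n.Labels["topology.kubernetes.io/zone"]
	}

	// list the Redis pods (hypothetical namespace and label selector)
	pods, err := clientset.CoreV1().Pods("default").List(ctx,
		metav1.ListOptions{LabelSelector: "app=redis-cluster-node"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s:%s %s %s\n", p.Name, p.Status.PodIP, p.Spec.NodeName, zoneByNode[p.Spec.NodeName])
	}
}
```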

Attached operator logs

(BTW, I have a 2nd cluster in another namespace that is running correctly)

operator.log

NataliAharoniPayu commented 2 years ago

Hi, I'm trying to follow the case here. According to the script result (node name : node ip : aks node : data center) I do see that each master and its replica are separated (node-0 is on eastus-3 and node-0-1 is on eastus-1, for example; the same is true for the rest of them). But when I try to analyze the links between masters and followers by relying on the CLUSTER NODES query result, I cannot find a match between the IP addresses presented there and the IP addresses in the script result... for example, 20.3.36.217 appears in the first output and not in the second one, and 20.3.34.39 appears in the second one and not in the first...

Can you elaborate more on the indications you used to detect the issue?

doronl commented 2 years ago

Oops, sorry, wrong script input... this is the intended one, showing the issue...

redis-node-0:20.3.37.165 aks-redmem-11469090-vmss000005 eastus-3
redis-node-0-1:20.3.37.19 aks-redmem-11469090-vmss000004 eastus-2
redis-node-1:20.3.35.205 aks-redmem-11469090-vmss000001 eastus-2
redis-node-1-1:20.3.36.87 aks-redmem-11469090-vmss000002 eastus-3
redis-node-2:20.3.36.217 aks-redmem-11469090-vmss000003 eastus-1
redis-node-2-1:20.3.35.117 aks-redmem-11469090-vmss000000 eastus-1

NataliAharoniPayu commented 2 years ago

I do see that node-2 and node-2-1 chose different k8s worker nodes. The hard anti-affinity is applied when choosing k8s worker nodes (this requirement is a blocker to pod creation). The data center (AZ) anti-affinity is applied as a soft rule (best effort; even if it cannot be met, the pod will still be created eventually).
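
The placement policy described here could look roughly like the following Go sketch (not necessarily the operator's exact code; the "app: redis-cluster" pod label is a hypothetical example): a required anti-affinity term on kubernetes.io/hostname keeps two Redis pods off the same worker node, while a preferred term on topology.kubernetes.io/zone only tries to spread them across AZs.

```go
// Hard anti-affinity on the worker node, soft anti-affinity on the zone.
package affinity

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func redisAntiAffinity() *corev1.Affinity {
	selector := &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "redis-cluster"}, // hypothetical pod label
	}
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			// hard rule: never co-locate two Redis pods on the same worker node;
			// a pod that cannot satisfy this stays Pending (blocks creation)
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{TopologyKey: "kubernetes.io/hostname", LabelSelector: selector},
			},
			// soft rule: try to spread across AZs, but schedule anyway if impossible
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						TopologyKey:   "topology.kubernetes.io/zone",
						LabelSelector: selector,
					},
				},
			},
		},
	}
}
```

With only the preferred zone term, the scheduler is free to co-locate a master and its replica in one AZ when node capacity forces it, which matches the situation reported above.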

doronl commented 2 years ago

We size our nodes resource-wise so that each node can fit only one pod. In this node group we have 6 nodes, 2 in each AZ. To prepare for an AZ-down scenario we need a master and its corresponding replica to be on different AZs, so that if an entire AZ goes down we can recover the cluster... do you take this into account while creating the replica? i.e. that master AZ != replica AZ?

NataliAharoniPayu commented 2 years ago

Yes, that is a very good point to consider. I will start with the steps that should be taken in order to apply a hard anti-affinity rule on your fork (see the sketch at the end of this comment):

Regarding a failure of an entire AZ: it is a very concerning state, and yet less likely to happen than a failure of a worker node. We chose the strategy of separating pods from each other based on the worker nodes they select, since applying a hard rule for the AZs as well requires a more complex design that would guarantee the order of pod creation (for example, in your scenario there was no way for node-1-1, which was created one second before node-2-1, to know that choosing worker aks-redmem-11469090-vmss000002 instead of aks-redmem-11469090-vmss000000 would create a problematic future for node-2-1). We do make some effort to increase the chances that the masters are as separated as possible when we create them first during cluster initiation.

Regarding managing the resources during an AZ outage:
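
One hedged option for the fork-side hard rule mentioned above (assumptions: the "leader-name" label and the function below are hypothetical, not the operator's code) is to make the zone anti-affinity required only among pods of the same shard, so a master and its own replica may never share an AZ while a 6-pod cluster can still fit into 3 AZs:

```go
// Per-shard hard zone rule: only pods carrying the same (hypothetical)
// "leader-name" label repel each other across zones. Pods that cannot
// satisfy it stay Pending, which is the blocking trade-off discussed above.
package affinity

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func shardZoneAntiAffinity(leaderName string) *corev1.Affinity {
	sameShard := &metav1.LabelSelector{
		MatchLabels: map[string]string{"leader-name": leaderName}, // hypothetical label
	}
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				// hard rule: a master and its replica(s) must land in different AZs
				{TopologyKey: "topology.kubernetes.io/zone", LabelSelector: sameShard},
			},
		},
	}
}
```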

doronl commented 2 years ago

Thanks for the detailed explanation, I also read the recommended section, great piece of work!

I think you got things right regarding requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. In our staging configuration we have only 1 AZ and the cluster is able to come up, and that is great!

The question is about our production environment with 3 AZs, where nodes are already allocated; the assumption is 1 pod per node (it can fit), and the k8s worker nodes are already available and spread evenly across all AZs.

Assuming we have a deployment(-like) set of 3 master pods that is spread correctly across all 3 AZs (thanks to the soft rule on topology.kubernetes.io/zone) and functioning as a cluster, and another deployment(-like) set of 3 follower(-to-be) pods that are also spread correctly across the 3 AZs (again thanks to the soft rule on topology.kubernetes.io/zone), the challenge is to make the correct match so that master AZ != follower AZ... Technically I think it is possible using the AddFollower function that I see you are using in some recovery cases. What do you think?
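
A rough sketch of the matching step this suggests, showing only the zone-selection logic; handing the chosen zone to the follower-creation path (for example AddFollower) is left out, since its real signature is not shown in this thread:

```go
// Pick a zone for a new follower that differs from its master's zone,
// preferring the zone that currently hosts the fewest followers.
package placement

import "fmt"

func pickFollowerZone(masterZone string, followersPerZone map[string]int) (string, error) {
	best := ""
	found := false
	for zone, count := range followersPerZone {
		if zone == masterZone {
			continue // never place the follower in its master's AZ
		}
		if !found || count < followersPerZone[best] {
			best, found = zone, true
		}
	}
	if !found {
		return "", fmt.Errorf("no zone other than %s is available", masterZone)
	}
	return best, nil
}
```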

NataliAharoniPayu commented 2 years ago

Thanks for the feedback on our wiki :)

We did have a few days of discussion regarding this one; the conclusions we reached were:

Re-attempting to deploy in a way that allows all pods to get created is equivalent to re-attempting to deploy in a way that fits our standards regarding pod placement.

The case of a hard rule is easier to detect but harder to mitigate. I believe each strategy has its pros and cons.

doronl commented 2 years ago

OK, thanks. I will go ahead with these assumptions and will update if anything new pops up; otherwise I will close the issue next week.

vineelyalamarthy commented 1 year ago

@NataliAharoniPayu @NataliAharoni99 @DinaYakovlev @voltbit @doronl @drealecs Since this is still in the open state, I am a bit confused. Is Node Aware Replication working yet or not?