PayU / redis-operator

Kubernetes Redis operator that creates and manages a clustered Redis database.
Apache License 2.0

master and its corresponding replica are running on the same AZ #152

Open doronl opened 2 years ago

doronl commented 2 years ago

Hi @NataliAharoniPayu,

There is a cluster with 3 master and 3 replica nodes.

This is the output of the CLUSTER NODES command:

7a00b8761583afefffd2767a0dfc56ff17ba6da0 20.3.36.217:6379@16379 master - 0 1659388528000 3 connected 10923-16383
7829e21e0f8386b6d3a7cb6fd5ff9c46c6d50df7 20.3.37.165:6379@16379 myself,master - 0 1659388527000 1 connected 0-5460
6171e36f1228fc75b625c83fef9579010ab3e80b 20.3.35.205:6379@16379 master - 0 1659388529508 2 connected 5461-10922
60ea4fc49f2d8c4f339452a767ad9e99f087323c 20.3.35.117:6379@16379 slave 7a00b8761583afefffd2767a0dfc56ff17ba6da0 0 1659388528000 3 connected
4d5c9085df50208baf20d8c50c79ef7cbcd911c6 20.3.37.19:6379@16379 slave 7829e21e0f8386b6d3a7cb6fd5ff9c46c6d50df7 0 1659388528501 1 connected
09e3f8b5b3706912ffa9c6b87270692ceca24e36 20.3.36.87:6379@16379 slave 6171e36f1228fc75b625c83fef9579010ab3e80b 0 1659388528000 2 connected

This is the result of a script I ran to find out where (in Azure, in this case) each pod is running (format: pod:podIP node node-AZ); a rough sketch of such a script is included after the listing below.

redis-node-0:20.3.35.19 aks-redcac-27410863-vmss000005 eastus-3
redis-node-0-1:20.3.34.39 aks-redcac-27410863-vmss000003 eastus-1
redis-node-1:20.3.33.59 aks-redcac-27410863-vmss000000 eastus-1
redis-node-1-1:20.3.34.142 aks-redcac-27410863-vmss000004 eastus-2
redis-node-2:20.3.33.72 aks-redcac-27410863-vmss000001 eastus-2
redis-node-2-1:20.3.33.173 aks-redcac-27410863-vmss000002 eastus-3
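
A minimal sketch of such a mapping script, assuming client-go access via the local kubeconfig; the namespace "default" and the "app=redis-cluster-node" label selector are hypothetical and would need to match the actual deployment:

```go
// Print pod:podIP, worker node and the node's AZ for the Redis pods.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// build a client from the local kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// map each worker node to its zone via the well-known topology label
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	zoneByNode := map[string]string{}
	for _, n := range nodes.Items {
		zoneByNode[n.Name] = n.Labels["topology.kubernetes.io/zone"]
	}

	// list the Redis pods (hypothetical namespace and label selector)
	pods, err := clientset.CoreV1().Pods("default").List(ctx,
		metav1.ListOptions{LabelSelector: "app=redis-cluster-node"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s:%s %s %s\n", p.Name, p.Status.PodIP, p.Spec.NodeName, zoneByNode[p.Spec.NodeName])
	}
}
```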

Attached operator logs

(BTW, I have a 2nd cluster in another namespace that is running correctly)

operator.log

NataliAharoniPayu commented 2 years ago

Hi, I'm trying to follow the case here. According to the script result (node name : node ip : aks node : data center) I do see that each master and its replica are separated (node-0 is on eastus-3 and node-0-1 is on eastus-1, for example; the same is true for the rest of them). But when I try to analyze the links between masters and followers by relying on the CLUSTER NODES query result, I cannot find a match between the IP addresses presented there and the IP addresses in the script result... for example, 20.3.36.217 appears in the first output and not in the second one, and 20.3.34.39 appears in the second one and not in the first...

Can you elaborate more on the indications you used to detect the issue?

doronl commented 2 years ago

Oops, sorry, wrong script input... this is the intended one, showing the issue...

redis-node-0:20.3.37.165 aks-redmem-11469090-vmss000005 eastus-3
redis-node-0-1:20.3.37.19 aks-redmem-11469090-vmss000004 eastus-2
redis-node-1:20.3.35.205 aks-redmem-11469090-vmss000001 eastus-2
redis-node-1-1:20.3.36.87 aks-redmem-11469090-vmss000002 eastus-3
redis-node-2:20.3.36.217 aks-redmem-11469090-vmss000003 eastus-1
redis-node-2-1:20.3.35.117 aks-redmem-11469090-vmss000000 eastus-1

NataliAharoniPayu commented 2 years ago

I do see that node-2 and node-2-1 chose different k8s worker nodes. The hard anti-affinity is applied when choosing k8s worker nodes (this requirement is a blocker to pod creation). The data center (AZ) anti-affinity is applied as a soft rule (best effort; even if it cannot be met, the pod will still be created eventually).
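
The placement policy described here could look roughly like the following Go sketch (not necessarily the operator's exact code; the "app: redis-cluster" pod label is a hypothetical example): a required anti-affinity term on kubernetes.io/hostname keeps two Redis pods off the same worker node, while a preferred term on topology.kubernetes.io/zone only tries to spread them across AZs.

```go
// Hard anti-affinity on the worker node, soft anti-affinity on the zone.
package affinity

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func redisAntiAffinity() *corev1.Affinity {
	selector := &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "redis-cluster"}, // hypothetical pod label
	}
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			// hard rule: never co-locate two Redis pods on the same worker node;
			// a pod that cannot satisfy this stays Pending (blocks creation)
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{TopologyKey: "kubernetes.io/hostname", LabelSelector: selector},
			},
			// soft rule: try to spread across AZs, but schedule anyway if impossible
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						TopologyKey:   "topology.kubernetes.io/zone",
						LabelSelector: selector,
					},
				},
			},
		},
	}
}
```

With only the preferred zone term, the scheduler is free to co-locate a master and its replica in one AZ when node capacity forces it, which matches the situation reported above.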

doronl commented 2 years ago

We size our nodes resource-wise so that each node can fit only one pod. In this node group we have 6 nodes, 2 in each AZ. To prepare for an AZ-down scenario we need a master and its corresponding replica to be on different AZs, so that if an entire AZ goes down we can recover the cluster... do you take this into account while creating the replica? i.e. that master AZ != replica AZ?

NataliAharoniPayu commented 2 years ago

Yes, that is a very good point to consider. I will start with the steps that should be taken in order to apply a hard anti-affinity rule on your fork (see the sketch at the end of this comment):

Regarding a failure of an entire AZ: it is a very concerning state, and yet less likely to happen than a failure of a worker node. We chose the strategy of separating pods from each other based on the worker nodes they select, since applying a hard rule for the AZs as well requires a more complex design that would guarantee the order of pod creation (for example, in your scenario there was no way for node-1-1, which was created one second before node-2-1, to know that choosing worker aks-redmem-11469090-vmss000002 instead of aks-redmem-11469090-vmss000000 would create a problematic future for node-2-1). We do make some effort to increase the chances that the masters are as separated as possible when we create them first during cluster initiation.

Regarding managing the resources during an AZ outage:
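
One hedged option for the fork-side hard rule mentioned above (assumptions: the "leader-name" label and the function below are hypothetical, not the operator's code) is to make the zone anti-affinity required only among pods of the same shard, so a master and its own replica may never share an AZ while a 6-pod cluster can still fit into 3 AZs:

```go
// Per-shard hard zone rule: only pods carrying the same (hypothetical)
// "leader-name" label repel each other across zones. Pods that cannot
// satisfy it stay Pending, which is the blocking trade-off discussed above.
package affinity

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func shardZoneAntiAffinity(leaderName string) *corev1.Affinity {
	sameShard := &metav1.LabelSelector{
		MatchLabels: map[string]string{"leader-name": leaderName}, // hypothetical label
	}
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				// hard rule: a master and its replica(s) must land in different AZs
				{TopologyKey: "topology.kubernetes.io/zone", LabelSelector: sameShard},
			},
		},
	}
}
```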

doronl commented 2 years ago

Thanks for the detailed explanation, I also read the recommended section, great piece of work!

I think you got things right regarding requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. In our staging configuration we have only 1 AZ and the cluster is able to come up, and that is great!

The question is about our production environment with 3 AZs, where nodes are already allocated; the assumption is 1 pod per node (it can fit), and the k8s worker nodes are already available and spread evenly across all AZs.

Assuming we have a deployment(-like) set of 3 master pods that is spread correctly across all 3 AZs (thanks to the soft rule on topology.kubernetes.io/zone) and functioning as a cluster, and another deployment(-like) set of 3 follower(-to-be) pods that are also spread correctly across the 3 AZs (again thanks to the soft rule on topology.kubernetes.io/zone), the challenge is to make the correct match so that master AZ != follower AZ... Technically I think it is possible using the AddFollower function that I see you are using in some recovery cases. What do you think?
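
A rough sketch of the matching step this suggests, showing only the zone-selection logic; handing the chosen zone to the follower-creation path (for example AddFollower) is left out, since its real signature is not shown in this thread:

```go
// Pick a zone for a new follower that differs from its master's zone,
// preferring the zone that currently hosts the fewest followers.
package placement

import "fmt"

func pickFollowerZone(masterZone string, followersPerZone map[string]int) (string, error) {
	best := ""
	found := false
	for zone, count := range followersPerZone {
		if zone == masterZone {
			continue // never place the follower in its master's AZ
		}
		if !found || count < followersPerZone[best] {
			best, found = zone, true
		}
	}
	if !found {
		return "", fmt.Errorf("no zone other than %s is available", masterZone)
	}
	return best, nil
}
```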

NataliAharoniPayu commented 2 years ago

Thanks for the feedback on our wiki :)

We did have a few days of discussion regarding this one; the conclusions we reached were:

Re-attempting to deploy in a way that allows all pods to get created is equivalent to re-attempting to deploy in a way that fits our standards regarding pod placement.

The case of a hard rule is easier to detect but harder to mitigate. I believe each strategy has its pros and cons.

doronl commented 2 years ago

OK, thanks. I will go ahead with these assumptions and will update if anything new pops up; otherwise I will close the issue next week.

vineelyalamarthy commented 1 year ago

@NataliAharoniPayu @NataliAharoni99 @DinaYakovlev @voltbit @doronl @drealecs Since this is still in the open state, I am a bit confused. Is Node Aware Replication working yet or not?