+1 What about user sessions (round-robin load balancer on top with no sticky session)?
I think all you need to add to your deployment yaml is replicas: {{ number_of_desired_instances }}
At least it appears to be working for me.
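For reference, a minimal sketch of setting this directly on the AWX custom resource that the awx-operator watches (the name and namespace below are placeholders; adjust to your install, and note that replica handling can differ between operator versions):

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  replicas: 3   # number of AWX pods the operator runs behind the service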
Hey @Michandrz, I am curious about user sessions. Have you seen any issues with increasing the replicas?
I have been trying to increase my AWX instance capacity.
Yeah, I have. I think that is more related to the GKE LB not maintaining session persistence.
Hi @Michandrz, so I guess the way to make it work is to use sticky sessions (session affinity).
In a scenario with an external Application Load Balancer (L7), this must be configured on the LB.
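For anyone running an external ALB on AWS via the AWS Load Balancer Controller, a rough sketch of what that looks like as Ingress annotations (the annotation keys are that controller's; the duration value is only illustrative):

alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=172800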
Finally had a chance to make that change and have my team keeping an eye on session tracing; I'll report back with an update in a week or so.
Setting replicas via
AWX:
  enabled: true
  name: awx
  spec:
    replicas: 2
and enabling sticky sessions in the ingress (e.g. NGINX):
AWX:
  spec:
    ingress_annotations: |
      nginx.ingress.kubernetes.io/affinity: "cookie"
      nginx.ingress.kubernetes.io/session-cookie-name: "AWX-SESSION-COOKIE-COM"
      nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
      nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
      nginx.ingress.kubernetes.io/affinity-mode: persistent
      nginx.ingress.kubernetes.io/session-cookie-hash: sha1
results in a stable user experience so far for us.
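In case it helps, this is roughly how the values above get applied with the awx-operator Helm chart (the chart repo URL, release name, and namespace are assumptions; substitute whatever your operator install uses):

helm repo add awx-operator https://ansible-community.github.io/awx-operator-helm/
helm upgrade --install awx-operator awx-operator/awx-operator -n awx -f values.yaml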
With replicas: 2 and persistent session cookie settings on the ingress, client connections are pinned to a specific pod, but for job execution there is no HA: the automation-job pods are all scheduled on the same worker node (i.e. ip-10-207-49-132.ec2.internal).
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
automation-job-480662-g9dtj 1/1 Running 0 5m41s 10.207.49.188 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480664-wtfch 1/1 Running 0 5m21s 10.207.49.157 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480666-rf9qx 1/1 Running 0 18s 10.207.49.154 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480668-zmvpf 1/1 Running 0 10s 10.207.49.138 ip-10-207-49-132.ec2.internal <none> <none>
awx-eks-6b45f5579b-m95vn 4/4 Running 0 11d 10.207.49.185 ip-10-207-49-132.ec2.internal <none> <none>
awx-eks-6b45f5579b-q5ddh 4/4 Running 0 11d 10.207.49.113 ip-10-207-49-125.ec2.internal <none> <none>
awx-operator-79bc95f78-lhbht 1/1 Running 0 11d 10.207.49.172 ip-10-207-49-132.ec2.internal <none> <none>
$
Is it possible to configure automation-job execution to run across multiple Kubernetes nodes for HA/scaling?
Well, that was one long week... It's been running fine without any issues since enabling session affinity.
Is it possible to configure automation-job execution to run across multiple Kubernetes nodes for HA/scaling?
This is controlled by your Kubernetes scheduler. If you go to your instance group in AWX, you can specify custom pod spec overrides, and then it's just a matter of using node affinity/anti-affinity.
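For example, a hedged sketch of a container group pod spec override that spreads automation-job pods across nodes using a topology spread constraint (the label, namespace, and execution environment image are placeholders; the default pod spec shown in your AWX UI is the better starting point):

apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job          # placeholder label matched by the constraint below
spec:
  serviceAccountName: default
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname   # spread across worker nodes
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: automation-job
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest   # your execution environment image
      args: ['ansible-runner', 'worker', '--private-data-dir=/runner']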
Hi here, what about the state of running Jobs? That is, if you run multiple replicas of AWX (e.g. 3) and one of the pods goes down,
is the state of those running Jobs still managed by the remaining AWX pods?
The README states:
During deployment restarts or new rollouts, when old ReplicaSet Pods are being terminated, the corresponding jobs which are managed (executed or controlled) by old AWX Pods may end up in Error state as there is no mechanism to transfer them to the newly spawned AWX Pods.
So terminationGracePeriod is a useful setting as long as there is no way to transfer running jobs to other/new replicas.
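A minimal sketch in Helm values form, assuming your operator version exposes a termination_grace_period_seconds field on the AWX spec (check the CRD/README for your release; the value is only illustrative):

AWX:
  spec:
    replicas: 2
    termination_grace_period_seconds: 7200   # give in-flight jobs up to 2h before an old pod is killed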
ISSUE TYPE
SUMMARY
Does the awx-operator support an HA deployment of AWX with multiple replicas?
Thanks
ENVIRONMENT