+1 What about user sessions (round-robin load balancer on top with no sticky session)?
I think all you need to add to your deployment yaml is replicas: {{ number_of_desired_instances }}
At least it appears to be working for me.
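For reference, a minimal sketch of setting this directly on the AWX custom resource that the awx-operator watches (the name and namespace below are placeholders; adjust to your install, and note that replica handling can differ between operator versions):

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  replicas: 3   # number of AWX pods the operator runs behind the service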
Hey @Michandrz, I am curious about user sessions. Have you seen any issues with increasing the replicas?
I have been trying to increase my AWX instance capacity.
Yeah, I have. I think that is more related to the GKE LB not maintaining session persistence.
Hi @Michandrz, so I guess the way to make it work is to use sticky sessions (session affinity).
In a scenario with an external Application Load Balancer (L7), this must be configured on the LB.
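For anyone running an external ALB on AWS via the AWS Load Balancer Controller, a rough sketch of what that looks like as Ingress annotations (the annotation keys are that controller's; the duration value is only illustrative):

alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=172800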
Finally had a chance to make that change and have my team keeping an eye on session tracing; I'll report back with an update in a week or so.
Setting replicas via
AWX:
  enabled: true
  name: awx
  spec:
    replicas: 2
and enabling sticky sessions in the ingress (e.g. NGINX):
AWX:
  spec:
    ingress_annotations: |
      nginx.ingress.kubernetes.io/affinity: "cookie"
      nginx.ingress.kubernetes.io/session-cookie-name: "AWX-SESSION-COOKIE-COM"
      nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
      nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
      nginx.ingress.kubernetes.io/affinity-mode: persistent
      nginx.ingress.kubernetes.io/session-cookie-hash: sha1
results in a stable user experience so far for us.
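In case it helps, this is roughly how the values above get applied with the awx-operator Helm chart (the chart repo URL, release name, and namespace are assumptions; substitute whatever your operator install uses):

helm repo add awx-operator https://ansible-community.github.io/awx-operator-helm/
helm upgrade --install awx-operator awx-operator/awx-operator -n awx -f values.yaml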
With replicas: 2 and persistent session cookie settings on the ingress, client connections are pinned to a specific pod, but for job execution there is no HA: the automation-job pods are all scheduled on the same worker node (i.e. ip-10-207-49-132.ec2.internal).
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
automation-job-480662-g9dtj 1/1 Running 0 5m41s 10.207.49.188 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480664-wtfch 1/1 Running 0 5m21s 10.207.49.157 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480666-rf9qx 1/1 Running 0 18s 10.207.49.154 ip-10-207-49-132.ec2.internal <none> <none>
automation-job-480668-zmvpf 1/1 Running 0 10s 10.207.49.138 ip-10-207-49-132.ec2.internal <none> <none>
awx-eks-6b45f5579b-m95vn 4/4 Running 0 11d 10.207.49.185 ip-10-207-49-132.ec2.internal <none> <none>
awx-eks-6b45f5579b-q5ddh 4/4 Running 0 11d 10.207.49.113 ip-10-207-49-125.ec2.internal <none> <none>
awx-operator-79bc95f78-lhbht 1/1 Running 0 11d 10.207.49.172 ip-10-207-49-132.ec2.internal <none> <none>
$
Is it possible to configure automation-job execution to run across multiple Kubernetes nodes for HA/scaling?
Well, that was one long week... It's been running fine without any issues since enabling session affinity.
Is it possible to configure automation-job execution to run across multiple Kubernetes nodes for HA/scaling?
This is controlled by your Kubernetes scheduler. If you go to your instance group in AWX, you can specify custom pod spec overrides, and then it's just a matter of using node affinity/anti-affinity.
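For example, a hedged sketch of a container group pod spec override that spreads automation-job pods across nodes using a topology spread constraint (the label, namespace, and execution environment image are placeholders; the default pod spec shown in your AWX UI is the better starting point):

apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job          # placeholder label matched by the constraint below
spec:
  serviceAccountName: default
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname   # spread across worker nodes
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: automation-job
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest   # your execution environment image
      args: ['ansible-runner', 'worker', '--private-data-dir=/runner']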
Hi here, what about the state of running Jobs? That is, if you run multiple replicas of AWX (e.g. 3) and one of the pods goes down,
is the state of those running Jobs still managed by the remaining AWX pods?
The README states:
During deployment restarts or new rollouts, when old ReplicaSet Pods are being terminated, the corresponding jobs which are managed (executed or controlled) by old AWX Pods may end up in Error state as there is no mechanism to transfer them to the newly spawned AWX Pods.
So terminationGracePeriod is a useful setting as long as there is no way to transfer running jobs to other/new replicas.
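A minimal sketch in Helm values form, assuming your operator version exposes a termination_grace_period_seconds field on the AWX spec (check the CRD/README for your release; the value is only illustrative):

AWX:
  spec:
    replicas: 2
    termination_grace_period_seconds: 7200   # give in-flight jobs up to 2h before an old pod is killed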
ISSUE TYPE
SUMMARY
Does the awx-operator support an HA deployment of AWX with multiple replicas?
Thanks
ENVIRONMENT