ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

task and web replicas are scaled to 0 by the operator #1960

Open jdratlif opened 2 months ago

jdratlif commented 2 months ago

Bug Summary

After a fresh AWX install with awx-operator, the operator scales the web and task deployments down to 0, which stops AWX completely. It never scales the deployments back up.

AWX Operator version

2.19.1

AWX version

1.27.12

Kubernetes platform

kubernetes

Kubernetes/Platform version

k3s

Modifications

no

Steps to reproduce

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: awx-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: le-staging
spec:
  acme:
    privateKeySecretRef:
      name: le-staging
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: jdr1
spec:
  hostname: jdr1.k8s.test.example.com
  ingress_type: ingress
  ingress_annotations: |
    cert-manager.io/issuer: le-staging
    traefik.ingress.kubernetes.io/router.middlewares: default-bastion-office-vpn@kubernetescrd
  ingress_tls_secret: awx-tls-le-staging
  service_type: ClusterIP
  postgres_data_volume_init: true

Expected results

I expected awx to be running.

Actual results

It starts up, then gets stopped, and doesn't restart without manual intervention.

Additional information

If I use awx-operator 2.18, I don't have this problem. It seems the problem was introduced in the 2.19.0 or 2.19.1 release.

Operator Logs

 TASK [Apply deployment resources] ******************************** 
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined. 'web_manage_replicas' is undefined\n\nThe error appears to be in '/opt/ansible/roles/installer/tasks/resources_configuration.yml': line 248, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Apply deployment resources\n  ^ here\n"}

I saw this referenced in https://github.com/ansible/awx-operator/issues/1907, but I'm not upgrading from 2.18, and re-applying the CRDs didn't fix things for me.

jdratlif commented 2 months ago

It's not clear to me how the AWX CRD spec values get translated into Ansible vars, but this commit, https://github.com/ansible/awx-operator/commit/8ead140541622f67bd2d44a3c76bb05739cdebb6, added web_manage_replicas and task_manage_replicas and says they default to true. However, no corresponding defaults were added to defaults/main.yml, even though web_replicas and task_replicas are set to empty strings there. Don't web_manage_replicas and task_manage_replicas need to be set to true in those defaults as well?
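One plausible reading of the mechanism, sketched in Python: the Kubernetes API server fills in defaults declared in a CRD's OpenAPI schema when a custom resource is stored, and an Ansible-based operator passes the resulting spec fields to the role as variables. If the installed CRD has no default for web_manage_replicas, the variable never reaches the role and the task fails as in the log above. The schema fragments below are hypothetical and heavily trimmed, and `apply_schema_defaults` is an illustrative helper, not operator or Kubernetes code.

```python
from copy import deepcopy


def apply_schema_defaults(spec, properties):
    """Return a copy of spec with top-level schema defaults filled in.

    Only mimics the idea of CRD defaulting for flat, top-level fields.
    """
    result = deepcopy(spec)
    for name, schema in properties.items():
        if name not in result and "default" in schema:
            result[name] = schema["default"]
    return result


# Hypothetical schema fragment resembling a newer CRD (assumption).
NEWER_CRD_PROPERTIES = {
    "web_manage_replicas": {"type": "boolean", "default": True},
    "task_manage_replicas": {"type": "boolean", "default": True},
    "service_type": {"type": "string"},
}

# Hypothetical schema fragment resembling an older CRD (assumption).
OLDER_CRD_PROPERTIES = {
    "service_type": {"type": "string"},
}

# The user's spec does not mention the *_manage_replicas fields at all.
user_spec = {"service_type": "ClusterIP"}

with_newer = apply_schema_defaults(user_spec, NEWER_CRD_PROPERTIES)
with_older = apply_schema_defaults(user_spec, OLDER_CRD_PROPERTIES)

print("web_manage_replicas" in with_newer)  # True
print("web_manage_replicas" in with_older)  # False
```

Under this reading, a default in defaults/main.yml would act as a second safety net, but it is not required as long as the CRD carrying the schema default is the one actually installed.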

jdratlif commented 2 months ago

Okay, I think I know what is happening.

Another person is using the awx operator in our cluster. I didn't think this would matter because we're using different namespaces. But CRDs are not namespaced, so they are being overwritten at "random" times, and then I lose the values from the newer CRD definitions.

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.1
  # - awx.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.1

# Specify a custom namespace in which to install AWX
namespace: sea

Running kustomize on this does not fix the CRDs. But running

kubectl apply --server-side --force-conflicts -k "github.com/ansible/awx-operator/config/crd?ref=2.19.1"

does, at least until whatever helm job installs the older awx-operator in the other namespace kicks in. Do downgrades work but upgrades don't? Or maybe it's kustomize vs. helm; I'm not sure. I do know that the CRDs are being overwritten: after I delete my namespace and start over, I can check for postgres_data_volume_init in the CRD and it will be there as a field, but if I keep checking, it disappears and postgres_data_path is there instead.
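The manual check described above (is postgres_data_volume_init still a field in the CRD?) can be scripted. This is a minimal sketch that inspects a CRD object already loaded as a Python dict, for example from the JSON output of kubectl get crd. The CRD name awxs.awx.ansible.com is an assumption inferred from the kind and group in the manifest above, and the two sample CRDs are hypothetical, trimmed stand-ins rather than real definitions.

```python
def spec_fields(crd, version="v1beta1"):
    """Return the top-level spec property names for one CRD version."""
    for entry in crd.get("spec", {}).get("versions", []):
        if entry.get("name") == version:
            schema = entry.get("schema", {}).get("openAPIV3Schema", {})
            spec = schema.get("properties", {}).get("spec", {})
            return set(spec.get("properties", {}))
    return set()


def looks_current(crd, required=("web_manage_replicas",
                                 "task_manage_replicas",
                                 "postgres_data_volume_init")):
    """True if every field expected from the newer CRD is present."""
    fields = spec_fields(crd)
    return all(name in fields for name in required)


def make_crd(field_names):
    """Build a trimmed, hypothetical CRD dict with the given spec fields."""
    return {
        "spec": {
            "versions": [{
                "name": "v1beta1",
                "schema": {"openAPIV3Schema": {"properties": {"spec": {
                    "properties": {name: {} for name in field_names},
                }}}},
            }],
        },
    }


# Hypothetical stand-ins for the newer and the overwritten (older) CRD.
NEWER_CRD = make_crd(["web_manage_replicas", "task_manage_replicas",
                      "postgres_data_volume_init"])
OLDER_CRD = make_crd(["postgres_data_path"])

print(looks_current(NEWER_CRD))  # True
print(looks_current(OLDER_CRD))  # False
```

To run it against a live cluster, load the real object with json.load(sys.stdin) and pipe in kubectl get crd awxs.awx.ansible.com -o json. Running that periodically would show when the other install overwrites the definition.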