ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

awx-web pod fails to start with error "Handshake status 500 Internal Server Error" #1866

Closed yyosha closed 1 month ago

yyosha commented 1 month ago

Bug Summary

New deployment of version 2.16.1 using kustomize on an existing EKS cluster. The exact same deployment with version 2.10.0 works perfectly.

The deployment is stuck with awx-web in CrashLoopBackOff.

$ kubectl describe pods -n awx awx-dev-web-c48c45544-ffqkw
...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned awx/awx-dev-web-c48c45544-ffqkw to ip-10-167-0-76.ec2.internal
  Normal   Pulled     23m                   kubelet            Container image "quay.io/ansible/awx-ee:24.3.1" already present on machine
  Normal   Created    23m                   kubelet            Created container init
  Normal   Started    23m                   kubelet            Started container init
  Normal   Pulled     23m                   kubelet            Container image "quay.io/centos/centos:stream9" already present on machine
  Normal   Created    23m                   kubelet            Created container init-projects
  Normal   Started    23m                   kubelet            Started container init-projects
  Normal   Created    23m                   kubelet            Created container redis
  Normal   Pulled     23m                   kubelet            Container image "docker.io/redis:7" already present on machine
  Normal   Started    23m                   kubelet            Started container redis
  Normal   Pulled     23m                   kubelet            Container image "quay.io/ansible/awx:24.3.1" already present on machine
  Normal   Created    23m                   kubelet            Created container awx-dev-rsyslog
  Normal   Started    23m                   kubelet            Started container awx-dev-rsyslog
  Normal   Created    22m (x3 over 23m)     kubelet            Created container awx-dev-web
  Normal   Started    22m (x3 over 23m)     kubelet            Started container awx-dev-web
  Normal   Pulled     21m (x4 over 23m)     kubelet            Container image "quay.io/ansible/awx:24.3.1" already present on machine
  Warning  BackOff    3m35s (x75 over 22m)  kubelet            Back-off restarting failed container awx-dev-web in pod awx-dev-web-c48c45544-ffqkw_awx(6bf702c0-0617-48ed-b3dc-a9adb1d2ff46)
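
The events only show that awx-dev-web keeps restarting, not why it exits. A minimal sketch of how the exit reason and the log of the crashed container could be collected with standard kubectl calls; the function name crash_details is hypothetical, and the namespace, pod and container names in the usage line are the ones from this report.

```shell
# Hypothetical helper, not part of awx-operator.
# Usage: crash_details <namespace> <pod> <container>
crash_details() {
  ns="$1"; pod="$2"; ctr="$3"
  # Last termination state (exit code, reason, timestamps) of the named container.
  kubectl get pod -n "$ns" "$pod" \
    -o jsonpath="{.status.containerStatuses[?(@.name==\"$ctr\")].lastState.terminated}"
  echo
  # Log of the previous, crashed instance of the container.
  kubectl logs -n "$ns" "$pod" -c "$ctr" --previous
}
```

For the pod above this would be called as `crash_details awx awx-dev-web-c48c45544-ffqkw awx-dev-web`.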

In operator logs I get this message:

...
TASK [installer : Get the new resource pod information after updating resource.] ***
task path: /opt/ansible/roles/installer/tasks/resources_configuration.yml:258
skipping: [localhost] => {\"changed\": false, \"false_condition\": \"this_deployment_result.changed\", \"skip_reason\": \"Conditional result was False\"}

TASK [installer : Update new resource pod as a variable.] **********************
task path: /opt/ansible/roles/installer/tasks/resources_configuration.yml:275
skipping: [localhost] => {\"changed\": false, \"false_condition\": \"this_deployment_result.changed\", \"skip_reason\": \"Conditional result was False\"}

TASK [installer : Update new resource pod name as a variable.] *****************
task path: /opt/ansible/roles/installer/tasks/resources_configuration.yml:283
skipping: [localhost] => {\"changed\": false, \"false_condition\": \"this_deployment_result.changed\", \"skip_reason\": \"Conditional result was False\"}

TASK [installer : Verify the resource pod name is populated.] ******************
task path: /opt/ansible/roles/installer/tasks/resources_configuration.yml:289
ok: [localhost] => {
    \"changed\": false,
    \"msg\": \"All assertions passed\"
}

TASK [installer : Migrate database to the latest schema] ***********************
task path: /opt/ansible/roles/installer/tasks/install.yml:97
included: /opt/ansible/roles/installer/tasks/migrate_schema.yml for localhost

TASK [installer : Check for pending migrations] ********************************
task path: /opt/ansible/roles/installer/tasks/migrate_schema.yml:3
fatal: [localhost]: FAILED! => {\"changed\": false, \"msg\": \"Failed to execute on pod awx-dev-web-c48c45544-ffqkw due to : (0)\
Reason: Handshake status 500 Internal Server Error -+-+- {'content-length': '35', 'content-type': 'text/plain; charset=utf-8', 'date': 'Tue, 21 May 2024 15:07:16 GMT'} -+-+- b'container not found ("awx-dev-web")'\
\"}

PLAY RECAP *********************************************************************
localhost                  : ok=71   changed=0    unreachable=0    failed=1    skipped=68   rescued=0    ignored=0   
","job":"6881205681729212860","name":"awx-dev","namespace":"awx","error":"exit status 2","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/ansible/runner.(*runner).Run.func1
\tansible-operator-plugins/internal/ansible/runner/runner.go:269"

----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx-dev/awx) -----

PLAY RECAP *********************************************************************
localhost                  : ok=71   changed=0    unreachable=0    failed=1    skipped=68   rescued=0    ignored=0

----------
{"level":"error","ts":"2024-05-21T15:07:16Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx-dev","namespace":"awx"},"namespace":"awx","name":"awx-dev","reconcileID":"76492afe-ee5f-46c4-9f6d-0f7a821852a4","error":"event runner on failed","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2024-05-21T15:07:17Z","logger":"logging_event_handler","msg":"[playbook task start]","name":"awx-dev","namespace":"awx","gvk":"awx.ansible.com/v1beta1, Kind=AWX","event_type":"playbook_on_task_start","job":"3631807449646318833","EventData.Name":"Verify imagePullSecrets"}
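
The failing step is easy to miss among the skipped tasks. The filter below is a rough sketch that prints each failed task together with its `fatal:` line from an operator log read on standard input. The function name failed_tasks is made up for illustration, and it assumes the log lines start with `TASK [` and `fatal:` as in the excerpt above.

```shell
# Hypothetical filter: print the task header followed by its "fatal:" line.
failed_tasks() {
  awk '
    /^TASK \[/ { task = $0 }         # remember the most recent task header
    /^fatal:/  { print task; print } # on failure, print header and error line
  '
}
```

An operator log saved to a file, for example operator.log (file name assumed), could be filtered with `failed_tasks < operator.log`. For the excerpt above the output would point at "Check for pending migrations", where the exec into awx-dev-web failed with "container not found", most likely because the container was between restarts.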

AWX Operator version

2.16.1

AWX version

24.3.1

Kubernetes platform

kubernetes

Kubernetes/Platform version

1.29

Modifications

yes

Steps to reproduce

An existing EKS cluster with the CSI drivers for EFS and EBS and an ALB, all updated to the latest versions.

All EFS filesystems were cleared of any data.

$ kubectl apply -k .

Expected results

Running AWX with web access to console.

Actual results

The deployment is stuck with awx-web in CrashLoopBackOff.

Additional information

Customized awx-ee:

$ cat execution-environment.yml
---
version: 3

images:
  base_image:
    name: quay.io/ansible/awx-ee:latest

dependencies:
  ansible_core:
    # Require minimum of 2.15 to get ansible-inventory --limit option
    package_pip: ansible-core>=2.15.0rc2,<2.16
  ansible_runner:
    package_pip: ansible-runner
  galaxy: requirements.yml
  system: bindep.txt
  python: requirements.txt

additional_build_files:
  - src: ansible.cfg
    dest: configs
  - src: ca-extras
    dest: ca-extras

additional_build_steps:
  prepend_galaxy:
    - ADD _build/configs/ansible.cfg ~/.ansible.cfg
  append_base:
    - RUN $PYCMD -m pip install -U pip
    - ADD _build/ca-extras ./ca-extras
    - RUN cp ./ca-extras/My-Root-CA.pem /etc/pki/ca-trust/source/anchors; update-ca-trust extract
  append_final:
    - COPY --from=quay.io/ansible/receptor:devel /usr/bin/receptor /usr/bin/receptor
    - RUN mkdir -p /var/run/receptor
    - RUN git lfs install --system
    - RUN python3 -m pip install boto3
    - RUN unzip ./ca-extras/awscliv2.zip; ./aws/install
    - RUN dnf install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm
$ cat bindep.txt
git-core [platform:rpm]
python3.9-devel [platform:rpm compile]
libcurl-devel [platform:rpm compile]
krb5-devel [platform:rpm compile]
krb5-workstation [platform:rpm]
subversion [platform:rpm]
subversion [platform:dpkg]
git-lfs [platform:rpm]
sshpass [platform:rpm]
rsync [platform:rpm]
epel-release [platform:rpm]
python-unversioned-command [platform:rpm]
unzip [platform:rpm]
jq [platform:rpm]
podman-remote [platform:rpm]
cmake [platform:rpm compile]
gcc [platform:rpm compile]
gcc-c++ [platform:rpm compile]
make [platform:rpm compile]
openssl-devel [platform:rpm compile]
$ cat requirements.yml
---
collections:
- amazon.aws
- ansible.posix
- ansible.windows
- awx.awx
- azure.azcollection
- community.aws
- community.general
- community.vmware
- google.cloud
- kubernetes.core
- openstack.cloud
- ovirt.ovirt
- redhatinsights.insights
- theforeman.foreman
- cyberark.conjur
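
An image from these three files would typically be built with ansible-builder. A sketch, assuming ansible-builder 3.x is installed (the definition uses `version: 3`) and the files sit in the current directory; the wrapper name build_ee is hypothetical, and the tag in the usage line is the one referenced in the AWX resource below.

```shell
# Hypothetical wrapper around the ansible-builder CLI.
# Usage: build_ee <definition-file> <image-tag>
build_ee() {
  # --file selects the execution environment definition, --tag names the resulting image.
  ansible-builder build --file "$1" --tag "$2"
}
```

Example call: `build_ee execution-environment.yml myrepo/my-awx-ee:2.16.1_1`.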

Customized resources:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX

metadata:
  name: awx-dev

spec:
  ## These parameters are designed for use with:
  ## - AWX Operator: 2.10
  ##   https://github.com/ansible/awx-operator/blob/2.10.0/README.md
  ## - AWX: 23.6.0
  ##   https://github.com/ansible/awx/blob/23.6.0/INSTALL.md
  ##
  ## Upgraded to:
  ## - AWX Operator: 2.16.1
  ##   https://github.com/ansible/awx-operator/blob/2.16.1/README.md
  ## - AWX: 24.3.1
  ##   https://github.com/ansible/awx/blob/24.3.1/INSTALL.md

  ## This line controls the log output of the deployment
  no_log: false

  ## Disable ip_v6
  ipv6_disabled: true

  ##################################
  ##              awx             ##
  ##################################

  admin_user: admin
  admin_password_secret: awx-admin-password
  bundle_cacert_secret: my-ca-bundle

  ## hostname value is used in the ALB Listener rules
  ## if host is equal to <hostname value> then traffic will be forwarded to Target Group
  hostname: awx-dev.mydom.com

  ## Customized control-plane-ee
  control_plane_ee_image: myrepo/my-awx-ee:2.16.1_1

  ## Customized awx-ee
  ee_images:
    - name: custom-awx-ee
      image: myrepo/my-awx-ee:2.16.1_1

  ## Custom ee docker pull secret
  image_pull_secrets:
    - awx-custom-ee-docker-reg-secret

  ## console listens on nodes port so ALB ingress can be used
  service_type: NodePort
  nodeport_port: 30080

  ## make projects data persistent on EFS
  ## need storage class, filesystem & mount points on all subnets to be pre-configured
  projects_persistence: true
#  ## use either -
#  ## 'projects_storage_class' for dynamic allocation of persistent volume
#  ## 'projects_existing_claim' for pre-configured persistent volume claim
#  projects_storage_class: efs-projects-storageclass
  projects_existing_claim: awx-projects-claim

  ##################################
  ##            ingress           ##
  ##################################

  ingress_type: ingress
  ingress_path: '/'
  ingress_path_type: Prefix
  ingress_annotations: |
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: "arn:aws:acm:us-east-1:123456789012:certificate/37678af9-d455-4905-9385-b6369e7a23d7"
    alb.ingress.kubernetes.io/ssl-policy: 'ELBSecurityPolicy-TLS13-1-2-Res-2021-06'
    alb.ingress.kubernetes.io/scheme: 'internal'
    alb.ingress.kubernetes.io/target-type: 'instance'
    alb.ingress.kubernetes.io/ip-address-type: 'ipv4'
    alb.ingress.kubernetes.io/security-groups: 'sg-123456789012'
    alb.ingress.kubernetes.io/load-balancer-attributes: 'idle_timeout.timeout_seconds=360'
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/healthcheck-port: traffic-port
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '15'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
    alb.ingress.kubernetes.io/success-codes: '200'
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'

  ##################################
  ##          postgresql          ##
  ##################################

  ## enable migration to new Postgresql db
  #old_postgres_configuration_secret: old-awx-postgres-configuration

  postgres_configuration_secret: awx-postgres-configuration

  ## make postgres db persistent on EFS
  ## need storage class, filesystem & mount points on all subnets to be pre-configured
  postgres_storage_class: efs-postgres-15-storageclass
  postgres_storage_requirements:
    requests:
      storage: 15Gi
    limits:
      storage: 35Gi

## EOF

Operator Logs

(Same operator log excerpt as in the Bug Summary above.)

yyosha commented 1 month ago

Closed since this is part of https://github.com/ansible/awx-operator/issues/1870