ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

AWX web/task pod not launching correctly - "Waiting for database migrations..." and cannot execute awx-manage commands "connection is bad: Name or service not known" #1636

Open containerckf opened 9 months ago

containerckf commented 9 months ago

Bug Summary

When installing a recent version of the AWX Operator (2.7.1, and others) on a v1.28 EKS cluster, AWX does not initialize correctly.

The web pod gets stuck in a loop waiting for the database migration to complete, even though the deployment is fresh. Running `awx-manage` commands from inside the pod fails with "connection is bad: Name or service not known" errors.

All the pods are in the Running state, but the web pod target shows as Unhealthy, and the AWX interface does not come up when accessing the ALB endpoint. This was verified by port forwarding to the web pod and hitting the IP directly, which confirmed that the ALB was routing correctly.

AWX Operator version

2.7.1

AWX version

23.3.1

Kubernetes platform

kubernetes

Kubernetes/Platform version

1.28

Modifications

no

Steps to reproduce

Installation is performed with a kustomization.yaml and an AWX ingress manifest, as outlined here.

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

generatorOptions:
  disableNameSuffixHash: true

secretGenerator:
  - name: awx-postgres-configuration
    type: Opaque
    literals:
      - host=awx-postgres
      - port=5432
      - database=awx
      - username=awx
      - password=Ansible123!
      - type=managed

  - name: awx-admin-password
    type: Opaque
    literals:
      - password=Ansible123!

resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.7.1
  - awx-ingress.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.7.1

# Specify a custom namespace in which to install AWX
namespace: awx

awx-ingress.yaml


---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:

  admin_user: admin
  admin_password_secret: awx-admin-password

  ingress_type: ingress
  ingress_path: "/"
  ingress_path_type: Prefix
  hostname: awx.dev.compucom.io
  ingress_annotations: |
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.redirect: "{\"Type\": \"redirect\", \"RedirectConfig\": {\"Protocol\": \"HTTPS\", \"Port\": \"443\", \"StatusCode\": \"HTTP_301\"}}"
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/XXX
    alb.ingress.kubernetes.io/load-balancer-attributes: "idle_timeout.timeout_seconds=360"

  postgres_configuration_secret: awx-postgres-configuration


These files are deployed with the command:

$ kubectl apply -k .

### Expected results

Database to initialize / web service to become ready (connect to AWX service via ALB to target pod running in EKS)

### Actual results

1. The AWX web container never starts properly. The following is seen in the logs:

kubectl -n awx logs awx-web-7b9777b649-5sw8g
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2 of 30
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3 of 30
...

"playbook task failed" {"level":"error","ts":"2023-09-19T12:21:02Z","msg":"Reconciler error","controller":"awx-controller","object":{"name":"awx","namespace":"awx"},"namespace":"awx","name":"awx","reconcileID":"3fbdfdb3-552a-47bd-a28a-f1dc6c1d9d9e","error":"event runner on failed","stacktrace"


2. The target (the AWX web pod) shows as Unhealthy in the AWS Console. The pod is Running but cannot "initialize" the database. There is no existing database to migrate, because these are fresh installs every time.

### Additional information

1. A similar issue [here](https://github.com/ansible/awx/issues/6539#issuecomment-609106337) mentioned a solution using an awx-manage command. We are able to exec into the container, but `awx-manage migrate --noinput` yields this output:

django.db.utils.OperationalError: connection is bad: Name or service not known
psycopg.OperationalError: connection is bad: Name or service not known


2. We performed the steps from a related [GitHub](https://github.com/ansible/awx-operator/issues/1506#issuecomment-1656740926) comment (setting the psql awx user password), but this did not resolve the issue.

Why can't AWX initialize correctly? We also verified that the named packages above were present. What could be inhibiting the connection?

### Operator Logs

_No response_

fosterseth commented 9 months ago

> django.db.utils.OperationalError: connection is bad: Name or service not known

It seems you have connectivity issues to your DB, so you may need to take some debugging steps to see why the connections are failing.

It looks like you are setting up an internal database (running as a pod inside the same cluster as awx-task/web). Is that right? If so, you shouldn't need to set the postgres_configuration_secret. Does it work fine without setting the postgres configuration?
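
One way to take those debugging steps is to test name resolution separately from the database itself, since "Name or service not known" is the message the system resolver returns when a hostname lookup fails. The following is a minimal sketch, assuming a Python interpreter is available inside the web or task container. The `check_host` helper is hypothetical and not part of AWX or the operator.

```python
import socket


def check_host(host, port, timeout=3.0):
    """Return (resolved_addresses, tcp_ok) for host:port.

    resolved_addresses is an empty list when the name lookup fails,
    which is the condition behind "Name or service not known".
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The resolver could not turn the name into an address.
        return [], False

    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    addresses = sorted({info[4][0] for info in infos})
    try:
        # create_connection tries each resolved address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        # The name resolved, but nothing accepted the TCP connection.
        return addresses, False
```

Calling `check_host("awx-postgres", 5432)` with the host and port from the awx-postgres-configuration secret would distinguish the two cases: an empty address list points at the hostname or cluster DNS, while a resolved address with a failed connection points at the database pod or a network policy. Comparing the host value against the Service names that actually exist in the awx namespace (`kubectl -n awx get svc`) is a natural follow-up.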

sasvari-attila-bosch commented 3 months ago

I am experiencing a very similar issue when installing AWX with operator version 2.18.0, although my setup is somewhat different.

I have my Postgres in Azure behind a VNet. I mount a custom /etc/resolv.conf through ee_extra_volume_mounts, task_extra_volume_mounts, init_container_extra_volume_mounts, etc. From there (e.g. awx-web, awx-task) my Azure Postgres Flexi server is reachable.

However, the AWX migration Job is not configured to use it, and its pods apparently can't resolve the address of my Postgres (django.db.utils.OperationalError: [Errno -2] Name or service not known).

Is there a way to configure the migration Job to use Azure's DNS resolver?
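
To confirm that the migration Job's pod really sees a different resolver than the web and task pods, the nameserver entries in each pod can be compared. The following is a minimal sketch, assuming Python is available in the pod. `read_nameservers` is a hypothetical helper, and it only reports what the pod sees; it does not change the Job's configuration.

```python
from pathlib import Path


def read_nameservers(path="/etc/resolv.conf"):
    """Return the nameserver addresses listed in a resolv.conf-style file."""
    servers = []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

Running `read_nameservers()` in the awx-web pod and in the migration pod and comparing the two lists would show whether the custom /etc/resolv.conf mount is missing from the Job's pod.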

vpelagatti commented 3 weeks ago

@sasvari-attila-bosch, have you solved this issue? I'm facing the same problem.

sasvari-attila-bosch commented 2 weeks ago

> @sasvari-attila-bosch, have you solved this issue? I'm facing the same problem.

@vpelagatti, I wasn't able to resolve it using version 2.18.0, and I haven't tried with 2.19.*.