ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

Bad Gateway caused by unnecessary database migration on new installation on K8s #848

Open johanneskastl opened 2 years ago

johanneskastl commented 2 years ago

Summary

New installation of awx-controller and awx in a Kubernetes cluster.

The installation finished and the pods are running, but reaching the website only returns Bad Gateway.

For some reason, the awx-web pod tries to migrate the database (even if there is nothing to migrate, as it was just created).

AWX version

awx-controller 0.17.0

Installation method

kubernetes

Modifications

no

Ansible version

not relevant, as Kubernetes only

Operating system

not relevant, as Kubernetes only

Web browser

No response

Steps to reproduce

git clone https://github.com/ansible/awx-operator.git
cd awx-operator
git checkout 0.17.0

# Deploy new AWX Operator
export NAMESPACE=awx
make deploy

Then create an awx.yaml (mostly just reducing the limits/requests):

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-ojkastl
spec:
  projects_persistence: true
  projects_storage_access_mode: ReadWriteMany
  projects_storage_size: 1Gi
  admin_user: <redacted>
  admin_email: <redacted>
  admin_password_secret: awx-admin-password
  service_type: ClusterIP
  ingress_type: ingress
  hostname: <redacted>
  web_resource_requirements:
    requests:
      cpu: 250m
      memory: 500Mi
    limits:
      cpu: 250m
      memory: 500Mi
  ee_resource_requirements:
    requests:
      cpu: 250m
      memory: 500Mi
    limits:
      cpu: 250m
      memory: 500Mi
  task_resource_requirements:
    requests:
      cpu: 250m
      memory: 500Mi
    limits:
      cpu: 250m
      memory: 500Mi
  postgres_resource_requirements:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: 250m
      memory: 1Gi

Apply the file and wait until all pods are running. Then curl the ingress, which returns Bad Gateway. The logs of the pod show something like this:

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2 of 30
[wait-for-migrations] Waiting 1 seconds before next attempt
[...]
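
The waits in this log (and in the longer excerpt further down the thread: 0.5, 1, 2, 4, 8, 16 seconds) look like a doubling backoff. The sketch below only illustrates how quickly such a schedule grows. It assumes the delay starts at 0.5 s and doubles after every attempt; the `cap` parameter and the function name are hypothetical, since the log does not show whether the real script limits the delay.

```python
from typing import List, Optional


def backoff_delays(attempts: int, first: float = 0.5,
                   cap: Optional[float] = None) -> List[float]:
    """Return the wait in seconds after each attempt, doubling every time.

    `cap` is a hypothetical upper bound; None means unbounded doubling.
    """
    delays = []
    delay = first
    for _ in range(attempts):
        delays.append(delay if cap is None else min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    # The first six waits, matching the values printed in the pod logs.
    print(backoff_delays(6))
    # Total time spent waiting before a seventh attempt would start.
    print(sum(backoff_delays(6)))
```

With this schedule the first six attempts take only about half a minute in total, so a pod that is still printing these lines has not necessarily been waiting long.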

Expected results

On new installations no database migration is necessary, hence it should not be executed.

Actual results

The UI is not reachable due to Bad Gateway, and the pods never finish their database migration.

Additional information

Even though it should not matter, this is a 3-node k3s cluster running v1.22.7+k3s1.

WebSpider commented 2 years ago

Workaround for this is to cycle the database pod. The migration is then aborted, and not restarted.

johanneskastl commented 2 years ago

Workaround for this is to cycle the database pod. The migration is then aborted, and not restarted.

Thanks for the tip. But IMHO this requires a solution, not a workaround. It does not make a good impression if a newly installed AWX does not start properly... :-)

WebSpider commented 2 years ago

I absolutely agree with you!

merickso commented 2 years ago

I'm seeing the same thing, but cycling the postgres pod and the awx pod doesn't fix the problem.

jompins commented 2 years ago

I am seeing a similar issue on a fresh install. It seems like there is no database getting deployed during a fresh install. At least awx-manage cannot connect to the database pod, and pg_lsclusters doesn't show a cluster up and running. Cycling the db pod also doesn't solve the issue.

fosterseth commented 2 years ago

DB migrations are still required on a fresh install, since we haven't squashed all of our migration files into a single file.

Do the migrations eventually run to completion (can take a while, give it a good 20 minutes), after which the UI will start being responsive?

mac-chaffee commented 2 years ago

Sometimes the message about waiting for migrations can be misleading. The script /usr/local/bin/wait-for-migrations just runs awx-manage check and awx-manage showmigrations. Those can error for other reasons, like malformed LDAP config causing settings.py to throw a syntax error.
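
To tell "migrations are still pending" apart from "the command itself is failing", it can help to run awx-manage showmigrations by hand in the web or task container and look at its output. The sketch below is only an illustration of reading that output: it assumes Django's default list format, where each migration line starts with [X] (applied) or [ ] (not applied), and the function name and sample migration names are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as " [X] 0001_initial" or " [ ] 0003_example".
_MIGRATION_LINE = re.compile(r"^\s*\[(?P<mark>[X ])\]\s+(?P<name>\S+)")


def summarize_showmigrations(output: str) -> Dict[str, int]:
    """Count applied and pending migrations in `showmigrations` list output."""
    applied = 0
    pending = 0
    for line in output.splitlines():
        match = _MIGRATION_LINE.match(line)
        if match is None:
            continue  # app headings, blank lines, or error text
        if match.group("mark") == "X":
            applied += 1
        else:
            pending += 1
    return {"applied": applied, "pending": pending}


if __name__ == "__main__":
    # Hypothetical sample output; real migration names will differ.
    sample = "main\n [X] 0001_initial\n [X] 0002_example\n [ ] 0003_example\n"
    print(summarize_showmigrations(sample))
```

If the output contains no migration lines at all (for example only a traceback), both counts are zero, which points at an error in the command rather than at unfinished migrations.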

johanneskastl commented 2 years ago

Do the migrations eventually run to completion (can take a while, give it a good 20 minutes), after which the UI will start being responsive?

I did not encounter this issue in my recent tests, but it seems like there might be some kind of race condition or moon phase or similar, so it might or might not happen... :-(

jompins commented 2 years ago

At least on my end, I was able to trace this down to problems with containerd not properly setting up the container NAT. The pods were simply not able to connect to each other. After switching back to legacy iptables on the k3s node hosts, it worked again.

arodriguezd commented 2 years ago

In my case it also showed me the Bad Gateway error, and the pod showed me the message you indicated. I use Ubuntu 20 with k3s, but I had disabled IPv6 at the kernel level.

So I re-enabled it and it no longer gave me problems. I followed the recommendations given here:

https://github.com/kurokobo/awx-on-k3s

It is very complete.

estsauver commented 2 years ago

I'm seeing an error trying to deploy to a fresh EKS cluster.

kubectl logs -f deployments/awx-operator-controller-manager -c awx-manager -n awx


-------------------------------------------------------------------------------
{"level":"info","ts":1658650589.8996763,"logger":"runner","msg":"Ansible-runner exited successfully","job":"7807791897404560431","name":"awx-demo","namespace":"awx"}

----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx-demo/awx) -----

PLAY RECAP *********************************************************************
localhost                  : ok=66   changed=2    unreachable=0    failed=0    skipped=45   rescued=0    ignored=0   
estsauver@Earls-MBP k8s % kubectl logs awx-demo-bcb97966d-j7rph -n awx -c awx-demo-web                      
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2 of 30
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3 of 30
[wait-for-migrations] Waiting 2 seconds before next attempt
[wait-for-migrations] Attempt 4 of 30
[wait-for-migrations] Waiting 4 seconds before next attempt
[wait-for-migrations] Attempt 5 of 30
[wait-for-migrations] Waiting 8 seconds before next attempt
[wait-for-migrations] Attempt 6 of 30
[wait-for-migrations] Waiting 16 seconds before next attempt

kcslb92 commented 2 years ago

Hi Guys,

I've also got this issue deploying into k3s as per https://github.com/kurokobo/awx-on-k3s on CentOS 8. I have tried cycling the database pod; I assume you just meant deleting it and letting it be recreated?

Hoping someone has some steps on how to rectify this :).

Cheers!

phinx110 commented 2 years ago

Wiped my v0.25.0 operator to install the new v0.28.0 release. The clean install did not work. I looked at the operator logs with no_log: set to "false". The following task failed: TASK [installer : Create super user via Django if it doesn't exist.] ***********

I got the following trace:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "awx-postgres" to address: Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 185, in manage
    if (connection.pg_version // 10000) < 12:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/connection.py", line 15, in __getattr__
    return getattr(self._connections[self._alias], item)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/functional.py", line 48, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 329, in pg_version
    with self.temporary_connection():
  File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 603, in temporary_connection
    with self.cursor() as cursor:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    self.ensure_connection()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not translate host name "awx-postgres" to address: Name or service not known

I assume that the service that points to postgres got renamed from "awx-postgres" to "awx-postgres-13", but the task still references "awx-postgres". As a workaround, I created my own "awx-postgres" service of type ExternalName that redirects "awx-postgres" traffic to "awx-postgres-13":

kind: Service
apiVersion: v1
metadata:
  name: awx-postgres
  namespace: awx
spec:
  type: ExternalName
  externalName: awx-postgres-13.awx.svc.cluster.local

Then after a while the migration succeeded. (To see the tables, shell into the pod and run psql --user awx, then \c awx.) Eventually I was able to see the login screen, though you might need to wait some time. You may or may not need to kill both the postgres and awx pods if it doesn't work yet.

kurokobo commented 2 years ago

@phinx110 I think it's not a bug. AWX takes the hostname for PSQL from the Secret resource that is created by the Operator. Operator 0.28.0 correctly creates the Secret with the hostname <instance name>-postgres-<version>: https://github.com/ansible/awx-operator/blob/0.28.0/roles/installer/templates/secrets/postgres_secret.yaml.j2#L19

Wiped my v0.25.0 operator to install the new v0.28.0 release

I guess your old Secret resource with the old hostname was reused because it had not been wiped correctly. AWX Operator reuses the Secret if it already exists.
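
One way to spot such a leftover Secret is to compare the host stored in it with the name the Operator would generate for the current version. The following is a minimal sketch, not the Operator's own logic: it assumes the Secret's data has a key named host and that the base64-encoded data was already fetched elsewhere (for example from kubectl get secret -o json). All function names are hypothetical.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64 values found under a Kubernetes Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


def expected_postgres_host(instance_name: str, pg_version: int) -> str:
    """Build the <instance name>-postgres-<version> hostname pattern."""
    return f"{instance_name}-postgres-{pg_version}"


def host_is_stale(secret_data: Dict[str, str], instance_name: str,
                  pg_version: int) -> bool:
    """Return True if the Secret's host differs from the expected hostname."""
    actual = decode_secret_data(secret_data).get("host")  # key name is an assumption
    return actual != expected_postgres_host(instance_name, pg_version)


if __name__ == "__main__":
    # Simulated Secret data holding the old hostname from this thread.
    old_secret = {"host": base64.b64encode(b"awx-postgres").decode("ascii")}
    print(host_is_stale(old_secret, "awx", 13))
```

A stale result would suggest deleting the old Secret before reinstalling, rather than adding an alias Service.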

phinx110 commented 2 years ago

@kurokobo This was indeed the case.

Now I have wiped my cluster again, deleted all remaining secrets and configmaps (just to be sure) inside the awx namespace, and reinstalled the entire stack. Now I get the following failure:

TASK [installer : Check if there are any super users defined.] *****************
task path: /opt/ansible/roles/installer/tasks/initialize_django.yml:2

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  password authentication failed for user "awx"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 185, in manage
    if (connection.pg_version // 10000) < 12:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/connection.py", line 15, in __getattr__
    return getattr(self._connections[self._alias], item)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/functional.py", line 48, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 329, in pg_version
    with self.temporary_connection():
  File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 603, in temporary_connection
    with self.cursor() as cursor:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    self.ensure_connection()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  password authentication failed for user "awx"

TASK [installer : Create super user via Django if it doesn't exist.] fails for the same reason but at different code lines.

P.S. I solved this by:

kurokobo commented 2 years ago

@phinx110 Hmm, have you wiped the actual data in the PV for PSQL before re-installation? AWX Operator can't update the password for the specified user if the user already exists in PSQL. I don't know what storage type is used for your PV, but wiping the data in the PV manually is sometimes required for some storage types, e.g. hostPath or NFS, since the new PV is created with the existing data files and the data will be reused by the new PSQL instance.
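
The two database errors reported in this thread point at different leftovers, so it can be useful to sort a log line into one of them before deciding what to wipe. The sketch below is only a rough triage helper based on the messages quoted here; the category names and the function are hypothetical, and other causes (such as the pod networking problems mentioned earlier) can produce the same messages.

```python
from typing import List, Tuple

# (substring to look for, hypothetical category name)
_KNOWN_MESSAGES: List[Tuple[str, str]] = [
    ("could not translate host name", "name-resolution"),
    ("name or service not known", "name-resolution"),
    ("password authentication failed", "credentials"),
]

# Hints drawn from the explanations given in this thread.
HINTS = {
    "name-resolution": "Check the host in the postgres Secret and pod-to-pod networking.",
    "credentials": "Check for old PSQL data in the PV or a mismatched password Secret.",
    "unknown": "No match; read the full traceback.",
}


def likely_cause(error_message: str) -> str:
    """Map a database error message to a rough category."""
    lowered = error_message.lower()
    for needle, category in _KNOWN_MESSAGES:
        if needle in lowered:
            return category
    return "unknown"


if __name__ == "__main__":
    message = 'FATAL:  password authentication failed for user "awx"'
    print(HINTS[likely_cause(message)])
```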

phinx110 commented 2 years ago

@kurokobo I'm not sure regarding the PV. I didn't want to tear down my current setup just to test this, because I need to get some work done. I tried installing the operator in another namespace to create a separate awx setup on my dev cluster, but I got: Helm install failed: clusterroles.rbac.authorization.k8s.io "awx-operator-proxy-role" already exists, because my first operator was still installed. I'll keep an eye on the fresh install scenario when I deploy to the staging server.

phinx110 commented 2 years ago

@kurokobo So I installed the v0.29.0 operator and an awx instance on a fresh, untouched server and it was successful. I did not need to do anything manually to get it working. I did have to wait a bit for it to come through.

bert-jan commented 2 years ago

Had the same experience deploying into k3s as per https://github.com/kurokobo/awx-on-k3s (tag 0.30.0) on Ubuntu 20.04. After applying the service workaround as suggested by @phinx110, the AWX GUI started working. I guess the version number should be removed, as it might cause issues in the future when moving to postgres v14/15 etc.?

ghost commented 1 year ago

I also faced this problem when using the AWX operator to deploy AWX. I found that the postgresql user awx had no password, so I set the awx password to the same value as in the secret and then deleted the awx pod. After doing this, everything works!

Gokusan31 commented 1 year ago

Hello,

I got a similar problem, but with a curious error about postgres (found in the awx-controller logs): File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/connection.py\", line 728, in connect", " raise ex.with_traceback(None)", "django.db.utils.OperationalError: connection is bad: Name or service not known"], "stdout": "", "stdout_lines": []}

I tried to run awx-manage create_preload_data and got the same "connection is bad: Name or service not known". Curl from the awx-task or web container to the postgres container @IP:5432 is not working either.

My pods:

[root@cad-pod-01:~]# k get pod
NAME                                               READY   STATUS    RESTARTS   AGE
ansible-awx-postgres-13-0                          1/1     Running   0          12m
ansible-awx-web-84c8ff665-gxlft                    3/3     Running   0          12m
ansible-awx-task-5bbbc974dd-4gcwb                  4/4     Running   0          12m
awx-operator-controller-manager-7978c48674-b4csv   2/2     Running   0          12m

After 30 retries, the pods restart.

The initial configuration used a proxy. I unset it everywhere (env vars and systemd service file), but without success.

Thank you very much for your help.