Open johanneskastl opened 2 years ago
Workaround for this is to cycle the database pod. The migration is then aborted, and not restarted.
Thanks for the tip. But IMHO this requires a solution, not a workaround. It does not make a good impression if a newly installed AWX does not start properly... :-)
I absolutely agree with you!
I'm seeing the same thing but cycling the postgres pod and the awx pod doesn't fix the problem?
I am seeing a similar issue on a fresh install. It seems like no database is getting deployed during a fresh install. At least awx-manage cannot connect to the database pod, and pg_lsclusters doesn't show a cluster up and running. Cycling the db pod also doesn't solve the issue.
db migrations are still required on a fresh install, since we haven't squashed all of our migration files into a singular file.
Do the migrations eventually run to completion (can take a while, give it a good 20 minutes), after which the UI will start being responsive?
Sometimes the message about waiting for migrations can be misleading. The script /usr/local/bin/wait-for-migrations just runs awx-manage check and awx-manage showmigrations. Those can error for other reasons, like a malformed LDAP config causing settings.py to throw a syntax error.
I did not encounter this issue in my recent tests, but it seems like there might be some kind of race condition or moon phase or similar, so it might or might not happen... :-(
At least on my end I was able to troubleshoot this down to problems with containerd not properly setting up the container NAT. The pods were simply not able to connect to each other. After switching back to legacy iptables on the k3s node hosts, it worked again.
In my case it also showed me the bad gateway error, and the pod showed me the message you indicated. I use Ubuntu 20 with k3s, but I had disabled IPv6 at the kernel level.
So I re-enabled it and it no longer gave me problems. I followed the recommendations given here:
https://github.com/kurokobo/awx-on-k3s
It is very complete.
I'm seeing an error trying to deploy to a fresh eks cluster.
kubectl logs -f deployments/awx-operator-controller-manager -c awx-manager -n awx
-------------------------------------------------------------------------------
{"level":"info","ts":1658650589.8996763,"logger":"runner","msg":"Ansible-runner exited successfully","job":"7807791897404560431","name":"awx-demo","namespace":"awx"}
----- Ansible Task Status Event StdOut (awx.ansible.com/v1beta1, Kind=AWX, awx-demo/awx) -----
PLAY RECAP *********************************************************************
localhost : ok=66 changed=2 unreachable=0 failed=0 skipped=45 rescued=0 ignored=0
estsauver@Earls-MBP k8s % kubectl logs awx-demo-bcb97966d-j7rph -n awx -c awx-demo-web
[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
[wait-for-migrations] Waiting 0.5 seconds before next attempt
[wait-for-migrations] Attempt 2 of 30
[wait-for-migrations] Waiting 1 seconds before next attempt
[wait-for-migrations] Attempt 3 of 30
[wait-for-migrations] Waiting 2 seconds before next attempt
[wait-for-migrations] Attempt 4 of 30
[wait-for-migrations] Waiting 4 seconds before next attempt
[wait-for-migrations] Attempt 5 of 30
[wait-for-migrations] Waiting 8 seconds before next attempt
[wait-for-migrations] Attempt 6 of 30
[wait-for-migrations] Waiting 16 seconds before next attempt
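The waits in the log above double on every attempt (0.5, 1, 2, 4, 8, 16 seconds). As a rough illustration only, here is a minimal Python sketch of such a doubling schedule. It is not the actual wait-for-migrations script: the function name and parameters are hypothetical, and the log does not show whether the real script caps the delay on later attempts.

```python
def backoff_delays(attempts, first_delay=0.5, factor=2):
    """Return the wait in seconds after each of the first `attempts` tries.

    Assumption: the delay simply doubles each time, as seen for the first
    six attempts in the log; any cap used by the real script is not modeled.
    """
    return [first_delay * factor ** i for i in range(attempts)]


if __name__ == "__main__":
    # Reproduce the shape of the log lines for the six attempts shown.
    for attempt, delay in enumerate(backoff_delays(6), start=1):
        print(f"Attempt {attempt}: waiting {delay:g} seconds before next attempt")
```

Summing the list shows how long the pod has already been waiting at a given attempt, which helps when deciding whether the migrations are slow or the checks are failing for another reason.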
Hi Guys,
I've also got this issue deploying into k3s as per https://github.com/kurokobo/awx-on-k3s on CentOS 8. I have tried cycling the database pod; I assume you just meant to delete it and have it recreated?
Hoping someone has some steps on how to rectify this :).
Cheers!
Wiped my v0.25.0 operator to install the new v0.28.0 release. Clean install did not work. Looked at the operator logs with no_log: set to "false". The following task failed:
TASK [installer : Create super user via Django if it doesn't exist.] ***********
I got the following trace:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "awx-postgres" to address: Name or service not known
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 185, in manage
if (connection.pg_version // 10000) < 12:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/connection.py", line 15, in __getattr__
return getattr(self._connections[self._alias], item)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/functional.py", line 48, in __get__
res = instance.__dict__[self.name] = self.func(instance)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 329, in pg_version
with self.temporary_connection():
File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 603, in temporary_connection
with self.cursor() as cursor:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 235, in _cursor
self.ensure_connection()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not translate host name "awx-postgres" to address: Name or service not known
I assume that the service that points to postgres got renamed from "awx-postgres" to "awx-postgres-13", but the task still references "awx-postgres". As a workaround, I created my own "awx-postgres" service that redirects "awx-postgres" traffic to "awx-postgres-13" using ExternalName:
kind: Service
apiVersion: v1
metadata:
  name: awx-postgres
  namespace: awx
spec:
  type: ExternalName
  externalName: awx-postgres-13.awx.svc.cluster.local
Then after a while the migration succeeded (shell into the pod and run psql --user awx and then \c awx to see the tables). After a while I was able to see the login screen. You might need to wait some time though. You may or may not need to kill both the postgres and awx pods if it doesn't work yet.
@phinx110
I don't think it's a bug. AWX takes the hostname for PSQL from the Secret resource that is created by the Operator.
Operator 0.28.0 creates the Secret with the hostname <instance name>-postgres-<version> correctly: https://github.com/ansible/awx-operator/blob/0.28.0/roles/installer/templates/secrets/postgres_secret.yaml.j2#L19
Regarding "Wiped my v0.25.0 operator to install the new v0.28.0 release": I guess your old Secret resource with the old hostname was reused because it had not been wiped correctly. The AWX Operator reuses the Secret if it already exists.
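To make the naming rule concrete, here is a small Python sketch that builds the hostname in the <instance name>-postgres-<version> form mentioned above and flags a leftover Secret value that does not match it. The function names are hypothetical, and the sketch assumes a managed database deployed by the Operator; it does not talk to the cluster.

```python
def expected_postgres_host(instance_name, postgres_version):
    """Build the hostname in the <instance name>-postgres-<version> form."""
    return f"{instance_name}-postgres-{postgres_version}"


def secret_host_is_stale(secret_host, instance_name, postgres_version):
    """Return True if the host stored in the Secret does not match the expected name."""
    return secret_host != expected_postgres_host(instance_name, postgres_version)


if __name__ == "__main__":
    # A Secret left over from an older install still says "awx-postgres".
    print(secret_host_is_stale("awx-postgres", "awx", 13))
    print(secret_host_is_stale("awx-postgres-13", "awx", 13))
```

If the check reports a stale host, deleting the old Secret before reinstalling lets the Operator create a fresh one instead of reusing it.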
@kurokobo This was indeed the case.
Now I have wiped my cluster again, deleted all remaining secrets and configmaps (just to be sure) inside the awx namespace, and reinstalled the entire stack. Now I get the following failure:
TASK [installer : Check if there are any super users defined.] *****************
task path: /opt/ansible/roles/installer/tasks/initialize_django.yml:2
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: password authentication failed for user "awx"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 8, in <module>
sys.exit(manage())
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 185, in manage
if (connection.pg_version // 10000) < 12:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/connection.py", line 15, in __getattr__
return getattr(self._connections[self._alias], item)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/functional.py", line 48, in __get__
res = instance.__dict__[self.name] = self.func(instance)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 329, in pg_version
with self.temporary_connection():
File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 603, in temporary_connection
with self.cursor() as cursor:
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 259, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 235, in _cursor
self.ensure_connection()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
return func(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 187, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL: password authentication failed for user "awx"
TASK [installer : Create super user via Django if it doesn't exist.] fails for the same reason but at different code lines.
P.S. I solved this by:
psql --user awx
ALTER USER awx WITH PASSWORD 'xxxxxxxxxxxxxxxxxxxx';
<--- password found in the "awx/awx-postgres-configuration" secret.
@phinx110 Hmm, have you wiped the actual data in the PV for PSQL before re-installation? The AWX Operator can't update the password for the specified user if the user already exists in PSQL. I don't know what storage type is used for your PV, but wiping the data in the PV manually is sometimes required for some storage types, e.g. hostPath or NFS, since the new PV is created with the existing data files and the data will be reused by the new PSQL instance.
@kurokobo
I'm not sure regarding the PV. I didn't want to tear down my current setup just to test this out, because I need to get some work done. I tried installing the operator in another namespace to create a separate awx setup on my dev cluster, but I got: Helm install failed: clusterroles.rbac.authorization.k8s.io "awx-operator-proxy-role" already exists, because my first operator was still installed. I'll keep an eye on the fresh install scenario when I deploy to the staging server.
@kurokobo So I installed the v0.29.0 operator and an awx instance on a fresh untouched server and it was successful. I did not need to do anything manually to get it working. I did have to wait a bit for it to come through.
Had the same experience deploying into k3s as per https://github.com/kurokobo/awx-on-k3s (tag 0.30.0) on Ubuntu 20.04. After applying the service workaround as suggested by @phinx110 awx gui started working. I guess the version number should be removed as it might cause future issues when moving to postgres v14/15 etc??
I also faced this problem when I used the awx operator to deploy awx. I found that the postgresql user awx had no password, so I set the awx password to the same value as in the secret and then deleted the awx pod. After doing this, everything went OK!
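For anyone comparing the database password with the one in the secret: the values under a Kubernetes Secret's data field are base64-encoded, so they need decoding first. Below is a minimal Python sketch that decodes a Secret exported as JSON. The function name is hypothetical, and the key names used in the example ("host", "password") are an assumption about what the secret contains, so check your own export.

```python
import base64
import json


def decode_secret_data(secret_json):
    """Decode the base64 values under `data` of a Secret exported as JSON."""
    secret = json.loads(secret_json)
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


if __name__ == "__main__":
    # Stand-in for the output of:
    #   kubectl get secret awx-postgres-configuration -n awx -o json
    # The keys and values below are made up for illustration.
    example = json.dumps({
        "data": {
            "host": base64.b64encode(b"awx-postgres-13").decode("ascii"),
            "password": base64.b64encode(b"example-password").decode("ascii"),
        }
    })
    print(decode_secret_data(example))
```

The decoded password is the value that would go into the ALTER USER statement shown earlier in the thread.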
Hello,
Got a similar problem but with curious error about postgres (found in awx-controller logs):
File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/connection.py\", line 728, in connect", " raise ex.with_traceback(None)", "django.db.utils.OperationalError: connection is bad: Name or service not known"], "stdout": "", "stdout_lines": []}
I tried to run awx-manage create_preload_data and got the same "connection is bad: Name or service not known" error. Curl is also not working from the awx-task or web container to the postgres container at @IP:5432.
My pods:
[root@cad-pod-01:~]# k get pod
NAME                                               READY   STATUS    RESTARTS   AGE
ansible-awx-postgres-13-0                          1/1     Running   0          12m
ansible-awx-web-84c8ff665-gxlft                    3/3     Running   0          12m
ansible-awx-task-5bbbc974dd-4gcwb                  4/4     Running   0          12m
awx-operator-controller-manager-7978c48674-b4csv   2/2     Running   0          12m
After 30 retries, the pods restart.
The initial configuration used a proxy. I unset it everywhere (env vars and systemd service file), but with no success.
Thank you very much for your help.
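"Name or service not known" is a name-resolution failure, which is a different problem from a refused connection or a wrong password. One way to separate the two is to test resolution on its own from inside the web or task container. Below is a minimal Python sketch using the standard library; the function name is hypothetical, the service name in the example is only a guess based on the pod listing above, and the resolver parameter exists just so the check can be exercised without a cluster.

```python
import socket


def can_resolve(hostname, port=5432, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False on a name-resolution error."""
    try:
        resolver(hostname, port)
        return True
    except socket.gaierror:
        # This is the class of error behind "Name or service not known".
        return False


if __name__ == "__main__":
    # Assumed service name; replace it with the host from your postgres secret.
    print(can_resolve("ansible-awx-postgres-13"))
```

If the name does not resolve, the problem lies with cluster DNS or the hostname in the secret rather than with PostgreSQL itself.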
Summary
New installation of awx-controller and awx in a Kubernetes cluster.
The installation finished and the pods are running, but reaching the website only returns Bad Gateway. For some reason, the awx-web pod tries to migrate the database (even if there is nothing to migrate, as it was just created).
AWX version
awx-controller 0.17.0
Installation method
kubernetes
Modifications
no
Ansible version
not relevant, as Kubernetes only
Operating system
not relevant, as Kubernetes only
Web browser
No response
Steps to reproduce
Then create an awx.yaml (mostly just reducing the limits/requests):
Apply the file, wait, wait a little more. Check all pods are running. Then curl the ingress, and you get Bad Gateway. Check the logs of the pod and you get something like this:
Expected results
On new installations no database migration is necessary, hence it should not be executed.
Actual results
The UI is not reachable, due to Bad Gateway. And the pods never finish their database migration.
Additional information
Even though it should not matter, this is a 3-node k3s cluster running v1.22.7+k3s1.