Closed JG127 closed 2 years ago
@dasJ it is mentioned in the description of the issue @anxstj gave a couple of posts above. This temporary postgres server instance is started when the postgres container boots, in order to run the initial migrations (https://github.com/docker-library/postgres/blob/master/11/docker-entrypoint.sh#L297-L302). Since this temporary server is briefly up and accepting connections before being killed/stopped shortly after, it causes the race condition.
Oof. But I can't think of a way to prevent this issue :/
But it's probably a lot less severe, since the second loop only succeeds when connecting as the awx user works, which is only the case once the postgres init script has already run (and not while it's running).
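The safer check described above could be sketched as a small retry helper: only proceed once a real connection as the application user succeeds. This is a sketch, not the project's actual code; the `psql` invocation, host, and the `awx` user/database names in the commented usage are assumptions.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT CMD...: retry CMD until it succeeds or TIMEOUT
# seconds have passed. Returns 0 on success, 1 on timeout.
wait_for() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Hypothetical usage: gate the migration on a real connection as the
# application user, which only works once the init script has finished.
# wait_for 120 psql -h postgres -U awx -d awx -c 'SELECT 1'
```

Because the temporary init-time server never accepts the application user's credentials, this check cannot pass during the race window, unlike a bare "is the port open" probe.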
Yeah, I think it will be a lot less likely to happen (just running any command will already help here), but I think the real fix would be to make the awx-task container fail hard and exit if the db migration fails for any reason. It will be restarted automatically after exiting, the migration will be retried, and it will work on the second attempt. I have not had time yet to investigate where to make the migration task inside the awx-task container fail; sorry for not being super helpful here :(
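That fail-hard behaviour could look roughly like this in the container's startup script. This is only a sketch: `awx-manage migrate` is the command mentioned in this thread, but the wrapper function and the restart-policy assumption are mine, not AWX's actual entrypoint.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_migrations CMD...: run the migration command and report failure
# loudly, so the caller can exit non-zero and let the container's
# restart policy (restart: always / on-failure) retry from scratch.
run_migrations() {
  if ! "$@"; then
    echo "database migration failed, exiting so the container restarts" >&2
    return 1
  fi
}

# Hypothetical entrypoint usage:
# run_migrations awx-manage migrate || exit 1
# exec supervisord   # only start services once migrations succeeded
```

Exiting non-zero turns a silent half-migrated state into a visible crash loop that self-heals once postgres is really ready.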
I was hoping my set -e does exactly that. If the processes tend to retry their migrations, I'm honestly too lazy to investigate this ;)
Had a lot of head scratching in the last week, as I'd been hitting this 100% reproducibly on my system. Two solutions are referenced in this issue, but bear in mind you have to leave about a 2½ minute gap after running the initial playbook (assuming the database is being created from scratch) before taking the remedial action (my system is a quad-core Atom C2550 with 8GB RAM running Ubuntu 20.04 LTS).
It would be very useful to get this resolved, since the out-of-the-box experience with the simplest AWX configuration seems to be problematic (and so likely inhibits adoption) and has been for a while (I tried from HEAD back to, I think, 10 as the earliest).
So to reiterate the two scriptable solutions are:
ansible-playbook -v -i inventory install.yml
sleep 150
docker exec awx_task awx-manage migrate
docker container restart awx_task
sleep 240
as mentioned earlier in this thread by a few people (although it gave me an exception on the upgrade: psycopg2.errors.UndefinedColumn: column "authorize" of relation "main_credential" does not exist
but still seems to work) or:
ansible-playbook -v -i inventory install.yml
sleep 150
ansible-playbook -v -i inventory install.yml
sleep 240
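Since the fixed sleeps above depend on hardware speed (as noted earlier in this thread), one could poll the API instead of sleeping. A sketch, assuming curl is available and that AWX answers on http://localhost/api/v2/ping/; the port and path may differ for your inventory, and the 240-second figure from above becomes an upper bound rather than a fixed wait.

```shell
#!/usr/bin/env bash
# wait_http_ok URL TIMEOUT: poll URL with curl until it responds
# successfully or TIMEOUT seconds elapse. Replaces the fixed sleeps.
wait_http_ok() {
  local url="$1" timeout="$2"
  local deadline=$(( $(date +%s) + timeout ))
  until curl -fsS -o /dev/null "$url"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 5
  done
}

# Hypothetical usage, mirroring the second workaround above:
# ansible-playbook -v -i inventory install.yml
# wait_http_ok http://localhost/api/v2/ping/ 240
# ansible-playbook -v -i inventory install.yml
# wait_http_ok http://localhost/api/v2/ping/ 240
```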
Any chance this is fixed in 16.0.0 ?
No updates on this issue in a while. Going to assume it was fixed or not relevant for newer versions.
TASK [local_docker : Check for existing Postgres data (run from inside the container for access to file)] ***
task path: /root/awx/installer/roles/local_docker/tasks/upgrade_postgres.yml:16
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker run --rm -v '/var/lib/pgdocker:/var/lib/postgresql' centos:8 bash -c \"[[ -f /var/lib/postgresql/10/data/PG_VERSION ]] && echo 'exists'\"\n", "delta": "0:00:00.424388", "end": "2022-06-03 01:00:01.355937", "msg": "non-zero return code", "rc": 1, "start": "2022-06-03 01:00:00.931549", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
I'm facing the above-mentioned issue when running ansible-playbook -i inventory install.yml.
ISSUE TYPE
SUMMARY
A fresh install of the 11.0.0 release doesn't work, even though the installation instructions are followed. There are SQL errors and a recurring error about clustering.
ENVIRONMENT
STEPS TO REPRODUCE
The installation playbook runs without apparent errors. However, when checking the Docker Compose logs there are loads of SQL errors and cluster errors, as shown below.
The procedure was repeated after commenting out the line "dockerhub_base=ansible" in the inventory file, to make certain the AWX Docker images are built locally and in sync with the installer. The very same errors happen.
EXPECTED RESULTS
No errors in the logs and a fully functional application.
ACTUAL RESULTS
The logs are filling with errors and the application is not fully functional. Sometimes I'm getting an angry potato logo; I've added a screenshot as an attachment. What is it used for? :-)
The odd thing, however, is that when there is no angry potato logo the application seems to be functional (i.e. management jobs can be run successfully), despite the huge number of errors in the logs.
When there is an angry potato logo I can log in but not run jobs.
ADDITIONAL INFORMATION
These SQL statement errors are repeated very frequently: the relations "conf_setting" and "main_instance" do not exist.
This error about clustering is repeated very frequently: