Closed JG127 closed 2 years ago
@dasJ it is mentioned in the description of the issue @anxstj gave a couple of posts above. This temporary postgres server instance is started when the postgres container boots, in order to run the initial migrations (https://github.com/docker-library/postgres/blob/master/11/docker-entrypoint.sh#L297-L302). Since this temporary server is briefly up and accepting connections before being killed/stopped shortly after, it causes the race condition.
Oof. But I can't think of a way to prevent this issue :/
But it's probably a lot less severe, since the second loop only succeeds when connecting as the awx user works, which is only the case once the postgres init script has already run (and not while it's running).
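The safer check described above could be sketched as a small retry helper: only proceed once a real connection as the application user succeeds. This is a sketch, not the project's actual code; the `psql` invocation, host, and the `awx` user/database names in the commented usage are assumptions.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT CMD...: retry CMD until it succeeds or TIMEOUT
# seconds have passed. Returns 0 on success, 1 on timeout.
wait_for() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Hypothetical usage: gate the migration on a real connection as the
# application user, which only works once the init script has finished.
# wait_for 120 psql -h postgres -U awx -d awx -c 'SELECT 1'
```

Because the temporary init-time server never accepts the application user's credentials, this check cannot pass during the race window, unlike a bare "is the port open" probe.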
Yeah, I think it will be a lot less likely to happen (just running any command will already help here), but I think the real fix would be to make the awx-task container fail hard and exit if the db migration fails for any reason. It will be restarted automatically after exiting, the migration will be retried, and it will work on the second attempt. I have not had time yet to investigate where to make the migration task inside the awx-task container fail; sorry for not being super helpful here :(
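That fail-hard behaviour could look roughly like this in the container's startup script. This is only a sketch: `awx-manage migrate` is the command mentioned in this thread, but the wrapper function and the restart-policy assumption are mine, not AWX's actual entrypoint.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_migrations CMD...: run the migration command and report failure
# loudly, so the caller can exit non-zero and let the container's
# restart policy (restart: always / on-failure) retry from scratch.
run_migrations() {
  if ! "$@"; then
    echo "database migration failed, exiting so the container restarts" >&2
    return 1
  fi
}

# Hypothetical entrypoint usage:
# run_migrations awx-manage migrate || exit 1
# exec supervisord   # only start services once migrations succeeded
```

Exiting non-zero turns a silent half-migrated state into a visible crash loop that self-heals once postgres is really ready.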
I was hoping my set -e does exactly that. If the processes tend to retry their migrations, I'm honestly too lazy to investigate this ;)
Had a lot of head scratching in the last week, as I'd been hitting this 100% reproducibly on my system. Two solutions are referenced in this issue, but bear in mind you have to leave about a 2½ minute gap after running the initial playbook (assuming the database is being created from scratch) before taking the remedial action (my system is a quad-core Atom C2550 with 8GB RAM running Ubuntu 20.04 LTS).
It would be very useful to get this resolved, since the out-of-the-box experience with the simplest AWX configuration seems to be problematic (and so likely inhibits adoption) and has been for a while (I tried from HEAD back to, I think, 10 as the earliest).
So to reiterate the two scriptable solutions are:
ansible-playbook -v -i inventory install.yml
sleep 150
docker exec awx_task awx-manage migrate
docker container restart awx_task
sleep 240
as mentioned earlier in this thread by a few people (although it gave me an exception on the upgrade: psycopg2.errors.UndefinedColumn: column "authorize" of relation "main_credential" does not exist
but still seems to work) or:
ansible-playbook -v -i inventory install.yml
sleep 150
ansible-playbook -v -i inventory install.yml
sleep 240
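Since the fixed sleeps above depend on hardware speed (as noted earlier in this thread), one could poll the API instead of sleeping. A sketch, assuming curl is available and that AWX answers on http://localhost/api/v2/ping/; the port and path may differ for your inventory, and the 240-second figure from above becomes an upper bound rather than a fixed wait.

```shell
#!/usr/bin/env bash
# wait_http_ok URL TIMEOUT: poll URL with curl until it responds
# successfully or TIMEOUT seconds elapse. Replaces the fixed sleeps.
wait_http_ok() {
  local url="$1" timeout="$2"
  local deadline=$(( $(date +%s) + timeout ))
  until curl -fsS -o /dev/null "$url"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 5
  done
}

# Hypothetical usage, mirroring the second workaround above:
# ansible-playbook -v -i inventory install.yml
# wait_http_ok http://localhost/api/v2/ping/ 240
# ansible-playbook -v -i inventory install.yml
# wait_http_ok http://localhost/api/v2/ping/ 240
```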
Any chance this is fixed in 16.0.0 ?
No updates on this issue in a while. Going to assume it was fixed or not relevant for newer versions.
TASK [local_docker : Check for existing Postgres data (run from inside the container for access to file)] ***
task path: /root/awx/installer/roles/local_docker/tasks/upgrade_postgres.yml:16
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker run --rm -v '/var/lib/pgdocker:/var/lib/postgresql' centos:8 bash -c \"[[ -f /var/lib/postgresql/10/data/PG_VERSION ]] && echo 'exists'\"\n", "delta": "0:00:00.424388", "end": "2022-06-03 01:00:01.355937", "msg": "non-zero return code", "rc": 1, "start": "2022-06-03 01:00:00.931549", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
I'm facing the above-mentioned issue when running ansible-playbook -i inventory install.yml.
ISSUE TYPE
SUMMARY
A fresh install of the 11.0.0 release doesn't work, even though the installation instructions are followed. There are SQL errors and a recurring error about clustering.
ENVIRONMENT
STEPS TO REPRODUCE
The installation playbook runs without apparent errors. However, when checking the Docker Compose logs there are loads of SQL errors and cluster errors, as shown below.
The procedure was repeated after commenting out the line "dockerhub_base=ansible" in the inventory file, to make certain the AWX Docker images are built locally and in sync with the installer. The very same errors happen.
EXPECTED RESULTS
No errors in the logs and a fully functional application.
ACTUAL RESULTS
The logs are filling with errors and the application is not fully functional. Sometimes I'm getting an angry potato logo; I've added a screenshot as an attachment. What is it used for? :-)
The odd thing, however, is that when there is no angry potato logo the application seems to be functional (i.e. management jobs can be run successfully), despite the huge number of errors in the logs.
When there is an angry potato logo I can log in but not run jobs.
ADDITIONAL INFORMATION
These SQL statement errors are repeated very frequently: the relations "conf_setting" and "main_instance" do not exist.
This error about clustering is repeated very frequently: