ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Vanilla install 11.0.0 fails #6792

Closed: JG127 closed this issue 2 years ago

JG127 commented 4 years ago
ISSUE TYPE
SUMMARY

A fresh install of the 11.0.0 release doesn't work, even though the installation instructions are followed. There are SQL errors and a recurring error about clustering.

ENVIRONMENT
STEPS TO REPRODUCE
  1. git clone https://github.com/ansible/awx.git
  2. cd awx
  3. git checkout 11.0.0
  4. cd installer
  5. rm -rf ~/.awx (make certain it is a clean install, empty database)
  6. docker stop $(docker ps -q)
  7. docker rm $(docker ps -qa)
  8. docker rmi -f $(docker image ls -q)
  9. docker system prune -f
  10. virtualenv -p python2 venv
  11. source venv/bin/activate
  12. pip install ansible
  13. pip install docker-compose
  14. ansible-playbook -i inventory install.yml

The installation playbook runs without apparent errors. However, the Docker Compose logs contain loads of SQL errors and cluster errors, as shown below.

The procedure was repeated with the line "dockerhub_base=ansible" in the inventory file commented out, to make certain the AWX Docker images are built locally and in sync with the installer. The very same errors happen.

EXPECTED RESULTS

No errors in the logs and a fully functional application.

ACTUAL RESULTS

The logs are filling up with errors and the application is not fully functional. Sometimes I'm getting an angry potato logo; I've attached a screenshot. What is it used for? :-)

The odd thing, however, is that when there is no angry potato logo the application seems to be functional (i.e. management jobs can be run successfully), despite the huge number of errors in the logs.

When there is an angry potato logo I can log in but not run jobs.

ADDITIONAL INFORMATION

The SQL errors below are repeated very frequently: the relations "conf_setting" and "main_instance" do not exist.

awx_postgres | 2020-04-22 07:14:18.999 UTC [43] ERROR:  relation "conf_setting" does not exist at character 158
awx_postgres | 2020-04-22 07:14:18.999 UTC [43] STATEMENT:  SELECT "conf_setting"."id", "conf_setting"."created", "conf_setting"."modified", "conf_setting"."key", "conf_setting"."value", "conf_setting"."user_id" FROM "conf_setting" WHERE ("conf_setting"."key" = 'OAUTH2_PROVIDER' AND "conf_setting"."user_id" IS NULL) ORDER BY "conf_setting"."id" ASC  LIMIT 1

awx_postgres | 2020-04-22 07:14:19.153 UTC [43] ERROR:  relation "main_instance" does not exist at character 24
awx_postgres | 2020-04-22 07:14:19.153 UTC [43] STATEMENT:  SELECT (1) AS "a" FROM "main_instance" WHERE "main_instance"."hostname" = 'awx'  LIMIT 1

This error about clustering is repeated very frequently:


awx_web      | Traceback (most recent call last):
awx_web      |   File "/usr/bin/awx-manage", line 8, in <module>
awx_web      |     sys.exit(manage())
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 152, in manage
awx_web      |     execute_from_command_line(sys.argv)
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
awx_web      |     utility.execute()
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
awx_web      |     self.fetch_command(subcommand).run_from_argv(self.argv)
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
awx_web      |     self.execute(*args, **cmd_options)
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
awx_web      |     output = self.handle(*args, **options)
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_wsbroadcast.py", line 128, in handle
awx_web      |     broadcast_websocket_mgr = BroadcastWebsocketManager()
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/wsbroadcast.py", line 151, in __init__
awx_web      |     self.local_hostname = get_local_host()
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/wsbroadcast.py", line 45, in get_local_host
awx_web      |     return Instance.objects.me().hostname
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 116, in me
awx_web      |     raise RuntimeError("No instance found with the current cluster host id")
awx_web      | RuntimeError: No instance found with the current cluster host id

[Screenshot attached: awx_upgrading]

roedie commented 4 years ago

I can confirm this is happening. I just did a clean install as well. Same problem.

ryanpetrello commented 4 years ago

Yep, I'm able to reproduce this. Looking into it.

ryanpetrello commented 4 years ago

Actually, I can't reproduce this; as the error message above suggested, it was just migrating (which took a minute).

ryanpetrello commented 4 years ago

https://asciinema.org/a/4vZ7VJpMWFMx3tpA7CYNCGHZf

roedie commented 4 years ago

Here's my asciinema. This is on a cleanly installed Debian 10 host with docker and ansible.

https://asciinema.org/a/b1jgaeSFWiv6jHkFmPpO8aLTI?t=13

This is the vars.yml:

postgres_data_dir: "/srv/pgdocker"
docker_compose_dir: "/srv/awxcompose"
pg_password: "pgpass"
admin_password: "adminpass"
secret_key: "secretkey"
project_data_dir: "/srv/awx/projects"
bryanasdev000 commented 4 years ago

I'm trying to set up AWX using docker-compose and I'm having the same problems as the OP, resulting in an infinite loop (30 minutes so far) of Ansible trying to perform the migrations. I will test again from scratch and report as soon as possible.

roedie commented 4 years ago

It never finishes the migrations on my hosts, at least not within an hour. I still have it running so I can have a look again tomorrow ;-)

ryanpetrello commented 4 years ago

Do you see any errors related to migrations? What happens if you exec into the web container and run:

awx-manage migrate

by hand?

JG127 commented 4 years ago

Maybe unrelated to this issue, but release 11.1.0 has the same errors. After about 15 minutes of error messages it seems to resume its proper routine.

JG127 commented 4 years ago
$ docker-compose exec web bash
bash-4.4# awx-manage migrate
Operations to perform:
  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit
Running migrations:
  No migrations to apply.

There must be a difference somewhere. The Python runtime environment perhaps ?

roedie commented 4 years ago

Hmmm, I get different output than @JG127.

root@awx-test:~# docker exec -ti  261e78c819ad bash
bash-4.4# awx-manage migrate
Operations to perform:
  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit
Running migrations:
  Applying main.0001_initial... OK
  Applying main.0002_squashed_v300_release... OK
  Applying main.0003_squashed_v300_v303_updates... OK
  Applying main.0004_squashed_v310_release... OK
  Applying conf.0001_initial... OK
  Applying conf.0002_v310_copy_tower_settings... OK
  Applying main.0005_squashed_v310_v313_updates... OK
  Applying main.0006_v320_release... OK
  Applying main.0007_v320_data_migrations... OK
  Applying main.0008_v320_drop_v1_credential_fields... OK
  Applying main.0009_v322_add_setting_field_for_activity_stream... OK
  Applying main.0010_v322_add_ovirt4_tower_inventory... OK
  Applying main.0011_v322_encrypt_survey_passwords... OK
  Applying main.0012_v322_update_cred_types... OK
  Applying main.0013_v330_multi_credential... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying auth.0009_alter_user_last_name_max_length... OK
  Applying auth.0010_alter_group_name_max_length... OK
  Applying auth.0011_update_proxy_permissions... OK
  Applying conf.0003_v310_JSONField_changes... OK
  Applying conf.0004_v320_reencrypt... OK
  Applying conf.0005_v330_rename_two_session_settings... OK
  Applying conf.0006_v331_ldap_group_type... OK
  Applying sessions.0001_initial... OK
  Applying main.0014_v330_saved_launchtime_configs... OK
  Applying main.0015_v330_blank_start_args... OK
  Applying main.0016_v330_non_blank_workflow... OK
  Applying main.0017_v330_move_deprecated_stdout... OK
  Applying main.0018_v330_add_additional_stdout_events... OK
  Applying main.0019_v330_custom_virtualenv... OK
  Applying main.0020_v330_instancegroup_policies... OK
  Applying main.0021_v330_declare_new_rbac_roles... OK
  Applying main.0022_v330_create_new_rbac_roles... OK
  Applying main.0023_v330_inventory_multicred... OK
  Applying main.0024_v330_create_user_session_membership... OK
  Applying main.0025_v330_add_oauth_activity_stream_registrar... OK
  Applying oauth2_provider.0001_initial... OK
  Applying main.0026_v330_delete_authtoken... OK
  Applying main.0027_v330_emitted_events... OK
  Applying main.0028_v330_add_tower_verify... OK
  Applying main.0030_v330_modify_application... OK
  Applying main.0031_v330_encrypt_oauth2_secret... OK
  Applying main.0032_v330_polymorphic_delete... OK
  Applying main.0033_v330_oauth_help_text... OK
2020-04-23 08:00:25,638 INFO     rbac_migrations Computing role roots..
2020-04-23 08:00:25,640 INFO     rbac_migrations Found 0 roots in 0.000213 seconds, rebuilding ancestry map
2020-04-23 08:00:25,640 INFO     rbac_migrations Rebuild ancestors completed in 0.000008 seconds
2020-04-23 08:00:25,640 INFO     rbac_migrations Done.
  Applying main.0034_v330_delete_user_role... OK
  Applying main.0035_v330_more_oauth2_help_text... OK
  Applying main.0036_v330_credtype_remove_become_methods... OK
  Applying main.0037_v330_remove_legacy_fact_cleanup... OK
  Applying main.0038_v330_add_deleted_activitystream_actor... OK
  Applying main.0039_v330_custom_venv_help_text... OK
  Applying main.0040_v330_unifiedjob_controller_node... OK
  Applying main.0041_v330_update_oauth_refreshtoken... OK
2020-04-23 08:00:29,220 INFO     rbac_migrations Computing role roots..
2020-04-23 08:00:29,225 INFO     rbac_migrations Found 0 roots in 0.000184 seconds, rebuilding ancestry map
2020-04-23 08:00:29,225 INFO     rbac_migrations Rebuild ancestors completed in 0.000010 seconds
2020-04-23 08:00:29,225 INFO     rbac_migrations Done.
  Applying main.0042_v330_org_member_role_deparent... OK
  Applying main.0043_v330_oauth2accesstoken_modified... OK
  Applying main.0044_v330_add_inventory_update_inventory... OK
  Applying main.0045_v330_instance_managed_by_policy... OK
  Applying main.0046_v330_remove_client_credentials_grant... OK
  Applying main.0047_v330_activitystream_instance... OK
  Applying main.0048_v330_django_created_modified_by_model_name... OK
  Applying main.0049_v330_validate_instance_capacity_adjustment... OK
  Applying main.0050_v340_drop_celery_tables... OK
  Applying main.0051_v340_job_slicing... OK
  Applying main.0052_v340_remove_project_scm_delete_on_next_update... OK
  Applying main.0053_v340_workflow_inventory... OK
  Applying main.0054_v340_workflow_convergence... OK
  Applying main.0055_v340_add_grafana_notification... OK
  Applying main.0056_v350_custom_venv_history... OK
  Applying main.0057_v350_remove_become_method_type... OK
  Applying main.0058_v350_remove_limit_limit... OK
  Applying main.0059_v350_remove_adhoc_limit... OK
  Applying main.0060_v350_update_schedule_uniqueness_constraint... OK
2020-04-23 08:00:44,638 DEBUG    awx.main.models.credential adding Machine credential type
2020-04-23 08:00:44,660 DEBUG    awx.main.models.credential adding Source Control credential type
2020-04-23 08:00:44,673 DEBUG    awx.main.models.credential adding Vault credential type
2020-04-23 08:00:44,683 DEBUG    awx.main.models.credential adding Network credential type
2020-04-23 08:00:44,692 DEBUG    awx.main.models.credential adding Amazon Web Services credential type
2020-04-23 08:00:44,702 DEBUG    awx.main.models.credential adding OpenStack credential type
2020-04-23 08:00:44,713 DEBUG    awx.main.models.credential adding VMware vCenter credential type
2020-04-23 08:00:44,723 DEBUG    awx.main.models.credential adding Red Hat Satellite 6 credential type
2020-04-23 08:00:44,733 DEBUG    awx.main.models.credential adding Red Hat CloudForms credential type
2020-04-23 08:00:44,743 DEBUG    awx.main.models.credential adding Google Compute Engine credential type
2020-04-23 08:00:44,753 DEBUG    awx.main.models.credential adding Microsoft Azure Resource Manager credential type
2020-04-23 08:00:44,763 DEBUG    awx.main.models.credential adding GitHub Personal Access Token credential type
2020-04-23 08:00:44,773 DEBUG    awx.main.models.credential adding GitLab Personal Access Token credential type
2020-04-23 08:00:44,784 DEBUG    awx.main.models.credential adding Insights credential type
2020-04-23 08:00:44,794 DEBUG    awx.main.models.credential adding Red Hat Virtualization credential type
2020-04-23 08:00:44,804 DEBUG    awx.main.models.credential adding Ansible Tower credential type
2020-04-23 08:00:44,814 DEBUG    awx.main.models.credential adding OpenShift or Kubernetes API Bearer Token credential type
2020-04-23 08:00:44,823 DEBUG    awx.main.models.credential adding CyberArk AIM Central Credential Provider Lookup credential type
2020-04-23 08:00:44,833 DEBUG    awx.main.models.credential adding Microsoft Azure Key Vault credential type
2020-04-23 08:00:44,843 DEBUG    awx.main.models.credential adding CyberArk Conjur Secret Lookup credential type
2020-04-23 08:00:44,854 DEBUG    awx.main.models.credential adding HashiCorp Vault Secret Lookup credential type
2020-04-23 08:00:44,864 DEBUG    awx.main.models.credential adding HashiCorp Vault Signed SSH credential type
  Applying main.0061_v350_track_native_credentialtype_source... OK
  Applying main.0062_v350_new_playbook_stats... OK
  Applying main.0063_v350_org_host_limits... OK
  Applying main.0064_v350_analytics_state... OK
  Applying main.0065_v350_index_job_status... OK
  Applying main.0066_v350_inventorysource_custom_virtualenv... OK
  Applying main.0067_v350_credential_plugins... OK
  Applying main.0068_v350_index_event_created... OK
  Applying main.0069_v350_generate_unique_install_uuid... OK
2020-04-23 08:00:48,324 DEBUG    awx.main.migrations Migrating inventory instance_id for gce to gce_id
  Applying main.0070_v350_gce_instance_id... OK
  Applying main.0071_v350_remove_system_tracking... OK
  Applying main.0072_v350_deprecate_fields... OK
  Applying main.0073_v360_create_instance_group_m2m... OK
  Applying main.0074_v360_migrate_instance_group_relations... OK
  Applying main.0075_v360_remove_old_instance_group_relations... OK
  Applying main.0076_v360_add_new_instance_group_relations... OK
  Applying main.0077_v360_add_default_orderings... OK
  Applying main.0078_v360_clear_sessions_tokens_jt... OK
  Applying main.0079_v360_rm_implicit_oauth2_apps... OK
  Applying main.0080_v360_replace_job_origin... OK
  Applying main.0081_v360_notify_on_start... OK
  Applying main.0082_v360_webhook_http_method... OK
  Applying main.0083_v360_job_branch_override... OK
  Applying main.0084_v360_token_description... OK
  Applying main.0085_v360_add_notificationtemplate_messages... OK
  Applying main.0086_v360_workflow_approval... OK
  Applying main.0087_v360_update_credential_injector_help_text... OK
  Applying main.0088_v360_dashboard_optimizations... OK
  Applying main.0089_v360_new_job_event_types... OK
  Applying main.0090_v360_WFJT_prompts... OK
  Applying main.0091_v360_approval_node_notifications... OK
  Applying main.0092_v360_webhook_mixin... OK
  Applying main.0093_v360_personal_access_tokens... OK
  Applying main.0094_v360_webhook_mixin2... OK
  Applying main.0095_v360_increase_instance_version_length... OK
  Applying main.0096_v360_container_groups... OK
  Applying main.0097_v360_workflowapproval_approved_or_denied_by... OK
  Applying main.0098_v360_rename_cyberark_aim_credential_type... OK
  Applying main.0099_v361_license_cleanup... OK
  Applying main.0100_v370_projectupdate_job_tags... OK
  Applying main.0101_v370_generate_new_uuids_for_iso_nodes... OK
  Applying main.0102_v370_unifiedjob_canceled... OK
  Applying main.0103_v370_remove_computed_fields... OK
  Applying main.0104_v370_cleanup_old_scan_jts... OK
  Applying main.0105_v370_remove_jobevent_parent_and_hosts... OK
  Applying main.0106_v370_remove_inventory_groups_with_active_failures... OK
  Applying main.0107_v370_workflow_convergence_api_toggle... OK
  Applying main.0108_v370_unifiedjob_dependencies_processed... OK
2020-04-23 08:01:26,793 DEBUG    rbac_migrations Migrating inventorysource to new organization field
2020-04-23 08:01:26,808 DEBUG    rbac_migrations Migrating jobtemplate to new organization field
2020-04-23 08:01:26,816 DEBUG    rbac_migrations Migrating project to new organization field
2020-04-23 08:01:26,822 DEBUG    rbac_migrations Migrating systemjobtemplate to new organization field
2020-04-23 08:01:26,822 DEBUG    rbac_migrations Class systemjobtemplate has no organization migration
2020-04-23 08:01:26,822 DEBUG    rbac_migrations Migrating workflowjobtemplate to new organization field
2020-04-23 08:01:26,829 DEBUG    rbac_migrations Migrating workflowapprovaltemplate to new organization field
2020-04-23 08:01:26,829 DEBUG    rbac_migrations Class workflowapprovaltemplate has no organization migration
2020-04-23 08:01:26,830 INFO     rbac_migrations Unified organization migration completed in 0.0366 seconds
2020-04-23 08:01:26,830 DEBUG    rbac_migrations Migrating adhoccommand to new organization field
2020-04-23 08:01:26,838 DEBUG    rbac_migrations Migrating inventoryupdate to new organization field
2020-04-23 08:01:26,846 DEBUG    rbac_migrations Migrating job to new organization field
2020-04-23 08:01:26,853 DEBUG    rbac_migrations Migrating projectupdate to new organization field
2020-04-23 08:01:26,861 DEBUG    rbac_migrations Migrating systemjob to new organization field
2020-04-23 08:01:26,861 DEBUG    rbac_migrations Class systemjob has no organization migration
2020-04-23 08:01:26,861 DEBUG    rbac_migrations Migrating workflowjob to new organization field
2020-04-23 08:01:26,869 DEBUG    rbac_migrations Migrating workflowapproval to new organization field
2020-04-23 08:01:26,869 DEBUG    rbac_migrations Class workflowapproval has no organization migration
2020-04-23 08:01:26,869 INFO     rbac_migrations Unified organization migration completed in 0.0391 seconds
2020-04-23 08:01:29,831 DEBUG    rbac_migrations No changes to role parents for 0 resources
2020-04-23 08:01:29,831 DEBUG    rbac_migrations Added parents to 0 roles
2020-04-23 08:01:29,831 DEBUG    rbac_migrations Removed parents from 0 roles
2020-04-23 08:01:29,832 INFO     rbac_migrations Rebuild parentage completed in 0.004574 seconds
  Applying main.0109_v370_job_template_organization_field... OK
  Applying main.0110_v370_instance_ip_address... OK
  Applying main.0111_v370_delete_channelgroup... OK
  Applying main.0112_v370_workflow_node_identifier... OK
  Applying main.0113_v370_event_bigint... OK
  Applying main.0114_v370_remove_deprecated_manual_inventory_sources... OK
  Applying oauth2_provider.0002_08_updates... OK
  Applying oauth2_provider.0003_auto_20160316_1503... OK
  Applying oauth2_provider.0004_auto_20160525_1623... OK
  Applying oauth2_provider.0005_auto_20170514_1141... OK
  Applying oauth2_provider.0006_auto_20171214_2232... OK
  Applying sites.0001_initial... OK
  Applying sites.0002_alter_domain_unique... OK
  Applying social_django.0001_initial... OK
  Applying social_django.0002_add_related_name... OK
  Applying social_django.0003_alter_email_max_length... OK
  Applying social_django.0004_auto_20160423_0400... OK
  Applying social_django.0005_auto_20160727_2333... OK
  Applying social_django.0006_partial... OK
  Applying social_django.0007_code_timestamp... OK
  Applying social_django.0008_partial_timestamp... OK
  Applying sso.0001_initial... OK
  Applying sso.0002_expand_provider_options... OK
  Applying taggit.0003_taggeditem_add_unique_index... OK
bash-4.4# 

After this I do get the login prompt, but somehow I cannot log in.

roedie commented 4 years ago

After the migrations I still get the crashing dispatcher:

2020-04-23 08:19:14,009 INFO spawned: 'dispatcher' with pid 25200
2020-04-23 08:19:15,011 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-04-23 08:19:17,893 WARNING  awx.main.dispatch.periodic periodic beat started
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 152, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 55, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/reaper.py", line 38, in reap
    (changed, me) = Instance.objects.get_or_register()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 144, in get_or_register
    return (False, self.me())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id
2020-04-23 08:19:18,418 INFO exited: dispatcher (exit status 1; not expected)
JG127 commented 4 years ago

Adding the logs of the installer and the logs of the very first "docker-compose up" command.

initial_start.log.tar.gz ansible_install.log

I've got the impression the web service is waking up too early. It logs all sorts of errors for as long as the task service hasn't finished with the migrations.

Or rather, the task service wakes up late for the migrations. It's only at the end of the log file it begins to actually do something.

The net result, however, is that the system is functional, albeit it took its time to get to that point. Maybe a network timeout somewhere?

ryanpetrello commented 4 years ago

I agree, @JG127, this does sound like some sort of timing issue/race on startup. I've still been unable to reproduce, so if any of you find any additional clues, please let me know and I'm glad to help dig.
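
If it really is that kind of race, one crude way to confirm it from the outside (a sketch only; it assumes the local_docker install and the default awx_task container name) would be to poll the task container until Django reports no unapplied migrations before expecting the web container to quiet down:

# awx-manage wraps Django's manage.py, so showmigrations marks applied
# migrations with [X] and pending ones with [ ]
# (if awx-manage itself errors out, the loop also exits, so this is only a rough check)
while docker exec awx_task awx-manage showmigrations 2>/dev/null | grep -q '\[ \]'; do
    echo "migrations still pending, waiting..."
    sleep 15
done
echo "all migrations applied"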

JG127 commented 4 years ago

The only thing coming to mind is the Python environment used to run the installer and the application. I always work in a virtualenv when doing Python projects; otherwise I'd end up using the libraries of the OS-installed software.

This is the elaborate description of what I do to set up the runtime environment:

# Make certain no third-party configuration impacts the process
$ mv ~/.local ~/.local_disabled
$ sudo mv /etc/ansible/ansible.cfg /etc/ansible/ansible.cfg_disabled
$ sudo mv ~/.awx ~/.awx_disabled

# Clean Docker completely 
$ docker stop $(docker ps -q)
$ docker rm $(docker ps -qa)
$ docker rmi -f $(docker image ls -q)
$ docker system prune -f
$ docker builder prune -f
$ docker volume prune -f

# Create the runtime environment
$ virtualenv -p python2 venv
Running virtualenv with interpreter ~/.pyenv/shims/python2
Already using interpreter /usr/bin/python2
New python executable in ~/Projects/awx/venv/bin/python2
Also creating executable in ~/Projects/awx/venv/bin/python
Installing setuptools, pip, wheel...
done.
$ source venv/bin/activate
(venv) $ pip install ansible docker-compose
  ...
  ...
  ...
(venv) $ pip freeze
ansible==2.9.7
attrs==19.3.0
backports.shutil-get-terminal-size==1.0.0
backports.ssl-match-hostname==3.7.0.1
bcrypt==3.1.7
cached-property==1.5.1
certifi==2020.4.5.1
cffi==1.14.0
chardet==3.0.4
configparser==4.0.2
contextlib2==0.6.0.post1
cryptography==2.9.2
docker==4.2.0
docker-compose==1.25.5
dockerpty==0.4.1
docopt==0.6.2
enum34==1.1.10
functools32==3.2.3.post2
idna==2.9
importlib-metadata==1.6.0
ipaddress==1.0.23
Jinja2==2.11.2
jsonschema==3.2.0
MarkupSafe==1.1.1
paramiko==2.7.1
pathlib2==2.3.5
pycparser==2.20
PyNaCl==1.3.0
pyrsistent==0.16.0
PyYAML==5.3.1
requests==2.23.0
scandir==1.10.0
six==1.14.0
subprocess32==3.5.4
texttable==1.6.2
urllib3==1.25.9
websocket-client==0.57.0
zipp==1.2.0

(venv) $ python --version
Python 2.7.17

(venv) $ ansible --version
ansible 2.9.7
  config file = None
  configured module search path = [u'~/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = ~/Projects/awx/venv/local/lib/python2.7/site-packages/ansible
  executable location = ~/Projects/awx/venv/bin/ansible
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]

(venv) $ docker-compose --version
docker-compose version 1.25.5, build unknown

(venv) $ docker --version
Docker version 19.03.8, build afacb8b7f0

(venv) $ docker system info
Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.3.0-46-generic
 Operating System: Linux Mint 19.3
 OSType: linux
 Architecture: x86_64
 CPUs: 7
 Total Memory: 7.773GiB
 Name: workvm
 ID: DGRT:4RDB:6YC2:QTEB:U3IL:HDDQ:VCIT:HSUW:L344:KORB:SAPZ:MXIB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

(venv) $ cd installer

(venv) $ ansible-playbook -i inventory install.yml

PLAY [Build and deploy AWX] ******************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************************
ok: [localhost]

TASK [check_vars : include_tasks] ************************************************************************************************************************************
skipping: [localhost]

TASK [check_vars : include_tasks] ************************************************************************************************************************************
included: /home/jan/Projects/awx/installer/roles/check_vars/tasks/check_docker.yml for localhost

TASK [check_vars : postgres_data_dir should be defined] **************************************************************************************************************
ok: [localhost] => {
    "changed": false, 
    "msg": "All assertions passed"
}

TASK [check_vars : host_port should be defined] **********************************************************************************************************************
ok: [localhost] => {
    "changed": false, 
    "msg": "All assertions passed"
}

TASK [image_build : Set global version if not provided] **************************************************************************************************************
skipping: [localhost]

TASK [image_build : Verify awx-logos directory exists for official install] ******************************************************************************************
skipping: [localhost]

TASK [image_build : Copy logos for inclusion in sdist] ***************************************************************************************************************
skipping: [localhost]

TASK [image_build : Set sdist file name] *****************************************************************************************************************************
skipping: [localhost]

TASK [image_build : AWX Distribution] ********************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stat distribution file] **************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Clean distribution] ******************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Build sdist builder image] ***********************************************************************************************************************
skipping: [localhost]

TASK [image_build : Build AWX distribution using container] **********************************************************************************************************
skipping: [localhost]

TASK [image_build : Build AWX distribution locally] ******************************************************************************************************************
skipping: [localhost]

TASK [image_build : Set docker build base path] **********************************************************************************************************************
skipping: [localhost]

TASK [image_build : Set awx_web image name] **************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Set awx_task image name] *************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Ensure directory exists] *************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage sdist] *************************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Template web Dockerfile] *************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Template task Dockerfile] ************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage launch_awx] ********************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage launch_awx_task] ***************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage google-cloud-sdk.repo] *********************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage rsyslog.repo] ******************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage rsyslog.conf] ******************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage supervisor.conf] ***************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage supervisor_task.conf] **********************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage settings.py] *******************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage requirements] ******************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage config watcher] ****************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Stage Makefile] **********************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Build base web image] ****************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Build base task image] ***************************************************************************************************************************
skipping: [localhost]

TASK [image_build : Tag task and web images as latest] ***************************************************************************************************************
skipping: [localhost]

TASK [image_build : Clean docker base directory] *********************************************************************************************************************
skipping: [localhost]

TASK [image_push : Authenticate with Docker registry if registry password given] *************************************************************************************
skipping: [localhost]

TASK [image_push : Remove web image] *********************************************************************************************************************************
skipping: [localhost]

TASK [image_push : Remove task image] ********************************************************************************************************************************
skipping: [localhost]

TASK [image_push : Tag and push web image to registry] ***************************************************************************************************************
skipping: [localhost]

TASK [image_push : Tag and push task image to registry] **************************************************************************************************************
skipping: [localhost]

TASK [image_push : Set full image path for Registry] *****************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Generate broadcast websocket secret] **************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : fail] *********************************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : include_tasks] ************************************************************************************************************************************
skipping: [localhost] => (item=openshift_auth.yml) 
skipping: [localhost] => (item=openshift.yml) 

TASK [kubernetes : include_tasks] ************************************************************************************************************************************
skipping: [localhost] => (item=kubernetes_auth.yml) 
skipping: [localhost] => (item=kubernetes.yml) 

TASK [kubernetes : Use kubectl or oc] ********************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : set_fact] *****************************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Record deployment size] ***************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set expected post-deployment Replicas value] ******************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Delete existing Deployment (or StatefulSet)] ******************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Get Postgres Service Detail] **********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Template PostgreSQL Deployment (OpenShift)] *******************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Deploy and Activate Postgres (OpenShift)] *********************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Create Temporary Values File (Kubernetes)] ********************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Populate Temporary Values File (Kubernetes)] ******************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Deploy and Activate Postgres (Kubernetes)] ********************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Remove tempfile] **********************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set postgresql hostname to helm package service (Kubernetes)] *************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for Postgres to activate] ********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Check if Postgres 9.6 is being used] **************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set new pg image] *********************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for change to take affect] *******************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set env var for pg upgrade] ***********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for change to take affect] *******************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set env var for new pg version] *******************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for Postgres to redeploy] ********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for Postgres to finish upgrading] ************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Unset upgrade env var] ****************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for Postgres to redeploy] ********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set task image name] ******************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Set web image name] *******************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Determine Deployment api version] *****************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Render deployment templates] **********************************************************************************************************************
skipping: [localhost] => (item=None) 
skipping: [localhost] => (item=None) 
skipping: [localhost] => (item=None) 
skipping: [localhost] => (item=None) 
skipping: [localhost] => (item=None) 
skipping: [localhost]

TASK [kubernetes : Apply Deployment] *********************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Delete any existing management pod] ***************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Template management pod] **************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Create management pod] ****************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Wait for management pod to start] *****************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Migrate database] *********************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Check for Tower Super users] **********************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : create django super user if it does not exist] ****************************************************************************************************
skipping: [localhost]

TASK [kubernetes : update django super user password] ****************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Create the default organization if it is needed.] *************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Delete management pod] ****************************************************************************************************************************
skipping: [localhost]

TASK [kubernetes : Scale up deployment] ******************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Generate broadcast websocket secret] ************************************************************************************************************
ok: [localhost]

TASK [local_docker : Check for existing Postgres data] ***************************************************************************************************************
ok: [localhost]

TASK [local_docker : Record Postgres version] ************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Determine whether to upgrade postgres] **********************************************************************************************************
ok: [localhost]

TASK [local_docker : Set up new postgres paths pre-upgrade] **********************************************************************************************************
skipping: [localhost] => (item=~/.awx/pgdocker/10/data) 

TASK [local_docker : Stop AWX before upgrading postgres] *************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Upgrade Postgres] *******************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Copy old pg_hba.conf] ***************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Remove old data directory] **********************************************************************************************************************
ok: [localhost]

TASK [local_docker : Export Docker web image if it isnt local and there isnt a registry defined] *********************************************************************
skipping: [localhost]

TASK [local_docker : Export Docker task image if it isnt local and there isnt a registry defined] ********************************************************************
skipping: [localhost]

TASK [local_docker : Set docker base path] ***************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Ensure directory exists] ************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Copy web image to docker execution] *************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Copy task image to docker execution] ************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Load web image] *********************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Load task image] ********************************************************************************************************************************
skipping: [localhost]

TASK [local_docker : Set full image path for local install] **********************************************************************************************************
skipping: [localhost]

TASK [local_docker : Set DockerHub Image Paths] **********************************************************************************************************************
ok: [localhost]

TASK [local_docker : Create ~/.awx/awxcompose directory] *************************************************************************************************************
changed: [localhost]

TASK [local_docker : Create Redis socket directory] ******************************************************************************************************************
changed: [localhost]

TASK [local_docker : Create Memcached socket directory] **************************************************************************************************************
changed: [localhost]

TASK [local_docker : Create Docker Compose Configuration] ************************************************************************************************************
changed: [localhost] => (item=environment.sh)
changed: [localhost] => (item=credentials.py)
changed: [localhost] => (item=docker-compose.yml)
changed: [localhost] => (item=nginx.conf)
changed: [localhost] => (item=redis.conf)

TASK [local_docker : Set redis config to other group readable to satisfy redis-server] *******************************************************************************
changed: [localhost]

TASK [local_docker : Render SECRET_KEY file] *************************************************************************************************************************
changed: [localhost]

TASK [local_docker : Start the containers] ***************************************************************************************************************************

(venv) $ cd ..
(venv) $ docker-compose logs -f
...
...
the errors
...
...

I repeated the process with Python 3, since Python 2 is deprecated and AWX uses Python 3 to run Ansible. It's the very same routine as described above, except for the virtualenv command.

Replace

virtualenv -p python2 venv

with

virtualenv -p python3 venv

Checking the versions:

(venv) $ python --version
Python 3.6.9

(venv) $ ansible --version
ansible 2.9.7
  config file = None
  configured module search path = ['~/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = ~/Projects/awx/venv/lib/python3.6/site-packages/ansible
  executable location = ~/Projects/awx/venv/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]

(venv) $ pip freeze
ansible==2.9.7
attrs==19.3.0
bcrypt==3.1.7
cached-property==1.5.1
certifi==2020.4.5.1
cffi==1.14.0
chardet==3.0.4
cryptography==2.9.2
docker==4.2.0
docker-compose==1.25.5
dockerpty==0.4.1
docopt==0.6.2
idna==2.9
importlib-metadata==1.6.0
Jinja2==2.11.2
jsonschema==3.2.0
MarkupSafe==1.1.1
paramiko==2.7.1
pycparser==2.20
PyNaCl==1.3.0
pyrsistent==0.16.0
PyYAML==5.3.1
requests==2.23.0
six==1.14.0
texttable==1.6.2
urllib3==1.25.9
websocket-client==0.57.0
zipp==3.1.0

The very same errors pop up.

A long shot is that by some fluke you are using different versions of the involved Docker images. You might want to check the image IDs to make certain they are the same.

$ docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
redis               latest              a4d3716dbb72        13 hours ago        98.3MB
postgres            10                  b500168be260        17 hours ago        200MB
ansible/awx_task    11.0.0              83a56dfe4148        7 days ago          2.52GB
ansible/awx_web     11.0.0              ab9667094eac        7 days ago          2.48GB
memcached           alpine              acce7f7ac2ef        10 days ago         9.22MB

(why use both Redis and Memcached btw ?)

And a very long shot: it might make a difference that this is done in a virtual machine. I am using VirtualBox 6.1.16 on a Windows 10 host, as per company regulations.

JG127 commented 4 years ago

Maybe this will shed some light ...

While the task service is just sitting there it consumes 100% CPU. The processes:

# ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 11:14 ?        00:00:00 tini -- sh -c /usr/bin/launch_awx_task.sh
root         6     1  0 11:14 ?        00:00:00 bash /usr/bin/launch_awx_task.sh
root        69     6 95 11:15 ?        00:02:18 /var/lib/awx/venv/awx/bin/python3 /usr/bin/awx-manage migrate --noinput

Is there something I can try to find out what it is doing ?

ryanpetrello commented 4 years ago

What do the task logs/stdout say?

Maybe try something like strace or gdb on that awx-manage migrate process to see what it's doing?
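
For example, something along these lines (just a sketch; strace and gdb may first need to be installed in the image, and PID 69 is taken from the ps output above):

docker exec -it awx_task bash
strace -f -p 69                                # follow the system calls the migrate process is making
gdb -p 69 -batch -ex 'thread apply all bt'     # or grab a native backtrace of all its threads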

JG127 commented 4 years ago

No logging at all, neither in the Docker logs nor in the log files in /var/log. This means the issue happens very early in the code, before it logs anything. It might really be stuck on a network connection after all, if the timeout mechanism does not rely on blocking I/O. Surely it's such a silly problem it doesn't even come to mind :-)
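
One way to test that theory (untested so far; it assumes ss is available in the task image) would be to look at what the process is actually holding open:

# 69 is the PID of the busy awx-manage migrate process from the earlier ps output
docker exec awx_task ss -tnp | grep 'pid=69,'           # open TCP connections held by the process
docker exec awx_task cat /proc/69/status | grep State   # R = on CPU, S/D = sleeping/blocked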

Naf3tsR commented 4 years ago

I'm experiencing the same problem. I'm not able to start any release; I've tried 9.3.0, 10.0.0 and 11.1.0. My error looks like this:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/conf/settings.py", line 87, in _ctit_db_wrapper
    yield
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/conf/settings.py", line 415, in __getattr__
    value = self._get_local(name)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/conf/settings.py", line 358, in _get_local
    setting = Setting.objects.filter(key=name, user__isnull=True).order_by('pk').first()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 653, in first
    for obj in (self if self.ordered else self.order_by('pk'))[:1]:
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 274, in __iter__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1131, in execute_sql
    cursor = self.connection.cursor()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 233, in _cursor
    self.ensure_connection()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  no pg_hba.conf entry for host "172.20.0.6", user "awx", database "awx", SSL off

2020-04-24 13:23:30,507 ERROR    awx.conf.settings Database settings are not available, using defaults.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  no pg_hba.conf entry for host "172.20.0.6", user "awx", database "awx", SSL off
JG127 commented 4 years ago

If this were a Java-based system I would dump the threads or do a CPU profile. Python, however, is a bit new to me. Is there a way to CPU-profile a Python process?
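(For what it's worth, one way to get a thread dump or CPU sample of a running Python process is py-spy; this is just a sketch, assuming the tool can be installed where the process runs and that <PID> is the awx process id:)

# install the sampling profiler (assumption: pip is available on the host/container)
pip install py-spy

# print a thread dump of a running Python process, similar to a Java thread dump
py-spy dump --pid <PID>

# live top-like CPU view of the same process
py-spy top --pid <PID>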

stmorim commented 4 years ago

I ran into this problem today as well. I was able to fix it by deleting all of the running Docker containers and then running ansible-playbook -i inventory install.yml again. It took less than 2 minutes for the GUI to come up and I was able to log in.
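For reference, on a default local_docker install that boils down to something like the following (the container names are the installer defaults, so adjust if yours differ):

# stop and remove the existing AWX containers (assumes the default awx_* names)
docker rm -f $(docker ps -aq --filter "name=awx")

# re-run the installer playbook from the installer/ directory
ansible-playbook -i inventory install.yml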

Hope this helps.

Naf3tsR commented 4 years ago

I am getting the same error as @JG127.

Starting with:

awx_web      | Traceback (most recent call last):
awx_web      |   File "/usr/bin/awx-manage", line 8, in <module>
awx_web      |     sys.exit(manage())
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 152, in manage
awx_web      |     execute_from_command_line(sys.argv)
awx_web      |   File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
awx_web      |     utility.execute()

The fresh install does not work for me, for whatever reason.

In addition to that, I get the error I posted earlier. It says that pg_hba.conf is not configured properly and there is no entry for host "172.20.0.6".

Therefore, I investigated pg_hba.conf and found no allow entry for that host. I then modified pg_hba.conf as follows, allowing everything in 172.16.0.0/12:

# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
host    all             all             172.16.0.0/12           trust

After saving the changes and starting the containers again, the problem is gone.

It looks like, because of the error in __init__.py, pg_hba.conf does not get populated correctly.
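For anyone wanting to apply the same change without entering the container interactively, something like this should work (assuming the containerized database, the default awx superuser, and the data path shown in the postgres logs):

# append the allow rule to pg_hba.conf inside the postgres container
docker exec awx_postgres bash -c 'echo "host    all             all             172.16.0.0/12           trust" >> /var/lib/postgresql/data/pgdata/pg_hba.conf'

# reload the configuration so the new rule takes effect without a full restart
docker exec awx_postgres psql -U awx -d awx -c "SELECT pg_reload_conf();"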

Benoit-LAGUET commented 4 years ago

@Naf3tsR It works for me too! Good job and thanks :)

JG127 commented 4 years ago

Yes! :-) Can this be fixed via Docker (Compose)? I'd rather not hack my way in.

sangrealest commented 4 years ago

@Naf3tsR Thanks, it works for me. I'm using version 11.2.0, which still has the issue.

alishahrestani commented 4 years ago

Hi, I have the same problem; it was resolved by using PostgreSQL 12.

JG127 commented 4 years ago

Upgrading to PostgreSQL 12 didn't help for me.

bryanasdev000 commented 4 years ago

I'm having the same problem with version 11.2.0 using docker-compose; however, on the second attempt it always works. I'm using PostgreSQL in compose as well. My logs are similar to @roedie's.

Out of curiosity, did anyone have the problem using OpenShift or Kubernetes?

mstrent commented 4 years ago

Ran into this with 11.2.0 today myself. As reported by others, I deleted all the containers, ran the Ansible installer again, and it worked.

zdriic commented 4 years ago

Hi, in my inventory file I have set two hosts to deploy AWX 11.2.0 on two nodes (local_docker mode), but I get the error below in the web container AFTER changing the /etc/tower/settings.py file in the web and task containers on myHost2 only (param CLUSTER_HOST_ID changed from "awx" to "myHost2"):

On myHost1 this param is set to "awx" by default.

> inventory file:

[tower]
myHost1 ansible_ssh_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
myHost2 ansible_ssh_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa

[all:vars]
ansible_python_interpreter="/usr/bin/env python"

> Error in the web container from myHost2:

2020-06-02 12:11:09,744 WARNING awx.main.wsbroadcast Connection from myHost2 to awx failed: 'Cannot connect to host awx:443 ssl:False [Name or service not known]'.

> Error in the web container from myHost1 (awx):

2020-06-02 12:13:45,182 WARNING awx.main.wsbroadcast Connection from awx to myHost2 failed: 'Cannot connect to host myHost2:443 ssl:False [[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:897)]'.

==> Have you ever seen this error?

bryanasdev000 commented 4 years ago

The container doesn't have DNS to resolve the other container; that's your problem, and it's quite different from the others here. If you want to cluster AWX, take a look at https://github.com/sujiar37/AWX-HA-InstanceGroup/issues/26.

zdriic commented 4 years ago

So we can't add two hosts in the inventory file?? I thought we could...

bryanasdev000 commented 4 years ago

AFAIK you can, but I'm not sure it will cluster itself; besides, you are editing /etc/tower/settings.py manually. Without an orchestration tool, the containers will not see the others beyond their own compose.

bpetit commented 4 years ago

Hi,

I have the exact same behavior with a fresh 12.0.0 install with docker-compose, using python3 and Debian 10 as the host. I'll add more info as I dig deeper.

EDIT

I had the same issue after I ran the installer.

I forced a migration with awx-manage migrate in the awx_task container. Then only this error keeps recurring:

Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 154, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 55, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/reaper.py", line 38, in reap
    (changed, me) = Instance.objects.get_or_register()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 158, in get_or_register
    return (False, self.me())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 116, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

At that point I couldn't log in to the webUI with the default admin credentials.

I forced the creation of a superuser with awx-manage createsuperuser and then I could log in; the web app seems to work fine so far, but I still see the previous error every 10 seconds.
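From the host, those two steps look roughly like this (assuming the default awx_task container name):

# apply the pending database migrations inside the task container
docker exec -it awx_task awx-manage migrate

# create an admin account if the default one was never created
docker exec -it awx_task awx-manage createsuperuser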

So something still seems to be broken after the migration, but I don't know what yet.

JG127 commented 4 years ago

Any chance 13.0.0 solves it?

bpetit commented 4 years ago

I just tried a fresh install with 13.0.0 (docker-compose mode on Debian 10). It seems to give the "main_instance" error too:

2020-06-26 08:38:09,393 INFO exited: dispatcher (exit status 1; not expected)
2020-06-26 08:38:09,393 INFO exited: dispatcher (exit status 1; not expected)
2020-06-26 08:38:10,406 INFO spawned: 'dispatcher' with pid 160
2020-06-26 08:38:10,406 INFO spawned: 'dispatcher' with pid 160
2020-06-26 08:38:11,409 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-26 08:38:11,409 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedTable: relation "main_instance" does not exist
LINE 1: SELECT (1) AS "a" FROM "main_instance" WHERE "main_instance"...
                               ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 154, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 55, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/reaper.py", line 38, in reap
    (changed, me) = Instance.objects.get_or_register()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 144, in get_or_register
    return (False, self.me())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 100, in me
    if node.exists():
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/query.py", line 766, in exists
    return self.query.has_results(using=self.db)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/sql/query.py", line 522, in has_results
    return compiler.has_results()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1110, in has_results
    return bool(self.execute_sql(SINGLE))
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "main_instance" does not exist
LINE 1: SELECT (1) AS "a" FROM "main_instance" WHERE "main_instance"...

Then I ran awx-manage migrate which left me with:

2020-06-26 08:46:19,276 INFO exited: dispatcher (exit status 1; not expected)
2020-06-26 08:46:19,276 INFO exited: dispatcher (exit status 1; not expected)
2020-06-26 08:46:20,287 INFO spawned: 'dispatcher' with pid 801
2020-06-26 08:46:20,287 INFO spawned: 'dispatcher' with pid 801
2020-06-26 08:46:21,291 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-26 08:46:21,291 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2020-06-26 08:46:23,396 WARNING awx.main.dispatch.periodic periodic beat started
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/__init__.py", line 154, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/run_dispatcher.py", line 55, in handle
    reaper.reap()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/reaper.py", line 38, in reap
    (changed, me) = Instance.objects.get_or_register()
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 144, in get_or_register
    return (False, self.me())
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/managers.py", line 102, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

bryanasdev000 commented 4 years ago

Tried the latest version (v13.0.0) on a clean env (Debian 10) and didn't have issues.

NOTE: Using an external PostgreSQL 11.8 database.

anxstj commented 4 years ago

I see this problem in my CI job that builds my awx containers. Approximately one out of 10 starts fails.

It seems that something is killing (or restarting) the postgres container quite early:

root@runner-hgefapak-project-60-concurrent-1:/# docker logs -f awx_postgres
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data/pgdata ... ok
...
server started
CREATE DATABASE
...
2020-07-22 14:30:45.852 UTC [76] ERROR:  relation "django_migrations" does not exist at character 124
2020-07-22 14:30:45.852 UTC [76] STATEMENT:  SELECT "django_migrations"."id", "django_migrations"."app", "django_migrations"."name", "django_migrations"."applied" FROM "django_migrations" WHERE ("django_migrations"."app" = 'main' AND NOT ("django_migrations"."name"::text LIKE '%squashed%')) ORDER BY "django_migrations"."id" DESC  LIMIT 1
root@runner-hgefapak-project-60-concurrent-1:/#     # !!!! HERE something has stopped the container !!!!
root@runner-hgefapak-project-60-concurrent-1:/# docker logs -f awx_postgres
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
...

Which in turn causes the migration task inside the awx_task container to fail:

root@runner-hgefapak-project-60-concurrent-1:/# docker logs -f awx_task
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "elapsed": 0,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 5432,
    "search_regex": null,
    "state": "started"
}
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/libexec/platform-python"
    },
    "changed": false,
    "db": "awx"
}
Operations to perform:
  Apply all migrations: auth, conf, contenttypes, main, oauth2_provider, sessions, sites, social_django, sso, taggit
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying taggit.0001_initial... OK
  Applying taggit.0002_auto_20150616_2121... OK
  Applying auth.0001_initial... OK
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib/python3.6/site-packages/django/db/backends/utils.py", line 82, in _execute
    return self.cursor.execute(sql)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request

Restarting the awx_task container seems to restart the migration process, which then works (supervisorctl restart all doesn't help).

So the question is, what is restarting the awx_postgres container?

From the postgres container entrypoint:

            docker_temp_server_start "$@"

            docker_setup_db
            docker_process_init_files /docker-entrypoint-initdb.d/*

            docker_temp_server_stop

If the migrate process starts during the PostgreSQL initialization, the connection will be dropped as soon as the temporary server stops.
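As an illustration only (not what the installer currently does): one way to avoid racing the temporary init server would be to require the database to answer for a sustained period before starting the migration, e.g. with pg_isready. This sketch assumes pg_isready is available in the task container and that the database is reachable as awx_postgres on port 5432:

# wait until postgres has answered for 10 consecutive seconds, so the
# short-lived server used during initdb is less likely to be mistaken
# for the final one (a heuristic, not a guarantee)
ok=0
until [ "$ok" -ge 10 ]; do
  if pg_isready -h awx_postgres -p 5432 -U awx >/dev/null 2>&1; then
    ok=$((ok + 1))
  else
    ok=0
  fi
  sleep 1
done

awx-manage migrate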

johnrkriter commented 4 years ago

Came here while investigating a release 13.0.0 install issue that was also giving me ERROR: relation "conf_setting" does not exist at character 158.

I ended up reverting to release 12.0.0 while troubleshooting, assuming it was a dirty release. This is a fresh docker-compose install using containerized Postgres.

I was able to eventually get into the web UI for 12.0.0 using 2 steps.

  1. (Credit @JG127) I first exec'd into the awx_web container and ran awx-manage migrate. This completed the migration without errors.
  2. (Credit @anxstj) I then restarted the awx_task container. I can now see that postgres appears happy.

This appears to be a valid workaround for a new docker-compose install, at least with my config.

tumluliu commented 4 years ago

Ran into the same UndefinedTable: relation "main_instance" does not exist problem with 14.1.0 installed with docker-compose and vanilla settings. Fixed by:

  1. running awx-manage migrate within awx_task container
  2. restarting awx_task container

as mentioned by @johnrkriter (thanks a lot!). But I could not understand what the major difficulty is in resolving this.

kakkotetsu commented 4 years ago

I had the same problem, and it was resolved the same way... (Ubuntu 20.04.1 + AWX 14.1.0 + docker-compose clean install)

arashnikoo commented 3 years ago

The workaround works for me as well. Is this going to be fixed?

Nuttymoon commented 3 years ago

This issue is still around with AWX 15.0.0 on a docker-compose deployment. The workaround from @johnrkriter works:

docker exec awx_task awx-manage migrate
docker container restart awx_task
JG127 commented 3 years ago

It's a pity that nobody from the project seems interested... :-(

jklare commented 3 years ago

This looks like a race condition between the pg container and the awx_task container. Since I am not familiar with the project structure it will probably take me some time to find the right place to look :) I will update this as soon as I find something.

jklare commented 3 years ago

So I think the description by @anxstj is pretty accurate and complete; we now just need to figure out a good way to wait for the postgres init to finish before we start the migration. Does anybody have a good idea how to do that?

jklare commented 3 years ago

Judging from the issue itself, I think the best option to fix this is to make the awx_task container (or the script running in it) fail if the migration fails, instead of trying to continue with something that will never succeed. The DB migrations themselves should be idempotent, so just failing and starting fresh should be fine.
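A minimal sketch of that idea, assuming the task container's launch script is a shell script that currently tolerates the failure:

set -e              # abort the script (and therefore the container) as soon as any step fails

awx-manage migrate  # if the DB is not ready yet, this exits non-zero and the container
                    # can be restarted to retry from a clean state
# ...only start the regular services (supervisord or equivalent) after the migration succeeded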

dasJ commented 3 years ago

Does it work after applying #8497?

jklare commented 3 years ago

@dasJ I think the patch could still run into the issue where the "while loop" exits because it successfully accesses the instance of the "docker_temp_server" that will be killed directly afterwards. It makes it a lot less likely, but I do not see how the patch would completely avoid this case. I think the real fix here is to make the awx_task container completely fail on this error and start fresh (not sure if the "set -e" is enough here, since IIRC all errors are still retried when running "awx-manage migrate").

dasJ commented 3 years ago

@jklare wdym by "docker_temp_server"? I can neither find it in this repo nor when googling.