ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX is not processing the jobs #14299

Open klepiz opened 1 year ago

klepiz commented 1 year ago

Bug Summary

This is the error I get after starting a new job on AWX:

--- kubectl -n awx logs -f deployment/awx-task -c awx-task
2023-07-21 14:47:14,729 ERROR    [01683df8a8d740239e0f28d8a1be09f0] awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.task_manager(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/tasks.py", line 25, in task_manager
    run_manager(TaskManager, "task")
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/tasks.py", line 20, in run_manager
    manager().schedule()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 136, in schedule
    self._schedule()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 57, in inner
    result = func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 741, in _schedule
    self.process_tasks()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 710, in process_tasks
    self.process_pending_tasks(pending_tasks)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 57, in inner
    result = func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/task_manager.py", line 616, in process_pending_tasks
    task, instance_group_name=self.controlplane_ig.name, impact=control_impact, capacity_type='control'
AttributeError: 'NoneType' object has no attribute 'name'
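
The last frame reads `self.controlplane_ig.name`, so the exception means `self.controlplane_ig` was `None` when the task manager ran. The following is a minimal sketch of that failure mode, not AWX's actual code: `InstanceGroup`, `find_control_plane_group`, `control_group_name`, and the group name `"controlplane"` are hypothetical names chosen for illustration. It assumes the control plane group is looked up by name and that the lookup returns `None` when no such group is registered, which could plausibly happen with data restored from a much older release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for an instance group record."""
    name: str


def find_control_plane_group(
    groups: List[InstanceGroup], wanted: str = "controlplane"
) -> Optional[InstanceGroup]:
    """Return the group called `wanted`, or None when no such group exists."""
    return next((g for g in groups if g.name == wanted), None)


def control_group_name(groups: List[InstanceGroup]) -> str:
    """Fail with a descriptive error instead of an AttributeError on None."""
    group = find_control_plane_group(groups)
    if group is None:
        raise LookupError("no control plane instance group is registered")
    return group.name
```

With only a group named `"tower"` in the list, `find_control_plane_group(...)` returns `None`, and calling `.name` on it raises the same kind of `AttributeError` as in the log above. Listing the instance groups that actually exist in the migrated database would show whether this is what is happening here.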

On the awx-web container

sh-5.1$ awx-manage run_dispatcher
2023-07-26 19:13:48,217 WARNING  [-] awx.main.dispatch.periodic periodic beat started
2023-07-26 19:13:48,245 DEBUG    [-] awx.main.dispatch scaling up worker pid:613
2023-07-26 19:13:48,254 DEBUG    [-] awx.main.dispatch scaling up worker pid:614
2023-07-26 19:13:48,260 DEBUG    [-] awx.main.dispatch scaling up worker pid:615
2023-07-26 19:13:48,267 DEBUG    [-] awx.main.dispatch scaling up worker pid:616
2023-07-26 19:13:48,270 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-6fbff4f477-rp5jm']
2023-07-26 19:13:48,280 ERROR    [-] awx.main.dispatch Encountered unhandled error in dispatcher main loop
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 181, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 141, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 87, in dispatch_startup
    write_receptor_config()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 680, in write_receptor_config
    with lock:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_api.py", line 220, in __enter__
    self.acquire()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_api.py", line 173, in acquire
    self._acquire()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_unix.py", line 35, in _acquire
    fd = os.open(self._lock_file, open_mode)
PermissionError: [Errno 13] Permission denied: '/etc/receptor/receptor.conf.lock'
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 200, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_dispatcher.py", line 81, in handle
    consumer.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 181, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 141, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 87, in dispatch_startup
    write_receptor_config()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 680, in write_receptor_config
    with lock:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_api.py", line 220, in __enter__
    self.acquire()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_api.py", line 173, in acquire
    self._acquire()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/filelock/_unix.py", line 35, in _acquire
    fd = os.open(self._lock_file, open_mode)
PermissionError: [Errno 13] Permission denied: '/etc/receptor/receptor.conf.lock'

To address the issue above, I even tried setting mode 777 on that file in the awx-ee and awx-task containers, but that didn't fix it.
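
The traceback fails in `os.open(self._lock_file, open_mode)`. Whether that call succeeds depends on more than the file's own mode: if the lock file does not exist yet, the calling user needs write and search permission on the directory, and the path has to be checked inside the same container and as the same user that runs the dispatcher. The sketch below gathers those facts. `diagnose_lock_path` is a hypothetical helper, not part of AWX or filelock, and it assumes a Unix-like system.

```python
import os


def diagnose_lock_path(lock_path: str) -> dict:
    """Report the permission facts that decide whether lock_path can be opened for writing."""
    directory = os.path.dirname(os.path.abspath(lock_path))
    exists = os.path.exists(lock_path)
    return {
        # The user the check runs as; permissions are evaluated for this uid.
        "uid": os.getuid(),
        "file_exists": exists,
        # An existing file must itself be writable by this user.
        "file_writable": exists and os.access(lock_path, os.W_OK),
        # Creating a missing file needs write and search permission on the directory.
        "dir_writable": os.access(directory, os.W_OK | os.X_OK),
    }
```

Calling `diagnose_lock_path("/etc/receptor/receptor.conf.lock")` from a Python shell in the container where `awx-manage run_dispatcher` was started (awx-web in the output above) would show whether the 777 change made in awx-ee and awx-task is visible there at all.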

AWX version

22.3.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

All the pods are running and I can access AWX from the browser. I can see all my jobs, workflows, and projects from my old instance (we migrated the database from another instance running AWX v15.0.0), but when I try to run a job it gets stuck in "pending" status. There are no logs in the UI or in the job_explanation database column. However, the awx-task container logs show an error in task_manager.py (see the error in the bug summary).

Expected results

Execution of the job on awx

Actual results

It seems that AWX does not even start processing the job. (Screenshot: Screen Shot 2023-07-26 at 14 27 04)

Additional information

This is an almost fresh install of AWX on k3s, but we migrated our data from the previous AWX instance, which is on v15.0.0, so I am aware that the psql migration could be causing this issue. The way we performed that migration was

Running kubectl logs for that container, I can see several errors:

2023-06-07 22:18:52.325 UTC [172] ERROR:  column main_unifiedjob.execution_environment_id does not exist at character 234
2023-06-07 22:18:52.325 UTC [172] STATEMENT:  SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."preferred_instance_groups_cache", "main_unifiedjob"."task_impact", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id" FROM "main_unifiedjob" WHERE ("main_unifiedjob"."dependencies_processed" AND "main_unifiedjob"."status" IN ('pending', 'waiting', 'running') AND NOT ("main_unifiedjob"."launch_type" = 'sync') AND NOT ("main_unifiedjob"."polymorphic_ctype_id" = 69 AND "main_unifiedjob"."polymorphic_ctype_id" IS NOT NULL)) ORDER BY "main_unifiedjob"."created" ASC
2023-06-07 22:18:52.343 UTC [173] ERROR:  column main_unifiedjob.execution_environment_id does not exist at character 234
2023-06-07 22:18:52.343 UTC [173] STATEMENT:  SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."preferred_instance_groups_cache", "main_unifiedjob"."task_impact", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id" FROM "main_unifiedjob" WHERE (NOT "main_unifiedjob"."dependencies_processed" AND "main_unifiedjob"."status" IN ('pending') AND NOT ("main_unifiedjob"."launch_type" = 'sync') AND NOT ("main_unifiedjob"."polymorphic_ctype_id" = 69 AND "main_unifiedjob"."polymorphic_ctype_id" IS NOT NULL)) ORDER BY "main_unifiedjob"."created" ASC
2023-06-07 22:19:12.386 UTC [176] ERROR:  column main_schedule.execution_environment_id does not exist at character 313
2023-06-07 22:19:12.386 UTC [176] STATEMENT:  SELECT "main_schedule"."id", "main_schedule"."created", "main_schedule"."modified", "main_schedule"."description", "main_schedule"."created_by_id", "main_schedule"."modified_by_id", "main_schedule"."inventory_id", "main_schedule"."char_prompts", "main_schedule"."extra_data", "main_schedule"."survey_passwords", "main_schedule"."execution_environment_id", "main_schedule"."unified_job_template_id", "main_schedule"."name", "main_schedule"."enabled", "main_schedule"."dtstart", "main_schedule"."dtend", "main_schedule"."rrule", "main_schedule"."next_run" FROM "main_schedule" WHERE ("main_schedule"."enabled" AND "main_schedule"."next_run" < '2023-06-07T22:18:42.284075+00:00'::timestamptz) ORDER BY "main_schedule"."next_run" DESC NULLS LAST, "main_schedule"."id" ASC
2023-06-07 22:19:12.389 UTC [177] ERROR:  column main_unifiedjob.execution_environment_id does not exist at character 234

Is there a better way to migrate data from an old version of AWX (v15.0.0)? Could this data migration be the root cause of the issue in the awx-task container?
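
Every error in the excerpt above names a missing `execution_environment_id` column, which is consistent with a database schema older than what the running code expects. To see the full set of missing columns without reading each long STATEMENT line, the log can be summarised with a short script. This is a rough sketch: `missing_columns` is a hypothetical helper, and the regular expression assumes the error wording shown in the excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
#   ERROR:  column main_schedule.execution_environment_id does not exist at character 313
MISSING_COLUMN = re.compile(r"ERROR:\s+column (\w+)\.(\w+) does not exist")


def missing_columns(log_text: str) -> Counter:
    """Count (table, column) pairs reported as missing in PostgreSQL log text."""
    return Counter(MISSING_COLUMN.findall(log_text))
```

Feeding it the saved output of `kubectl logs` for the database pod gives one entry per table and column pair. If the list contains only columns added in later releases, that would point at schema migrations that were not applied to the restored data.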

Operator Logs

Pods status:

root@sl-tower:/home/sladmin# kubectl -n awx get pods
NAME                                               READY   STATUS    RESTARTS      AGE
awx-postgres-13-0                                  1/1     Running   1 (48d ago)   50d
awx-task-6fbff4f477-rp5jm                          4/4     Running   5 (48d ago)   49d
awx-web-5bd7f96b56-fmfqx                           3/3     Running   0             48d
awx-operator-controller-manager-74889d49c8-vc6b5   2/2     Running   0             48d

kustomization.yaml file

resources:
- manager.yaml

generatorOptions:
  disableNameSuffixHash: true

configMapGenerator:
- files:
  - controller_manager_config.yaml
  name: awx-manager-config

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: controller
  newName: quay.io/ansible/awx-operator
  newTag: 2.2.1

fosterseth commented 1 year ago

Did you see this guide for migrating from an older AWX to a newer one? https://github.com/ansible/awx/blob/devel/tools/docker-compose/docs/data_migration.md

If that doesn't work well, it might be best to deploy a fresh AWX and then use the awx collection or tower-cli send/receive to import your old data.