ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX 19.1: jobs stuck in "running" for over 12 hours but not actually doing anything at all #10151

Open lemmy04 opened 3 years ago

lemmy04 commented 3 years ago
ISSUE TYPE
SUMMARY

Since yesterday my AWX no longer works: jobs get created and go to running, but they never actually start, i.e. no worker container is created and no actual work is done.

ENVIRONMENT
STEPS TO REPRODUCE

I start a new job by manually launching either an SCM update or a job or workflow template

  1. job gets created
  2. job goes from pending to running
  3. NO new worker container is created in k3s
  4. NO actual work is done, no standard output on the job, job never finishes
EXPECTED RESULTS

I would expect the job to either work or fail at some point with some (meaningful) error message

ACTUAL RESULTS

The job never actually starts - it just sits there in "running" while nothing actually happens.

ADDITIONAL INFORMATION

No errors on the K3s side of things - it looks more like AWX never even tries to talk to K3s anymore. BUT: ad-hoc jobs still work... this only affects templates and SCM updates (or maybe more, I have not tried every possible thing).

supremedacos commented 3 years ago

I see this happening as well, except my playbooks successfully start and complete and the k8s pod terminates cleanly, but the UI still shows the job in the Running state.

wenottingham commented 3 years ago

Hi - can you get the logs from the awx-task container related to the launch of the job? We'd need more information to determine what's going on here.
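Something along these lines should pull them (a rough sketch, assuming the operator's default awx namespace and pod naming; adjust to your install):

kubectl -n awx get pods                                        # find the AWX pod name
kubectl -n awx logs <awx-pod-name> -c awx-task --tail=500 -f   # follow the task container's logs while launching a job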

lemmy04 commented 3 years ago

... I haven't touched kubernetes before at all, until AWX 18 came out... what exactly do you need?

lemmy04 commented 3 years ago
2021-05-15 06:03:29,735 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task ac491d10-dfb9-4384-a0f7-ff0bed326388 starting awx.main.tasks.cluster_node_heartbeat(*[])
2021-05-15 06:03:29,738 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Cluster node heartbeat task.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:03:29,944 WARNING  [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch scaling up worker pid:4637
2021-05-15 06:03:29,930 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 069050b2-4f4a-4a40-ae4d-22e3d12e8692 starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:03:29,929 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 61ef1476-a7b5-4d41-abba-e4406d2f1280 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:29,959 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Running task manager.
2021-05-15 06:03:29,932 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task b2f8cbd1-f0cb-4bfc-832f-774099510166 starting awx.main.tasks.awx_k8s_reaper(*[])
2021-05-15 06:03:30,028 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch task 5dc51ed6-7f26-4ea4-84fb-1967617ad64a starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:03:30,053 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Starting periodic scheduler
2021-05-15 06:03:30,064 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Last scheduler run was: 2021-05-15 06:02:59.623461+00:00
2021-05-15 06:03:30,134 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:30,225 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.tasks Checking for orphaned k8s pods for tower-1.
2021-05-15 06:03:30,327 DEBUG    [7d8140cba46b442caf544b74ee77a3e5] awx.main.scheduler Finishing Scheduler
2021-05-15 06:03:32,315 WARNING  [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch scaling down worker pid:4628
2021-05-15 06:03:32,316 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch task 40e94a9d-513a-45dc-8cb1-f822ceed040d starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:32,318 WARNING  [7d8140cba46b442caf544b74ee77a3e5] awx.main.dispatch worker exiting gracefully pid:4628
2021-05-15 06:03:32,329 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Running task manager.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:03:32,364 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:32,640 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 acknowledged
2021-05-15 06:03:32,690 WARNING  [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler No available capacity to run containerized <project_update 10030 (waiting)>.
2021-05-15 06:03:32,830 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 waiting
2021-05-15 06:03:32,831 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler project_update 10030 (waiting) consumed 0 capacity units from tower with prior total of 0
2021-05-15 06:03:32,832 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.scheduler Finishing Scheduler
2021-05-15 06:03:32,894 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.main.dispatch task 0c40df72-e14f-47fc-9f4c-79c6aea5503a starting awx.main.tasks.RunProjectUpdate(*[10030])
2021-05-15 06:03:33,561 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 pre run
2021-05-15 06:03:33,581 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 preparing playbook
2021-05-15 06:03:33,725 DEBUG    [a7372df73eb140f4b35634cc1de6f3d2] awx.analytics.job_lifecycle projectupdate-10030 running playbook
2021-05-15 06:03:42,842 DEBUG    [-] awx.main.commands.run_callback_receiver 27 is alive
2021-05-15 06:03:42,842 DEBUG    [-] awx.main.commands.run_callback_receiver 27 is alive
2021-05-15 06:03:49,893 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.dispatch task c315fc3a-bc0b-45bf-ab15-a733c305fc59 starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:03:49,895 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Running task manager.
2021-05-15 06:03:49,985 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.dispatch task 93290266-03db-4e82-8149-cfeabb045a4d starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:03:50,020 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Starting Scheduler
2021-05-15 06:03:50,247 DEBUG    [48f472980b8f4f2fa2dae02f4cd9e3dd] awx.main.scheduler Finishing Scheduler
2021-05-15 06:04:00,015 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.dispatch task 32e4017f-f5cf-4ac8-8e7a-af2f14c32beb starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:04:00,046 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.tasks Starting periodic scheduler
RESULT 2
OKREADY
2021-05-15 06:04:00,055 DEBUG    [4d60fdea098b4adc8ddd989ee154d632] awx.main.tasks Last scheduler run was: 2021-05-15 06:03:30.054996+00:00
2021-05-15 06:04:10,048 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.dispatch task 2f5448fb-6d28-4935-b776-0866084fce7d starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:04:10,052 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Running task manager.
2021-05-15 06:04:10,127 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Starting Scheduler
2021-05-15 06:04:10,219 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.dispatch task ee9db357-b00a-4056-ac63-a34c9a89ce68 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:04:10,424 DEBUG    [82e2362b5b064dfca43b4eb30c004fc8] awx.main.scheduler Finishing Scheduler
2021-05-15 06:04:30,227 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task 4712463f-c29f-4fc2-9f1b-f52c28dabe8d starting awx.main.tasks.cluster_node_heartbeat(*[])
2021-05-15 06:04:30,230 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Cluster node heartbeat task.
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
Instance Group already registered tower
(changed: True)
2021-05-15 06:04:30,431 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task dd11c919-b9bb-488e-9fa7-e4e2a28d1cc1 starting awx.main.tasks.awx_k8s_reaper(*[])
2021-05-15 06:04:30,442 WARNING  [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch scaling up worker pid:4644
2021-05-15 06:04:30,431 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task fb8d52f3-256d-49d4-9abd-709362f63fbb starting awx.main.tasks.awx_periodic_scheduler(*[])
2021-05-15 06:04:30,531 WARNING  [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch scaling up worker pid:4646
2021-05-15 06:04:30,532 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task c8a824dd-97c4-4d21-b330-c37b1a631f7e starting awx.main.scheduler.tasks.run_task_manager(*[])
2021-05-15 06:04:30,543 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.dispatch task 18217501-f7aa-4e35-90de-e49c14d71140 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2021-05-15 06:04:30,552 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Running task manager.
2021-05-15 06:04:30,630 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Starting periodic scheduler
2021-05-15 06:04:30,648 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Checking for orphaned k8s pods for tower-1.
2021-05-15 06:04:30,649 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.tasks Last scheduler run was: 2021-05-15 06:04:00.052460+00:00
2021-05-15 06:04:30,741 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Starting Scheduler
2021-05-15 06:04:31,229 DEBUG    [780f73b06160467781e0d4b02f1463d9] awx.main.scheduler Finishing Scheduler

kubectl -n awx logs -f on the awx-task container shows this during an SCM update that never actually starts...

supremedacos commented 3 years ago

I think I had a similar problem with SCM updates. Does your project contain a requirements.yml file? And if so, did you create an Ansible Galaxy/Automation Hub API Token credential and assign it to your organization?
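If you are not sure, a quick check against a local clone of the project is something like this (a sketch; these are the paths a project update scans, adjust the checkout path):

cd /path/to/project/checkout
# A project update runs ansible-galaxy install when either of these exists:
ls collections/requirements.yml roles/requirements.yml 2>/dev/null || echo "no requirements files found"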

lemmy04 commented 3 years ago

No requirements.yml files anywhere in my projects, and my AWX does have the one Galaxy token that gets created and assigned during deployment.

lemmy04 commented 3 years ago

I just deleted my AWX custom resource (the awxs object) on my k3s and redeployed - same issue: SCM jobs don't start at all.
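For reference, the delete/redeploy was roughly this (assuming the custom resource and namespace are both named awx and the spec lives in awx.yml; adjust names):

kubectl -n awx get awx                  # list AWX custom resources
kubectl -n awx delete awx awx           # remove the CR; the operator tears down the deployment
kubectl -n awx apply -f awx.yml         # re-create it from the spec file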

shanemcd commented 3 years ago

Is there anything helpful in the logs of the in-pod awx-ee container?

lemmy04 commented 3 years ago

not really:

INFO 2021/05/21 18:03:22 Client connected to control service
INFO 2021/05/21 18:03:22 Control service closed
INFO 2021/05/21 18:03:22 Client disconnected from control service
INFO 2021/05/21 18:03:22 Client connected to control service
DEBUG 2021/05/21 18:03:24 Sending service advertisement: &{awx-9b7494dfb-jgk9m control 2021-05-21 18:03:24.48428354 +0000 UTC m=+38108.977573786 1 map[] []}

that is literally all there is.

lemmy04 commented 3 years ago

I have deleted the whole namespace in Kubernetes, deleted the operator, and deployed from scratch - and it's still the same: project sync jobs do not start at all, and there is nothing in the logs that would explain why.

lemmy04 commented 3 years ago

I've actually found something in the logs:

2021-05-28 11:25:17,565 WARNING  [6478de4b690b4854bb03466b65b1b70d] awx.main.scheduler No available capacity to run containerized <project_update 10116 (waiting)>.
2021-05-28 11:25:17,720 DEBUG    [6478de4b690b4854bb03466b65b1b70d] awx.analytics.job_lifecycle projectupdate-10116 waiting

"No available capacity to run ..." - where does that come from, and why does it affect only SCM updates?

lemmy04 commented 3 years ago

I've reconfigured my AWX to have much bigger resource limits, but I still get this message:

2021-06-01 06:05:35,058 WARNING [d03db904726b4c5abc80773ea8897714] awx.main.scheduler No available capacity to run containerized <project_update 10127 (waiting)>.

and the SCM update is not happening.
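For context, the kind of change I mean is along these lines (a sketch against the operator CR; field names as I understand the awx-operator spec, with the CR and namespace both assumed to be named awx):

kubectl -n awx patch awx awx --type merge -p '{"spec":{"task_resource_requirements":{"requests":{"cpu":"500m","memory":"1Gi"},"limits":{"cpu":"2","memory":"4Gi"}}}}'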

lemmy04 commented 3 years ago

there is also a file not found error on awx-ee:

ERROR 2021/06/01 06:05:17 Error updating status file /tmp/receptor/awx-5f9c75f6b6-ks7jz/K72xh22I/status: open /tmp/receptor/awx-5f9c75f6b6-ks7jz/K72xh22I/status.lock: no such file or directory.

When I look inside the actual container I find that there is no file by that name.

Instead, there is /tmp/receptor/awx-5f9c75f6b6-ks7jz/Zg7PbqHR/status.
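To see what receptor thinks those work unit directories are, something like this can be run from inside the awx-ee container (a sketch; the socket path is an assumption, adjust to your receptor config):

kubectl -n awx exec -it <awx-pod-name> -c awx-ee -- receptorctl --socket /var/run/receptor/receptor.sock work list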

JRNC commented 3 years ago

This sounds like it could be the same issue that some of us are having in issue #10489

lemmy04 commented 3 years ago

Looks pretty much like the same thing. I can't really verify it, though - I've given up on AWX > 17.1 for now, and since the deployment with database migration is also broken, I have no way to spin up a latest-version AWX in my Kubernetes with enough history for it to trigger.

JRNC commented 3 years ago

Hey @lemmy04, I have actually been bypassing the database migration workflow and manually dumping the database and then restoring it. I've added some steps below. I hope this helps!

On the old server (stop all containers except postgres first):

sudo docker exec awx_postgres pg_dump -U DATABASEUSERNAME -Ft -f backup.tar awx
sudo docker cp awx_postgres:backup.tar ./
sudo docker exec awx_postgres rm backup.tar

Send the tarball to the new server and then delete the copy that was just created on the old server.

On the new server, after deploying a bare install of AWX (make sure you run the commands below against the appropriate namespace):

kubectl scale --replicas=0 deployment awx
kubectl scale --replicas=0 deployment awx-operator
kubectl exec -it awx-postgres-0 -- /bin/bash
psql -U DATABASEUSERNAME postgres
DROP DATABASE awx;
CREATE DATABASE awx;
exit
exit
kubectl cp backup.tar awx/awx-postgres-0:/
kubectl exec -it awx-postgres-0 -- /bin/bash
pg_restore -U DATABASEUSERNAME -Ft backup.tar -d awx
rm backup.tar
exit
rm backup.tar
kubectl scale --replicas=1 deployment awx-operator
kubectl scale --replicas=1 deployment awx
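A quick sanity check after the restore, to confirm the data made it across (a sketch; table name from the AWX schema, adjust user and namespace to your install):

kubectl exec -it awx-postgres-0 -- psql -U DATABASEUSERNAME -d awx -c "SELECT count(*) FROM main_unifiedjob;"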

stefanpinter commented 2 years ago

I also have a project update stuck at running, the second time already. It is bad because it also blocks other jobs (they stay in pending). Using AWX 19.3.0.

Identity added: /tmp/pdd_wrapper_7925_uf780rcs/awx_7925_8hqi355q/artifacts/7925/ssh_key_data (ansible@vpadm002)

PLAY [Update source tree if necessary] *

TASK [update project using git] ****
ok: [localhost]

TASK [Set the git repository version] **
ok: [localhost]

TASK [Repository Version] **
ok: [localhost] => {
    "msg": "Repository Version 6d46b547c250e7595c8c5f62c91a273eccb888c7"
}

PLAY [Install content with ansible-galaxy command if necessary] ****

TASK [Check content sync settings] *****

backaf commented 2 years ago

I have the same issue on 19.4.0: project update jobs get stuck at "Check content sync settings". It happens randomly. All scheduled jobs are failing because the projects can't be updated. This is a fresh install using the official AWX operator.

SudoNova commented 2 years ago

Same here

INFO 2021/12/26 06:22:48 Client connected to control service
INFO 2021/12/26 06:22:48 Control service closed
INFO 2021/12/26 06:22:48 Client disconnected from control service
ERROR 2021/12/26 06:22:49 Error updating status file /tmp/receptor/awx-5b4cff67f5-mh552/9LQAkSuW/status: open /tmp/receptor/awx-5b4cff67f5-mh552/9LQAkSuW/status.lock: no such file or directory.
ERROR 2021/12/26 06:22:49 Error creating pod: context canceled

I don't understand the "Error creating pod" part (what pod?), but the "Error updating status file" message is the common element.
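If I understand it correctly, the pod in that message would be the per-job execution pod that AWX asks Kubernetes to create for each job in the container group. Something like this (namespace assumed) would show whether it ever appears while a job is stuck:

kubectl -n awx get pods -w | grep automation-job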

shanemcd commented 2 years ago

Are folks still seeing this?

navvisit commented 2 years ago

I last saw it on the 12th of April, shortly after upgrading AWX to 20.1.0. It used to fail more often before; one thing I noticed is that it gets a bit more stable if I delete the AWX pods.

I have a feeling it's better in 20.1.0; I can report back if I still see it.

PurpEE commented 1 year ago

We are currently seeing this on two instances after upgrading from 21.4 to 21.7. We tried to cancel the job and also redeployed the whole deployment, but it keeps happening.

Canceling it produces this output in the web container:

[awx-77dcb9477f-2r2vf awx-web] xx.xxx.xx.x - - [14/Oct/2022:09:46:44 +0000] "POST /api/v2/workflow_jobs/7136/cancel/ HTTP/1.1" 202 0 "https://awxurl.de/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" "xx.xxx.x.xxx, xx.xxx.xx.xxx"                                                                                                         
[awx-77dcb9477f-2r2vf awx-web] [pid: 35|app: 0|req: 253/382] xx.xxx.xx.x () {68 vars in 2074 bytes} [Fri Oct 14 09:46:44 2022] POST /api/v2/workflow_jobs/7136/cancel/ => generated 0 bytes in 265 msecs (HTTP/1.1 202) 13 headers in 552 bytes (1 switches on core 0)

The ee container gives this (the same output as the ee container of all other pods):

[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client connected to control service @                                                                                         
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client connected to control service @                                                                                                 
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Control service closed                                                                                                                
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client disconnected from control service @                                                                                            
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Control service closed                                                                                                                
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:46 Client disconnected from control service @                                                                                            
[awx-77dcb9477f-g5h86 awx-ee] DEBUG 2022/10/14 09:46:55 Sending service advertisement: &{awx-77dcb9477f-g5h86 control 2022-10-14 09:46:55.947685586 +0000 UTC m=+63426.817330184 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}

Manually trying to POST /api/v2/workflow_jobs/7136/cancel/ produces the same output as seen in the web container above.
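In case it matters, the manual POST was essentially something like this (a sketch; the token is a placeholder):

curl -sk -X POST -H "Authorization: Bearer $AWX_TOKEN" https://awxurl.de/api/v2/workflow_jobs/7136/cancel/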

Trying to delete it produces:

[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,526 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] awx.api.generics status 403 received by user foo.bar attempting to access /api/v2/workflow_jobs/7136/ from xx.xxx.xx.x                                                                                                                                                               
[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,562 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] django.request Forbidden: /api/v2/workflow_jobs/7136/                                     
[awx-77dcb9477f-2r2vf awx-web] 2022-10-14 09:57:11,562 WARNING  [b41cc4b4803647f1bbfdf2df5ffa0784] django.request Forbidden: /api/v2/workflow_jobs/7136/                                      
[awx-77dcb9477f-2r2vf awx-web] xx.xxx.xx.x - - [14/Oct/2022:09:57:11 +0000] "DELETE /api/v2/workflow_jobs/7136/ HTTP/1.1" 403 9343 "https://awxurl.de/api/v2/workflow_jobs/7136/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0" "xx.xxx.x.xxx, xx.xxx.xx.xxx"                                                                                 
[awx-77dcb9477f-2r2vf awx-web] [pid: 36|app: 0|req: 21/390] xx.xxx.xx.x () {70 vars in 2207 bytes} [Fri Oct 14 09:57:11 2022] DELETE /api/v2/workflow_jobs/7136/ => generated 9343 bytes in 212 msecs (HTTP/1.1 403) 15 headers in 709 bytes (1 switches on core 0)

In both cases the last job within the workflow shows the status Error in AWX (looking at the status via /api/v2/jobs/7141/):

"status": "error",

It also shows the following result traceback:

"result_traceback": "Traceback (most recent call last):
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 372, in _run_internal
    resultsock = receptor_ctl.get_work_results(self.unit_id, return_sockfile=True)
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py\", line 248, in get_work_results
    self.writestr(f\"work results {unit_id}\
\")
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py\", line 49, in writestr
    self._sockfile.flush()
  File \"/usr/lib64/python3.9/socket.py\", line 722, in write
    return self._sock.send(b)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py\", line 537, in run
    res = receptor_job.run()
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 268, in run
    res = self._run_internal(receptor_ctl)
  File \"/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py\", line 382, in _run_internal
    raise RuntimeError(detail)
RuntimeError
",

but the kubernetes job logs say the job has been canceled.

October 10th 2022, 11:05:35.922 {"status": "canceled", "runner_ident": "7141"} automation-job-7141-77cp7

Also here is the log of the pod at the end:

October 10th 2022, 11:01:12.235  2022-10-10,230 INFO      awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7141
October 10th 2022, 11:01:12.235  2022-10-10,230 INFO      awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7141
October 10th 2022, 11:01:12.479  2022-10-10,477 DEBUG     awx.main.tasks.jobs job 7141 (running) finished running, producing 24 events.
October 10th 2022, 11:01:12.479  2022-10-10,478 DEBUG     awx.analytics.job_lifecycle job-7141 post run
October 10th 2022, 11:01:12.479  2022-10-10,477 ERROR     awx.main.tasks.jobs job 7141 (running) Exception occurred while running task
October 10th 2022, 11:01:12.703  2022-10-10,697 DEBUG     awx.analytics.job_lifecycle job-7141 finalize run
October 10th 2022, 11:01:12.771  2022-10-10,737 ERROR     awx.main.models.unified_jobs job 7141 (error) failed to emit channel msg about status change
October 10th 2022, 11:01:12.783  2022-10-10,782 DEBUG     awx.main.tasks.system Executing error task id a492b12a-42d4-480a-95ce-0e3f1b5f7475, subtasks: [{'type': 'job', 'id': 7141}]
October 10th 2022, 11:01:12.783  2022-10-10,775 WARNING   awx.main.dispatch job 7141 (error) encountered an error (rc=None), please see task stdout for details.

root3200 commented 10 months ago

Hi, if anyone is currently hitting this bug with tasks stuck in running and is using the AWX operator, this may help resolve the error.

I installed the AWX operator with k3s on a Debian 12 virtual machine. The bug appeared while the virtual machine had only 1 CPU core assigned. I'm still not entirely sure why the error occurs, but once I assigned 2 CPU cores, AWX tasks worked correctly without any problem. I hope this is helpful.
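If it helps anyone compare, the CPU the node actually advertises to Kubernetes can be checked with something like this (a sketch):

kubectl describe node | grep -A 6 Allocatable   # what k3s reports as schedulable resources
nproc                                           # cores visible inside the VM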