ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Add try/except on release of work Unit and add force to workunit reaper #15129

Open tanganellilore opened 3 weeks ago

tanganellilore commented 3 weeks ago
SUMMARY

If there is a problem between an execution node and AWX, and AWX does not detect that the execution node is misbehaving or unreachable, or the work unit is simply deleted (I could not identify the exact use case, but it happened to me on 24.2.0 with an execution node and ansible runner 1.4.3), the workflow keeps waiting in the running state. If we try to cancel the job/workflow via the UI, we get the error below in the awx-task pod and the job is never cancelled/stopped.

2024-04-23T09:11:30.367001675+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
             ^^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367012067+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
2024-04-23T09:11:30.367015375+02:00     return _call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367023213+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 687, in awx_receptor_workunit_reaper
    receptor_ctl.simple_command(f"work cancel {job.work_unit_id}")
2024-04-23T09:11:30.367031453+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
2024-04-23T09:11:30.367035057+02:00     return self.read_and_parse_json()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-23T09:11:30.367042158+02:00   File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: error cancelling remote unit:  unknown work unit wwXpmxdB

In this PR I simply add a try/except to the for loop and delegate the release to the work unit reaper, where I use the force-release command instead of a plain release.

I think we need to force the release inside the for loop, because administrative_workunit_reaper checks a lot of things on the work unit side, which does not make much sense to me because we already filter by ACTIVE_STATES in the UnifiedJob filter.

If this is correct, I can change it by adding a force-release command on exception, so that we are sure the work units are released when cancel is clicked in the UI.
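A minimal sketch of the fallback proposed above, not the actual diff in this PR. It assumes the loop issues `work cancel` followed by `work release`, and that `work force-release` is the command string for a forced release; only `simple_command` and `work cancel` appear in the traceback. `FakeReceptorControl` and `reap_work_units` are hypothetical names used so the example runs without a receptor socket.

```python
class FakeReceptorControl:
    """Hypothetical stand-in for the receptorctl control object (illustration only)."""

    def __init__(self, known_units):
        self.known_units = set(known_units)
        self.commands = []  # every command string sent, in order

    def simple_command(self, command):
        self.commands.append(command)
        verb, unit_id = command.rsplit(" ", 1)
        # Assumption: cancel/release fail on an unknown unit, force-release does not.
        if verb != "work force-release" and unit_id not in self.known_units:
            raise RuntimeError(f"unknown work unit {unit_id}")
        if verb in ("work release", "work force-release"):
            self.known_units.discard(unit_id)
        return {"unit": unit_id}


def reap_work_units(receptor_ctl, work_unit_ids):
    """Cancel and release each unit; fall back to a forced release on error."""
    forced = []
    for unit_id in work_unit_ids:
        try:
            receptor_ctl.simple_command(f"work cancel {unit_id}")
            receptor_ctl.simple_command(f"work release {unit_id}")
        except RuntimeError:
            # One bad unit must not abort the loop for the remaining jobs.
            receptor_ctl.simple_command(f"work force-release {unit_id}")
            forced.append(unit_id)
    return forced


ctl = FakeReceptorControl(known_units=["good"])
forced = reap_work_units(ctl, ["good", "wwXpmxdB"])
```

With the try/except inside the loop, the unknown unit is force-released and the loop continues, instead of the whole reaper task failing with the `RuntimeError` shown in the traceback.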

ISSUE TYPE
COMPONENT NAME
AWX VERSION
24.2.0
ADDITIONAL INFORMATION
tanganellilore commented 2 weeks ago

Hi @fosterseth, reformatted as per the discussion above.

tanganellilore commented 3 days ago

@fosterseth with an external and unstable execution node (I rebooted it multiple times) this happened one more time, and my last commit should cover all cases.