Open kurokobo opened 1 year ago
The actual case where this issue can cause a problem:
In AWX, a job template was invoked on an Execution Node; in Receptor terms, ansible-runner worker
was invoked as command work on the executor node.
While the job is running, the Execution Node is restarted for some reason such as virtualization host down, power outage, etc.
In this case, the launched job in AWX stays in the running state until the job timeout, even though Ansible Runner is already down and the work is orphaned. AWX has no way to know that the job will never complete.
Description
If a remote worker node fails and is restarted, its running works remain in the "Running" state forever and are never marked as "Completed" or "Failed".
If the worker process that should have been running no longer exists after the node restarts, shouldn't the work be marked as "Failed"?
Version
Using upstream devel image from quay.io/ansible/receptor:devel
Steps to reproduce issue
Prepare files
foo.yml
bar.yml
docker-compose.yml
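The original attachments for foo.yml and bar.yml are not reproduced here; a minimal pair of Receptor configs for this scenario might look like the following (the node names, port, and the sleep work type are assumptions for illustration):

```yaml
# foo.yml: controller node exposing a control socket for receptorctl
- node:
    id: foo
- control-service:
    service: control
    filename: /tmp/foo.sock
- tcp-listener:
    port: 2222
```

```yaml
# bar.yml: executor node registering a long-running command work type
- node:
    id: bar
- tcp-peer:
    address: foo:2222
- control-service:
    service: control
- work-command:
    worktype: sleep
    command: sleep
    params: "300"
```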
Prepare environment
Submit work
Restart executor node to simulate node failure
Ensure the work is still in the running state and is never marked as completed or failed
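The submit/restart/observe steps above could be run roughly as follows, assuming a controller node named foo with its control socket at /tmp/foo.sock, an executor node/container named bar, and a registered work type named sleep (all of these names are assumptions; adjust to your docker-compose setup):

```shell
# Bring up the controller and executor nodes.
docker compose up -d

# Submit a long-running command work unit to the executor node.
receptorctl --socket /tmp/foo.sock work submit sleep --node bar --no-payload

# Simulate a node failure while the work is running.
docker compose restart bar

# The work unit stays in the Running state indefinitely.
receptorctl --socket /tmp/foo.sock work list
```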
Additional information
With the current implementation, the sender of the work waits forever for the completion of work that will never end.
It seems that on node failure, the goroutine that monitors the worker process is terminated along with the rest of the node's processes, so the work unit is orphaned forever.
I think it would be more natural for a work unit to be marked as failed if, after Receptor restarts, there is no worker process referencing its existing unit directory.
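The proposed startup check could be sketched as follows. This is purely illustrative: it assumes each unit directory records its worker PID in a file named pid, which is not Receptor's actual on-disk status format.

```shell
# Hypothetical layout: $datadir/<unit-id>/pid holds the worker PID.
datadir=$(mktemp -d)

# Fixture: one unit whose worker is alive (this shell itself),
# one unit whose recorded PID belongs to no running process we can signal.
mkdir -p "$datadir/alive" "$datadir/dead"
echo "$$" > "$datadir/alive/pid"
deadpid=99999
while kill -0 "$deadpid" 2>/dev/null; do deadpid=$((deadpid + 1)); done
echo "$deadpid" > "$datadir/dead/pid"

# Startup scan: "kill -0" only probes liveness, it sends no signal.
# Any unit whose worker process is gone gets marked as Failed.
for unit in "$datadir"/*/; do
    pid=$(cat "$unit/pid")
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "Failed" > "$unit/status"
        echo "marked $(basename "$unit") as Failed"
    fi
done
```

Receptor would run an equivalent scan over its data directory once on startup, so a restarted node reports orphaned works as failed instead of leaving the sender waiting forever.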