AlanCoding opened 3 years ago
I poked around and couldn't really replicate this, so it must have been some strange/unexpected state receptor was in.
Hi,
We have seen this error in AAP 2.2 in our distributed environment. When this happens, the job runs indefinitely (instead of failing due to, e.g., a network timeout). Curiously, the container never even started on the execution node, yet you can see the job running forever with no stdout.
Also, if you run the command awx-manage run_dispatcher --status
you can see the dispatcher reporting that the job is being run, even though it never started.
When you query for the work results, you can see it never started:
[root@controller3 jmorenas]# receptorctl --socket /var/run/awx-receptor/receptor.sock work results mzjPtyLC
ERROR: Remote unit failed: Failed to restart: remote work had not previously started
Also, the jobs that ran into this issue could not be canceled from the GUI. (In case it's useful for others, we canceled them by running the following commands on a controller):
source /var/lib/awx/venv/awx/bin/activate
awx-manage shell_plus
# inside the shell_plus session:
from awx.main.models import UnifiedJob
unified_job_obj = UnifiedJob()
unified_job_obj.id = JOB_ID  # replace JOB_ID with the stuck job's id
unified_job_obj.delete()
quit()
deactivate
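On the receptor side, a stuck work unit can usually be dropped with receptorctl's release subcommands. A sketch, not verified against this deployment: `work release` and `work force-release` exist in recent receptor versions (check `receptorctl work --help` on yours), and the unit ID below is the one from the error above.

```shell
# Ask receptor to clean up the stuck unit on the controller.
receptorctl --socket /var/run/awx-receptor/receptor.sock work release mzjPtyLC
# If release hangs because the remote node is unreachable,
# force-release skips contacting the remote side.
receptorctl --socket /var/run/awx-receptor/receptor.sock work force-release mzjPtyLC
```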
Thoughts? If there is some network issue and receptor can't start the container on the execution node, shouldn't the job fail with some log trace (instead of hanging forever doing nothing)?
I submitted a work unit to a node that doesn't exist. Now I'm trying to get rid of the work unit.
The unit vr4UKwkz is the one of interest. I can't get rid of it.
You can confirm on your own; this data is JSON-parseable.
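Since the output is JSON, a short script can pick out work units that never reached a terminal state. A minimal sketch: the sample payload and the StateName values are assumptions modeled on what `receptorctl work list` emits, not copied from a real deployment.

```python
import json

# Hypothetical sample of `receptorctl work list` output; the field
# names (StateName, Detail) are assumptions -- verify against your
# own receptor version.
sample = """
{
  "vr4UKwkz": {"StateName": "Pending", "Detail": "Connection to remote node lost"},
  "AbCdEfGh": {"StateName": "Succeeded", "Detail": "exit status 0"}
}
"""

def stuck_units(raw):
    """Return IDs of work units not in a terminal state."""
    work = json.loads(raw)
    return [uid for uid, info in work.items()
            if info.get("StateName") not in ("Succeeded", "Failed")]

print(stuck_units(sample))  # → ['vr4UKwkz']
```

Piping `receptorctl ... work list` into a filter like this makes it easy to spot units that were submitted but never started.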
I'm not 100% sure how I got myself into this state. I think there was some network problem when I tried to submit to the receptor-3 node.