ansible / receptor

Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.

Failure to release work that never started #363

Open AlanCoding opened 3 years ago

AlanCoding commented 3 years ago

I submitted a work unit to a node that doesn't exist. Now I'm trying to get rid of the work unit.

bash-4.4$ receptorctl work list
{'kUipgN1C': {'Detail': 'Running: PID 125',
              'ExtraData': {'Expiration': '0001-01-01T00:00:00Z',
                            'LocalCancelled': True,
                            'LocalReleased': True,
                            'RemoteNode': 'receptor-1',
                            'RemoteParams': {'params': '--private-data-dir=/tmp/pdd_wrapper_52_n9reowbp/awx_52_tc7bf77a'},
                            'RemoteStarted': True,
                            'RemoteUnitID': 'gDGK97a6',
                            'RemoteWorkType': 'ansible-runner',
                            'TLSClient': ''},
              'State': 1,
              'StateName': 'Running',
              'StdoutSize': 1476,
              'WorkType': 'remote'},
 'vr4UKwkz': {'Detail': 'Failed to restart: remote work had not previously '
                        'started',
              'ExtraData': {'Expiration': '0001-01-01T00:00:00Z',
                            'LocalCancelled': True,
                            'LocalReleased': True,
                            'RemoteNode': 'receptor-3',
                            'RemoteParams': {'params': '--private-data-dir=/tmp/pdd_wrapper_48_n0wq8gej/awx_48_p1smt56j'},
                            'RemoteStarted': False,
                            'RemoteUnitID': '',
                            'RemoteWorkType': 'ansible-runner',
                            'TLSClient': ''},
              'State': 3,
              'StateName': 'Failed',
              'StdoutSize': 0,
              'WorkType': 'remote'}}

The unit vr4UKwkz is the one of interest. I can't get rid of it.
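A minimal sketch of how such units could be picked out of the listing, assuming the `work list` output is already available as a Python dict. The helper name `never_started` is hypothetical, and the data is a trimmed copy of the listing above, keeping only the fields used here.

```python
# Trimmed copy of the listing above; only the fields used here are kept.
work_list = {
    "kUipgN1C": {
        "StateName": "Running",
        "ExtraData": {"LocalReleased": True, "RemoteStarted": True,
                      "RemoteUnitID": "gDGK97a6"},
    },
    "vr4UKwkz": {
        "StateName": "Failed",
        "ExtraData": {"LocalReleased": True, "RemoteStarted": False,
                      "RemoteUnitID": ""},
    },
}


def never_started(units):
    """Return IDs of units flagged as released locally whose remote side never started.

    Hypothetical helper for illustration; not part of receptorctl.
    """
    return [
        unit_id
        for unit_id, unit in units.items()
        if unit["ExtraData"]["LocalReleased"]
        and not unit["ExtraData"]["RemoteStarted"]
    ]


stuck = never_started(work_list)
print(stuck)
```

Both units are flagged `LocalReleased`, so the filter relies on `RemoteStarted` to separate the unit that never reached the remote node from the one that did.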

bash-4.4$ receptorctl work cancel vr4UKwkz
vr4UKwkz: ERROR: Expecting value: line 1 column 1 (char 0)
bash-4.4$ receptorctl work release vr4UKwkz
vr4UKwkz: ERROR: Expecting value: line 1 column 1 (char 0)
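The error text matches, word for word, the message Python's `json` module raises when asked to decode an empty string. One possible reading (an assumption, not confirmed anywhere in this thread) is that the client received an empty or non-JSON reply and failed while decoding it. The sketch below only reproduces the message; `parse_reply` is a hypothetical helper, not receptorctl code.

```python
import json


def parse_reply(raw: str):
    """Decode a reply as JSON, returning an error string on failure.

    Hypothetical helper: it mimics the 'ERROR: ...' formatting seen above,
    but it is not taken from receptorctl.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"ERROR: {exc}"


message = parse_reply("")  # an empty reply, assumed for illustration
print(message)
```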

More details...

bash-4.4$ cat /tmp/receptor/awx_1/vr4UKwkz/status
{"State":3,"Detail":"Failed to restart: remote work had not previously started","StdoutSize":0,"WorkType":"remote","ExtraData":{"RemoteNode":"receptor-3","RemoteWorkType":"ansible-runner","RemoteParams":{"params":"--private-data-dir=/tmp/pdd_wrapper_48_n0wq8gej/awx_48_p1smt56j"},"RemoteUnitID":"","RemoteStarted":false,"LocalCancelled":true,"LocalReleased":true,"TLSClient":"","Expiration":"0001-01-01T00:00:00Z"}}

You can confirm for yourself that this data parses as valid JSON.
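One way to check this, as a small sketch: the string below is the content of the status file shown above, split across lines only for readability. The variable names are illustrative.

```python
import json

# Content of /tmp/receptor/awx_1/vr4UKwkz/status as shown above.
status_text = (
    '{"State":3,"Detail":"Failed to restart: remote work had not previously started",'
    '"StdoutSize":0,"WorkType":"remote","ExtraData":{"RemoteNode":"receptor-3",'
    '"RemoteWorkType":"ansible-runner","RemoteParams":{"params":'
    '"--private-data-dir=/tmp/pdd_wrapper_48_n0wq8gej/awx_48_p1smt56j"},'
    '"RemoteUnitID":"","RemoteStarted":false,"LocalCancelled":true,'
    '"LocalReleased":true,"TLSClient":"","Expiration":"0001-01-01T00:00:00Z"}}'
)

status = json.loads(status_text)  # raises json.JSONDecodeError if malformed
extra = status["ExtraData"]

# The fields that describe the stuck state: failed, never started remotely,
# and no remote unit ID was ever assigned.
print(status["State"], extra["RemoteStarted"], repr(extra["RemoteUnitID"]))
```

The file decodes without error, so the "Expecting value" message from `cancel` and `release` does not appear to come from this status file being malformed.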

I'm not 100% sure how I got myself into this state. I think there was some network problem when I tried to submit to the receptor-3 node.

fosterseth commented 3 years ago

I poked around and couldn't really replicate this, so it must be a strange/unexpected state receptor was in.

jangel97 commented 2 years ago

Hi,

We have seen this error in AAP 2.2 in our distributed environment. When it happens, the job runs indefinitely instead of failing, for example on a network timeout. It is curious because the container never even started on the execution node, yet the job appears to run forever with no stdout.

Also, if you run awx-manage run_dispatcher --status, the dispatcher reports the job as running even though it was never started. Querying the work results confirms that it never started:

[root@controller3 jmorenas]# receptorctl --socket /var/run/awx-receptor/receptor.sock work results mzjPtyLC
ERROR: Remote unit failed: Failed to restart: remote work had not previously started

Also, the jobs that ran into this issue could not be cancelled from the GUI. (In case it is useful to others, we cancel them by deleting the job record with the following commands, run on the controller):

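# Activate the AWX virtualenv and open a Django shell
source /var/lib/awx/venv/awx/bin/activate
awx-manage shell_plus

# Inside the shell: replace JOB_ID with the numeric ID of the stuck job
from awx.main.models import UnifiedJob
unified_job_obj = UnifiedJob()
unified_job_obj.id = JOB_ID
unified_job_obj.delete()
quit()

# Back in bash: leave the virtualenv
deactivate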

Thoughts? If there is a network issue and receptor can't start the container on the execution node, shouldn't the job fail with a log trace instead of appearing to run forever?