Open AlanCoding opened 2 years ago
@fosterseth I don't see a concrete resolution to this unless we reap all units shown in the work list that aren't in the job list. This might be disruptive to users who launch jobs manually to test things. So let's weigh the options:
The latter option would involve receptor, because I don't see how to identify this right now.
    "qdTvlWqS": {
        "Detail": "Killed",
        "ExtraData": {
            "Expiration": "0001-01-01T00:00:00Z",
            "LocalCancelled": true,
            "LocalReleased": true,
            "RemoteNode": "ec2-184-73-150-160.compute-1.amazonaws.com",
            "RemoteParams": {
                "params": "--private-data-dir=/tmp/awx_1557_5sv5d_do --delete"
            },
            "RemoteStarted": true,
            "RemoteUnitID": "A1V5r1jA",
            "RemoteWorkType": "ansible-runner",
            "SignWork": true,
            "TLSClient": "tls_client"
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 12985,
        "WorkType": "remote"
    },
When I look at that, I can't "prove" that it was launched by AWX.
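For what it's worth, the only AWX-looking hint in that entry is the `awx_1557_...` private data directory in `RemoteParams`. Below is a rough sketch of a heuristic that pulls a candidate job id out of that string. The directory naming pattern (`awx_<job id>_<suffix>`) is an assumption inferred from this single example, and `guess_awx_job_id` is a hypothetical helper, not something in AWX or receptor. A match is a hint, not proof, since anyone could launch a unit with a similar path.

```python
import re

# Assumption: the private data dir looks like <dir>/awx_<job id>_<suffix>,
# inferred only from the single work unit shown above.
AWX_DIR_RE = re.compile(r"--private-data-dir=\S*/awx_(\d+)_\S+")


def guess_awx_job_id(unit):
    """Return a candidate AWX job id for a work unit entry, or None.

    `unit` is one value from the parsed work list JSON (hypothetical helper).
    """
    extra = unit.get("ExtraData") or {}
    remote_params = extra.get("RemoteParams") or {}
    match = AWX_DIR_RE.search(remote_params.get("params", ""))
    return int(match.group(1)) if match else None


sample_unit = {
    "ExtraData": {
        "RemoteParams": {
            "params": "--private-data-dir=/tmp/awx_1557_5sv5d_do --delete"
        },
        "RemoteWorkType": "ansible-runner",
    },
    "StateName": "Failed",
    "WorkType": "remote",
}

print(guess_awx_job_id(sample_unit))
```

A unit launched by hand with different params would simply return `None` here, which is the case that makes a blanket reap risky.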
ISSUE TYPE
SUMMARY
Sometimes, `receptorctl work list` will show a unit from a job that has finished and been deleted. This appears to come from a dispatcher restart of some sort, where supervisor sends a SIGTERM signal, with the result that the finalization code does not get called.
After that, if the job is deleted in relatively short order, then it falls through the cracks of our work unit reaping logic.
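To make the gap concrete, here is a minimal sketch of the comparison a reaper would need: take the parsed JSON from the work list (a mapping keyed by unit id) and subtract the unit ids still referenced by existing jobs. `find_orphaned_units` and `known_unit_ids` are hypothetical names for illustration; how AWX would actually collect the known ids is not shown and is an assumption.

```python
def find_orphaned_units(work_list, known_unit_ids):
    """Return unit ids present in the work list but not tied to any known job.

    work_list: dict parsed from the work list JSON, keyed by unit id.
    known_unit_ids: iterable of unit ids still referenced by existing jobs
                    (hypothetical input; gathering it is out of scope here).
    """
    known = set(known_unit_ids)
    return sorted(unit_id for unit_id in work_list if unit_id not in known)


# Example data: one unit still belongs to a job, one was left behind.
work_list = {
    "qdTvlWqS": {"StateName": "Failed", "WorkType": "remote"},
    "abcd1234": {"StateName": "Running", "WorkType": "remote"},
}
known_unit_ids = ["abcd1234"]

orphans = find_orphaned_units(work_list, known_unit_ids)
print(orphans)
```

Each id returned would be a candidate for `receptorctl work release`, but this comparison alone cannot tell a leftover AWX unit from one a user launched manually.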