ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Dispatcher restarts can leave some work units in a state that can't be cleaned #11716

Open AlanCoding opened 2 years ago

AlanCoding commented 2 years ago
ISSUE TYPE
SUMMARY

Sometimes, receptorctl work list shows a work unit belonging to a job that has already finished and been deleted.

This appears to come from a dispatcher restart of some sort, where supervisor sends a SIGTERM signal and, as a result, the finalization code never gets called.

After that, if the job is deleted shortly afterwards, the unit falls through the cracks of our work unit reaping logic.
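To illustrate the SIGTERM part of this in isolation: by default, Python terminates on SIGTERM without running `finally` blocks, so any cleanup placed there is skipped. The sketch below is a generic Python pattern, not the AWX dispatcher's actual code; `Terminated` and `run_with_finalization` are hypothetical names. It converts the signal into an exception so that the finalization step still runs.

```python
import os
import signal


class Terminated(Exception):
    """Raised in the main thread when SIGTERM arrives (hypothetical name)."""


def _raise_terminated(signum, frame):
    # Turn the signal into an ordinary exception so try/finally can see it.
    raise Terminated(f"received signal {signum}")


def run_with_finalization(work, finalize):
    """Run work(), making sure finalize() runs even if SIGTERM arrives."""
    previous = signal.signal(signal.SIGTERM, _raise_terminated)
    try:
        work()
    except Terminated:
        pass  # stop the task, then fall through to finalization
    finally:
        signal.signal(signal.SIGTERM, previous)  # restore the old handler
        finalize()
```

This only helps when the process gets a chance to handle the signal. A SIGKILL, or a SIGTERM delivered to a child process that does not install such a handler, would still leave the unit behind.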

AlanCoding commented 2 years ago

@fosterseth I don't see a concrete resolution to this unless we reap every unit shown in work list that has no corresponding entry in the job list. That might be disruptive to users who launch jobs manually to test things. So let's weigh the options:

The latter option would involve receptor, because I don't see how to identify this right now.

    "qdTvlWqS": {
        "Detail": "Killed",
        "ExtraData": {
            "Expiration": "0001-01-01T00:00:00Z",
            "LocalCancelled": true,
            "LocalReleased": true,
            "RemoteNode": "ec2-184-73-150-160.compute-1.amazonaws.com",
            "RemoteParams": {
                "params": "--private-data-dir=/tmp/awx_1557_5sv5d_do --delete"
            },
            "RemoteStarted": true,
            "RemoteUnitID": "A1V5r1jA",
            "RemoteWorkType": "ansible-runner",
            "SignWork": true,
            "TLSClient": "tls_client"
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 12985,
        "WorkType": "remote"
    },

When I look at that, I can't "prove" that it was launched by AWX.
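One heuristic that might narrow this down, sketched under assumptions: the `params` string in the unit above contains `--private-data-dir=/tmp/awx_1557_5sv5d_do`, which looks like an `awx_<job id>_<suffix>` naming pattern. Assuming that pattern holds for units AWX launches (it is inferred from this one example, and it is a hint rather than proof of origin), a reaper could limit itself to units whose embedded job id no longer exists. The function names and the `known_job_ids` argument are hypothetical, and the input is assumed to be the parsed JSON from the work list.

```python
import re

# Assumption: AWX names the private data dir awx_<job id>_<random suffix>.
_AWX_DIR = re.compile(r"--private-data-dir=\S*/awx_(\d+)_\S+")


def awx_job_id(unit):
    """Return the job id embedded in a unit's params, or None if absent."""
    extra = unit.get("ExtraData")
    if not isinstance(extra, dict):
        return None
    remote_params = extra.get("RemoteParams")
    if not isinstance(remote_params, dict):
        return None
    match = _AWX_DIR.search(remote_params.get("params") or "")
    return int(match.group(1)) if match else None


def orphaned_units(work_list, known_job_ids):
    """Unit IDs that look AWX-launched but whose job no longer exists."""
    orphans = []
    for unit_id, unit in work_list.items():
        job_id = awx_job_id(unit)
        # Units without the naming pattern are left alone on purpose.
        if job_id is not None and job_id not in known_job_ids:
            orphans.append(unit_id)
    return orphans
```

Units that do not match the pattern, such as ones a user launched by hand, are skipped, which would avoid the disruption concern raised earlier. Anyone who reuses the same directory naming manually would still be caught, so this does not replace a real ownership marker from receptor.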