StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Killing action runner process should set running actions state to "abandoned" and not "failed" #3449

Open lakshmi-kannan opened 7 years ago

lakshmi-kannan commented 7 years ago

On systemd hosts such as U16 and EL7, running something like

st2 run core.local cmd="sleep 180; echo covfefe" timeout=300 -a; sleep 30; sudo systemctl stop st2actionrunner; sleep 10; st2 execution list

shows the action status as `failed` and not `abandoned`.

NOTE: This does not always happen, so I believe there is a race of some sort on U16.

m4dcoder commented 7 years ago

The liveaction has already failed as a result of shutdown by the time the abandon process runs.

m4dcoder commented 7 years ago

The logic here https://github.com/StackStorm/st2/blob/master/contrib/runners/local_runner/local_runner.py#L177 for the local runner was triggered on u16 before the parent worker process marked it as abandoned.

m4dcoder commented 7 years ago

So, on u16, it looks like SIGTERM is sent to the subprocess first, and the local runner treats the resulting exit as a failed status. Maybe the fix would be to set the state to ABANDONED at https://github.com/StackStorm/st2/blob/master/contrib/runners/local_runner/local_runner.py#L177 when the subprocess return code indicates SIGTERM (that is, -15 per https://docs.python.org/2/library/subprocess.html?highlight=returncode#subprocess.Popen.returncode, since SIGTERM is signal 15).
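A rough sketch of the check suggested above, not the actual runner code. The status strings and the `status_from_returncode` and `run_and_terminate` helpers are hypothetical names used only for illustration. The one thing it relies on is the documented POSIX behaviour that `Popen.returncode` is `-N` when the child was killed by signal `N`.

```python
import signal
import subprocess
import sys

# Illustrative status strings (assumptions, not taken from the st2 source).
STATUS_SUCCEEDED = "succeeded"
STATUS_FAILED = "failed"
STATUS_ABANDONED = "abandoned"


def status_from_returncode(returncode):
    """Hypothetical helper: map a Popen.returncode to an execution status."""
    if returncode == 0:
        return STATUS_SUCCEEDED
    # On POSIX, a negative return code -N means the child died from signal N.
    if returncode == -signal.SIGTERM:
        return STATUS_ABANDONED
    return STATUS_FAILED


def run_and_terminate():
    """Start a long-running child, send it SIGTERM, and classify the result."""
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    proc.terminate()  # sends SIGTERM on POSIX
    proc.wait()
    return status_from_returncode(proc.returncode)
```

One caveat with this approach: a return code of -15 only says the subprocess received SIGTERM, not who sent it, so a user-initiated kill of the child would also be classified as abandoned.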

m4dcoder commented 7 years ago

Partial fix at https://github.com/StackStorm/st2/pull/3457