StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Single command in the action is terminating the service #3752

Closed sibirajal closed 6 years ago

sibirajal commented 7 years ago

Hello Team,

I have created two actions in my workflow: one checks the ntpd service status, and the other starts the service if it is not running.

It appears that when the restart action contains only the single command "/sbin/service ntpd restart", the service is terminated abruptly. If I add a few more commands after the restart command, there is no problem with the action.

Can you please take a look at this bug and provide a fix?

Action for restarting the service:

$ cat ntp_admin_restart.yaml

---
description: 'Restart ntpd on a server'
enabled: true
entry_point: scripts/ntp_restart.sh
name: ntp_restart
runner_type: remote-shell-script
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  dir:
    default: "/home/admin/"
    immutable: true
  cwd:
    default: "/home/admin/"
    immutable: true
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
  passphrase:
    default: "{{ st2kv.admin_passphrase | decrypt_kv}}"
    type: string
    required: true
  private_key: 
    default: "/home/admin/.ssh/id_rsa"
    required: true
  username: 
    default: "admin" 
    required: true

Script for the above action:

$ cat scripts/ntp_restart.sh
#!/bin/bash
/etc/init.d/ntpd start
exit 0

Resulting service status:

# service ntpd status
ntpd dead but subsys locked

Action for checking the service status: ntp_check.yaml

---
description: 'Check ntp on a server'
enabled: true
entry_point: scripts/check_service.sh
name: ntp_check
runner_type: remote-shell-script
parameters:
  action:
    description: "Run as an action.  (Outputs structured data)"
    default: true
    immutable: true
    type: boolean
  dir:
    default: "/home/admin/"
    immutable: true
  cwd:
    default: "/home/admin/"
    immutable: true
  debug:
    description: "Turn on debug output"
    default: false
    type: boolean
  sudo:
    default: true
    immutable: true
  passphrase:
    default: "{{ st2kv.admin_passphrase | decrypt_kv}}"
    type: string
    required: true
  private_key: 
    default: "/home/admin/.ssh/id_rsa"
    required: true
  username: 
    default: "admin" 
    required: true

Script for the above action (scripts/check_service.sh):

# Check for a running ntpd process; exit non-zero if none is found
if ! pgrep ntpd >/dev/null 2>&1; then
    exit 1
fi

cat workflows/dmin_ntp_workflow.yaml

---
version: '2.0'

ops.admin_ntp_procs_workflow:
  type: direct
  input:
    - hostname
  tasks:
    verify:
      action: ops.ntp_check 
      input:
        hosts: <% $.hostname %>
      on-error:
        - remediate
    remediate:
      action: ops.ntp_restart
      input:
        hosts: <% $.hostname %>
      on-complete:
        - postcheck
    postcheck:
      action: ops.ntp_check
      input:
        hosts: <% $.hostname %>
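The control flow of this workflow (verify, remediate on error, then postcheck) can be sketched as a plain shell function, which may help when debugging the two scripts outside StackStorm. This is only an illustration: `run_remediation` and its arguments are hypothetical names, not part of StackStorm or Mistral.

```shell
#!/bin/bash
# Hypothetical local equivalent of the verify -> remediate -> postcheck flow.
# $1: command that exits 0 when the service is healthy
# $2: command that tries to fix the service
run_remediation() {
    local check_cmd=$1 fix_cmd=$2
    local rc=0

    # verify: nothing more to do if the check already passes
    if $check_cmd; then
        echo "verify: ok"
        return 0
    fi
    echo "verify: failed"

    # remediate: runs only after a failed verify (on-error)
    $fix_cmd || rc=$?
    echo "remediate: rc=$rc"

    # postcheck: runs whatever the remediate result was (on-complete)
    if $check_cmd; then
        echo "postcheck: ok"
        return 0
    fi
    echo "postcheck: failed"
    return 1
}
```

For example, `run_remediation "pgrep ntpd" "/sbin/service ntpd start"` run as root on the target host follows the same three steps as the workflow.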

Result of the workflow execution:

+--------------------------+------------------------+-----------+-----------------+-------------------------------+
| id                       | status                 | task      | action          | start_timestamp               |
+--------------------------+------------------------+-----------+-----------------+-------------------------------+
| 59bfa6b42d0549041063f6f0 | failed (4s elapsed)    | verify    | ops.ntp_check   | Mon, 18 Sep 2017 10:57:56 UTC |
| 59bfa6b82d0549041063f6f2 | succeeded (9s elapsed) | remediate | ops.ntp_restart | Mon, 18 Sep 2017 10:58:00 UTC |
| 59bfa6c12d0549041063f6f4 | failed (4s elapsed)    | postcheck | ops.ntp_check   | Mon, 18 Sep 2017 10:58:09 UTC |
+--------------------------+------------------------+-----------+-----------------+-------------------------------+

arm4b commented 7 years ago

What's your OS? Can you run the same bash scripts manually and then try running via st2 and compare output/result? Is there a difference in output/rc between manual run and st2 run?
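One way to make that comparison concrete is to save the output and exit code of each run to files and diff them. A minimal sketch; `record_run` and the `<label>.out` / `<label>.rc` file names are hypothetical, not something StackStorm provides:

```shell
#!/bin/bash
# Run a command, saving combined stdout/stderr and the exit code under a label.
# The label and file layout are assumptions made for this sketch.
record_run() {
    local label=$1
    shift
    local rc=0
    "$@" >"${label}.out" 2>&1 || rc=$?
    echo "$rc" >"${label}.rc"
}
```

For example, `record_run manual sudo service ntpd start` on the target machine leaves `manual.out` and `manual.rc`, which can then be compared against the `stdout` and `return_code` fields of the st2 execution.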

I suspect this is a generic error related to the ntpd service itself; a search turns up a number of related issues.

Please also check your server logs for more detailed error messages explaining why ntpd failed.

sibirajal commented 7 years ago

I am running the ntpd start command on CentOS 6.7.

I've tried the same service start with other services and the problem is the same.

The issue goes away if I add a few more commands after the service start command, as below:

#!/bin/bash
/sbin/service exim start 
/sbin/service exim status
exit 0

This can always be reproduced with any service command in the action script.
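The workaround above can be made explicit by polling the service status for a few seconds after the start command instead of exiting immediately. This is a sketch only: the thread does not establish why the extra command helps, and `start_and_wait` with its arguments is a hypothetical helper.

```shell
#!/bin/bash
# Start a service, then poll its status until it reports running or we give up.
# $1: start command, $2: status command, $3: max status checks (default 5)
start_and_wait() {
    local start_cmd=$1 status_cmd=$2 attempts=${3:-5}
    local i

    # Give up immediately if the start command itself fails
    $start_cmd || return 1

    for ((i = 1; i <= attempts; i++)); do
        if $status_cmd >/dev/null 2>&1; then
            echo "running after $i check(s)"
            return 0
        fi
        sleep 1
    done
    echo "not running after $attempts check(s)" >&2
    return 1
}
```

In an action script this could be called as `start_and_wait "/sbin/service exim start" "/sbin/service exim status"`, so the action fails when the service does not stay up.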

arm4b commented 7 years ago

So if you run service start from StackStorm, does it crash the target process/service? Or are you saying that the action performing the service start reports a non-zero exit status code?

Could you run the following?

A) On the target machine:

sudo service ntpd start; echo $?
su -c 'sudo service ntpd start; echo $?' admin

B) The same command via StackStorm itself:

st2 run core.remote_sudo cmd='service ntpd start' private_key=/home/admin/.ssh/id_rsa username=admin passphrase=replace_with_your_passphrase hosts=replace_with_your_remote_host

Please post the output of both here so we can understand the issue better.

sibirajal commented 7 years ago

The problem appears to be with the remote shell actions: the service starts when I run the command manually, but not when I run it via the StackStorm remote_sudo action. However, the same start command works when I add a subsequent command after it.

It looks to me like a bug in the remote command execution action, specific to starting a service.

Manual method:

$ sudo service ntpd start; echo $?
Starting ntpd:                                             [  OK  ]
0
$ sudo service ntpd status
ntpd (pid  28409) is running...

Stackstorm method:

$ st2 run core.remote_sudo cmd='/sbin/service ntpd start' private_key=/home/admin/.ssh/id_rsa username=admin passphrase="{{ st2kv.system.admin_passphrase | decrypt_kv}}" hosts='172.16.15.4' 
.
id: 59bfcb3a2d0549041063f74b
status: succeeded
parameters: 
  cmd: /sbin/service ntpd start
  hosts: 172.16.15.4
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  172.16.15.4:
    failed: false
    return_code: 0
    stderr: ''
    stdout: "Starting ntpd: \e[60G[\e[0;32m  OK  \e[0;39m]\r"
    succeeded: true
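The `\e[60G` and `\e[0;32m` sequences in stdout are the terminal cursor and color codes that the init script prints around `[  OK  ]`. When comparing runs, it can help to strip them first. A small sketch, where `strip_ansi` is a hypothetical helper name:

```shell
#!/bin/bash
# Remove ANSI CSI escape sequences (colors, cursor moves) and carriage returns
# from text read on stdin.
strip_ansi() {
    local esc=$'\033'
    sed "s/${esc}\[[0-9;]*[A-Za-z]//g" | tr -d '\r'
}
```

It is meant to be applied to raw command output, for example `/sbin/service ntpd start | strip_ansi`.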

Adding another subsequent command works:

$ st2 run core.remote_sudo cmd='/sbin/service ntpd start;/sbin/service ntpd status' private_key=/home/admin/.ssh/id_rsa username=admin passphrase="{{ st2kv.system.admin_passphrase | decrypt_kv}}" hosts='172.16.15.4' 
.
id: 59bfcbc92d0549041063f757
status: succeeded
parameters: 
  cmd: /sbin/service ntpd start;/sbin/service ntpd status
  hosts: 172.16.15.4
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  172.16.15.4:
    failed: false
    return_code: 0
    stderr: ''
    stdout: "Starting ntpd: \e[60G[\e[0;32m  OK  \e[0;39m]
ntpd (pid  808) is running..."
    succeeded: true

arm4b commented 7 years ago

To try to replicate this, I created two CentOS 6.9 Vagrant VMs: one with the currently latest st2 2.4.1 installation, and another as the target for testing the remote_sudo command.

I created admin users on both the st2 machine and the remote machine, following the instructions at https://docs.stackstorm.com/install/deb.html#configure-ssh-and-sudo for configuring passwordless sudo and removing requiretty. I also encrypted the private key with a passphrase to replicate your setup.
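Whether requiretty is still active on a target can be checked with a quick grep. A minimal sketch, assuming the setting lives in `/etc/sudoers` (files under `/etc/sudoers.d/` would need the same check); `has_requiretty` is a hypothetical helper name:

```shell
#!/bin/bash
# Exit 0 if a sudoers-style file contains an active "Defaults requiretty" line.
# Commented-out lines and "Defaults !requiretty" do not match.
has_requiretty() {
    local file=${1:-/etc/sudoers}
    grep -Eq '^[[:space:]]*Defaults[[:space:]]+requiretty' "$file"
}
```

Run as root on the remote machine, `has_requiretty && echo "requiretty is still set"` reports whether the line needs to be removed.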

Now running remote_sudo commands on target CentOS6 box:

ntpd stop with remote_sudo

$ st2 run core.remote_sudo cmd='/sbin/service ntpd stop' hosts=192.168.10.130 private_key=/home/admin/.ssh/admin_rsa username=admin passphrase=123456
.
id: 59c5305d5d698e0d8b30fbb9
status: succeeded
parameters: 
  cmd: /sbin/service ntpd stop
  hosts: 192.168.10.130
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  192.168.10.130:
    failed: false
    return_code: 0
    stderr: ''
    stdout: "Shutting down ntpd: \e[60G[\e[0;32m  OK  \e[0;39m]\r"
    succeeded: true

ntpd status with remote_sudo

$ st2 run core.remote_sudo cmd='/sbin/service ntpd status' hosts=192.168.10.130 private_key=/home/admin/.ssh/admin_rsa username=admin passphrase=123456
.
id: 59c530645d698e0d8b30fbbc
status: failed
parameters: 
  cmd: /sbin/service ntpd status
  hosts: 192.168.10.130
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  192.168.10.130:
    failed: true
    return_code: 3
    stderr: ''
    stdout: ntpd is stopped
    succeeded: false
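The execution is marked failed here only because `service ntpd status` exits with a non-zero code for a stopped service. In the LSB init script convention, 3 means "not running", and 2 is the "dead but subsys locked" case reported earlier in the thread. A small lookup sketch, with `describe_status_rc` as a hypothetical helper name:

```shell
#!/bin/bash
# Translate an init-script "status" exit code into a short description.
# Codes 0-4 follow the LSB init script convention.
describe_status_rc() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, but pid file exists" ;;
        2) echo "dead, but lock file exists" ;;
        3) echo "not running" ;;
        4) echo "status unknown" ;;
        *) echo "unrecognized code: $1" ;;
    esac
}
```

For example, `describe_status_rc 3` prints `not running`, matching the `ntpd is stopped` output above.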

ntpd start when service is stopped

$ st2 run core.remote_sudo cmd='/sbin/service ntpd start' hosts=192.168.10.130 private_key=/home/admin/.ssh/admin_rsa username=admin passphrase=123456
.
id: 59c530985d698e0d8b30fbbf
status: succeeded
parameters: 
  cmd: /sbin/service ntpd start
  hosts: 192.168.10.130
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  192.168.10.130:
    failed: false
    return_code: 0
    stderr: ''
    stdout: "Starting ntpd: \e[60G[\e[0;32m  OK  \e[0;39m]\r"
    succeeded: true

ntpd status after the start command above now reports a running ntpd service

$ st2 run core.remote_sudo cmd='/sbin/service ntpd status' hosts=192.168.10.130 private_key=/home/admin/.ssh/admin_rsa username=admin passphrase=123456
.
id: 59c530ac5d698e0d8b30fbc2
status: succeeded
parameters: 
  cmd: /sbin/service ntpd status
  hosts: 192.168.10.130
  passphrase: '********'
  private_key: '********'
  username: admin
result: 
  192.168.10.130:
    failed: false
    return_code: 0
    stderr: ''
    stdout: ntpd (pid  8742) is running...
    succeeded: true

So in my setup ntpd is running fine after doing service start.

/var/log/messages during the execution:

Sep 22 15:57:27 localhost ntpd[8742]: ntpd exiting on signal 15
Sep 22 15:57:53 localhost ntpd[8911]: ntpd 4.2.6p5@1.2349-o Mon Feb  6 07:22:46 UTC 2017 (1)
Sep 22 15:57:53 localhost ntpd[8912]: proto: precision = 0.178 usec
Sep 22 15:57:53 localhost ntpd[8912]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Sep 22 15:57:53 localhost ntpd[8912]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen and drop on 1 v6wildcard :: UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 2 lo 127.0.0.1 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 3 eth0 10.0.2.15 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 4 eth1 192.168.10.130 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 5 eth1 fe80::a00:27ff:fe32:747b UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 6 lo ::1 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listen normally on 7 eth0 fe80::5054:ff:fe1c:c046 UDP 123
Sep 22 15:57:53 localhost ntpd[8912]: Listening on routing socket on fd #24 for interface updates
Sep 22 15:57:53 localhost ntpd[8912]: 0.0.0.0 c016 06 restart
Sep 22 15:57:53 localhost ntpd[8912]: 0.0.0.0 c012 02 freq_set kernel 11.841 PPM
Sep 22 15:58:00 localhost ntpd[8912]: 0.0.0.0 c615 05 clock_sync
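When checking logs like these on an affected host, the lines of interest are the start banner and any "exiting on signal" message (signal 15 is SIGTERM). A sketch that filters for just those, assuming the same syslog layout as above; `ntpd_events` is a hypothetical helper name:

```shell
#!/bin/bash
# Print ntpd lifecycle lines (version banner on start, exit on signal)
# from a syslog-style file. The default path is the one used in this thread.
ntpd_events() {
    local log=${1:-/var/log/messages}
    grep -E 'ntpd\[[0-9]+\]: (ntpd exiting on signal|ntpd [0-9])' "$log"
}
```

An "exiting on signal" line within a second or two of a start banner would suggest that ntpd is being killed from outside rather than failing on its own.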

ntpd version:

$ ntpd --version
ntpd 4.2.6p5
ntpd 4.2.6p5@1.2349-o Mon Feb  6 07:22:46 UTC 2017 (1)

CentOS6 version:

$ cat /etc/centos-release 
CentOS release 6.9 (Final)

StackStorm version:

$ st2 --version
st2 2.4.1

In my setup I couldn't reproduce your issue: I could start services on another node with remote_sudo, and they kept running without crashing. StackStorm here shouldn't behave any differently from running a command via ssh on a remote host. On the other hand, older systems like CentOS 6.7 might have their own bugs, or you might have some non-standard OS/box configuration that affects remote execution in a strange way.

So if you can debug deeper and find something more interesting in the logs or setup, please share it with us.
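To test the plain-ssh comparison directly, the same command can be run over ssh with and without a pseudo-terminal, since that is one variable that can differ between an interactive login and an automated session. Whether it plays any role in this issue is an assumption, not something the thread confirms. The sketch below only prints the two commands; `ssh_variants` is a hypothetical helper and the values are placeholders.

```shell
#!/bin/bash
# Print (not run) two plain-ssh variants of the same remote sudo command:
# one with pseudo-terminal allocation disabled (-T), one forcing it (-tt).
# $1: user, $2: host, $3: private key path, $4: command to run under sudo
ssh_variants() {
    local user=$1 host=$2 key=$3 cmd=$4
    printf "ssh -T -i %s %s@%s 'sudo %s'\n" "$key" "$user" "$host" "$cmd"
    printf "ssh -tt -i %s %s@%s 'sudo %s'\n" "$key" "$user" "$host" "$cmd"
}
```

Running each printed command from the st2 host and then checking `service ntpd status` on the target would show whether the service survives in both cases.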

arm4b commented 6 years ago

Closing this.

Please re-open the issue if you experience the same behavior in the future and have more detailed information.