StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

`st2ctl status` should report nginx, MongoDB, PostgreSQL & RabbitMQ status #3779

Open LindsayHill opened 7 years ago

LindsayHill commented 7 years ago

st2ctl status reports on the status of st2 processes. It does not report anything about nginx, MongoDB, PostgreSQL or RabbitMQ.

This can be confusing for users who are experiencing issues with those services. They run st2ctl status, and it appears to report that everything is working, when in reality RabbitMQ is broken because they ran out of disk space. This wastes everyone's time.

The challenge here is that those dependencies may be running on a separate system.

Perhaps we should add a check that makes a test connection to each of those services (based on their definitions in /etc/st2/st2.conf) and reports the results?
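To make the idea concrete, here is a minimal sketch of such a check, not an existing st2ctl feature. It assumes st2.conf keeps the MongoDB address under `[database]` (`host`, `port`) and the RabbitMQ address in `[messaging]` `url`; the function names are hypothetical. Note that a bare TCP connect only shows that the port is reachable, so it would likely miss a broker that still accepts connections but refuses to do work.

```python
import configparser
import socket
from urllib.parse import urlsplit


def endpoints_from_config(conf_text):
    """Extract (name, host, port) tuples from st2.conf-style text.

    Assumed layout (verify against your own config):
    [database] host/port for MongoDB, [messaging] url for RabbitMQ.
    """
    # Disable interpolation so a '%' in a password does not break parsing.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    endpoints = []
    if parser.has_section("database"):
        host = parser.get("database", "host", fallback="127.0.0.1")
        port = parser.getint("database", "port", fallback=27017)
        endpoints.append(("mongodb", host, port))
    if parser.has_option("messaging", "url"):
        url = urlsplit(parser.get("messaging", "url"))
        endpoints.append(
            ("rabbitmq", url.hostname or "127.0.0.1", url.port or 5672)
        )
    return endpoints


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Reading the file and looping over `endpoints_from_config(text)` with `can_connect(host, port)` would give one reachable/unreachable line per service, including services that live on a separate system.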

LindsayHill commented 7 years ago

@armab do you have any alternative ideas on how we can help users identify where the problems lie with their ST2 systems when RabbitMQ/PostgreSQL/MongoDB is broken, and st2ctl status says "everything is fine, nothing to see here"?

vincent-legoll commented 7 years ago

Just a small bit of (hopefully) relevant info:

On our st2 service we sometimes have to restart the mongod service, so I add my +1 to this.

And I can add that a (slightly) more thorough check (not only PID / not-running status) would be good to have, like doing a test connection to each service endpoint.

LindsayHill commented 7 years ago

@vincent-legoll yes, I would like to have a /health API endpoint, or something like that.
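As a rough illustration of what a /health endpoint could look like, here is a standalone sketch using only the Python standard library. It is not StackStorm's API: the handler, the JSON shape, and the `serve` helper are all assumptions made for the example, and each check is just a callable that returns True or False.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_health(checks):
    """Run each named check and return (http_status, payload)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = "error: %s" % exc
    healthy = all(value == "ok" for value in results.values())
    status = 200 if healthy else 503
    return status, {"healthy": healthy, "services": results}


def make_handler(checks):
    """Build a request handler class bound to the given checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, payload = build_health(checks)
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return HealthHandler


def serve(checks, port):
    """Serve /health on localhost until interrupted (hypothetical helper)."""
    HTTPServer(("127.0.0.1", port), make_handler(checks)).serve_forever()
```

Returning 503 when any check fails lets a load balancer or monitoring probe react without parsing the body, while the JSON still tells a human which backend is at fault.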

Do you know why mongod needs restarting? We used to see some issues in the Mongo 2.x timeframe, but haven't heard many reports like that since 3.2/3.4.

vincent-legoll commented 7 years ago

@LindsayHill a health API would be great.

And no, I still haven't investigated the MongoDB issue much, as I wasn't sure that restarting only mongod would be sufficient (previously I fully restarted st2 before restarting mongod).

db version v3.2.17
git version: 186656d79574f7dfe0831a7e7821292ab380f667
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64

I'll do more in-depth debugging next time

arm4b commented 6 years ago

I'm 👍 on the feature of st2ctl showing that MongoDB & RabbitMQ are accessible and in a working state from the StackStorm core perspective. E.g., there should be a Python check from the st2 service's point of view.

I'm still 👎 on PostgreSQL and 👎 on nginx, since st2mistral might not be installed at all, and nginx is not a strict requirement either: it may be replaced by any other proxy server, or the 9101 ports may be used directly.


We could start with MongoDB + RabbitMQ checks, since those are the core required services, and solve the most common issues from the user's perspective: the cases when these 2 backend services are not available for some reason.
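One way to express that split between core and optional services is sketched below. It only illustrates the reporting logic: the output format and the `summarize` name are hypothetical, not what st2ctl prints. Only a failed core service makes the exit code non-zero, and a core service missing from the results is treated as failed.

```python
CORE_SERVICES = ("mongodb", "rabbitmq")  # assumed required set, per this thread


def summarize(results, core=CORE_SERVICES):
    """Turn {service: True/False} into report lines and an exit code."""
    lines = []
    exit_code = 0
    # Core services first, in a fixed order; absent entries count as failed.
    for name in core:
        if results.get(name):
            lines.append("%s: OK" % name)
        else:
            lines.append("%s: FAILED (required)" % name)
            exit_code = 1
    # Anything else (e.g. nginx, PostgreSQL) is only reported as a warning.
    for name in sorted(set(results) - set(core)):
        state = "OK" if results[name] else "WARNING (optional)"
        lines.append("%s: %s" % (name, state))
    return lines, exit_code
```

With this shape, an installation without st2mistral or nginx simply omits those keys, and the status command still fails loudly when MongoDB or RabbitMQ is down.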