Open LindsayHill opened 7 years ago
@armab do you have any alternative ideas on how we can help users identify where the problems lie with their ST2 systems when RabbitMQ/PostgreSQL/MongoDB is broken, and st2ctl status says "everything is fine, nothing to see here"?
Just a small bit of (hopefully) relevant info:
On our st2 service we sometimes have to restart the mongod service, so I'll add my +1 to this.
I'd also add that a (slightly) more thorough check (not only PID / not-running status) would be good to have, such as a test connection to each service endpoint.
@vincent-legoll yes, I would like to have a /health API endpoint, or something like that.
Do you know why mongod needs restarting? We used to see some issues in the Mongo 2.x timeframe, but haven't heard many reports like that since 3.2/3.4
@LindsayHill a health API would be great.
And no, I still haven't investigated the MongoDB issue much, as I wasn't sure that restarting only mongod would be sufficient (previously I fully restarted st2 before restarting mongod).
db version v3.2.17
git version: 186656d79574f7dfe0831a7e7821292ab380f667
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
    distmod: rhel70
    distarch: x86_64
    target_arch: x86_64
I'll do more in-depth debugging next time
I'm 👍 on the feature of st2ctl showing that MongoDB & RabbitMQ are accessible & in a working state from the StackStorm core perspective.
E.g. there should be a Python check from the st2 service's point of view.
For now I'm 👎 on PostgreSQL and 👎 on nginx, since st2mistral might not be installed at all, and nginx is not a strict requirement either: it may be replaced by any other proxy server, or the 9101 ports may be used directly.
We could start with MongoDB + RabbitMQ checks, since these are the core required services, and solve the most common issues from the user's perspective, when these 2 backend services are unavailable for some reason.
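A minimal sketch of what such a Python-side check could look like, assuming a plain TCP connection attempt is an acceptable first approximation of "accessible". The names `CORE_SERVICES`, `check_tcp` and `report` are hypothetical, not part of st2, and the hosts and ports are the usual defaults for MongoDB and RabbitMQ, assumed here rather than read from any config. Note that an open port only shows the service is listening, not that it is fully working.

```python
import socket

# Assumed defaults: MongoDB on 27017, RabbitMQ (AMQP) on 5672, both local.
CORE_SERVICES = {
    "mongodb": ("127.0.0.1", 27017),
    "rabbitmq": ("127.0.0.1", 5672),
}


def check_tcp(host, port, timeout=3.0):
    """Return (ok, detail) for a plain TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "reachable"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution errors.
        return False, str(exc)


def report(services=CORE_SERVICES):
    """Probe each service and return one status line per service."""
    lines = []
    for name, (host, port) in sorted(services.items()):
        ok, detail = check_tcp(host, port)
        state = "OK" if ok else "FAILED"
        lines.append(f"{name} {host}:{port} {state} ({detail})")
    return lines
```

Calling `print("\n".join(report()))` would print one line per backend, which could sit alongside the existing per-process output.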
st2ctl status reports on the status of st2 processes. It does not report anything about nginx, MongoDB, PostgreSQL or RabbitMQ. This can be confusing for users who are experiencing issues with those services. They run st2ctl status, and it appears to report that everything is working, when in reality RabbitMQ is broken because they ran out of disk space. This wastes everyone's time.
The challenge here is that those dependencies may be running on a separate system.
Perhaps we should add a check that makes a test connection to those services (based on their definition in /etc/st2/st2.conf), and reports the results?
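As a rough illustration of the config-driven part, here is a sketch that derives the endpoints to test from an INI-style file. It assumes the file has a `[database]` section with `host` and `port` options and a `[messaging]` section with an AMQP `url` option; check your own st2.conf, as those option names and the fallback values are assumptions. The function name `endpoints_from_config` is hypothetical.

```python
import configparser
from urllib.parse import urlsplit


def endpoints_from_config(path="/etc/st2/st2.conf"):
    """Return {service: (host, port)} derived from an st2.conf-style file."""
    # interpolation=None so a literal "%" in a password is left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is skipped, so the fallbacks apply

    # Assumed option names and defaults for the MongoDB connection.
    db_host = parser.get("database", "host", fallback="127.0.0.1")
    db_port = parser.getint("database", "port", fallback=27017)

    # Assumed option name and default for the RabbitMQ connection URL.
    amqp_url = parser.get(
        "messaging", "url", fallback="amqp://guest:guest@127.0.0.1:5672/"
    )
    amqp = urlsplit(amqp_url)

    return {
        "mongodb": (db_host, db_port),
        "rabbitmq": (amqp.hostname or "127.0.0.1", amqp.port or 5672),
    }
```

Each returned `(host, port)` pair can then be passed to `socket.create_connection` with a short timeout, which also works when the dependencies run on a separate system.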