st2 services ignore MongoDB failures

arm4b commented 4 years ago

Problem

When StackStorm services are already running and the MongoDB backend suddenly goes down, the services usually do not detect the connection error, do not report it in the logs, and do not proactively try to reconnect. They keep running and look "alive" as if nothing happened.

Reproducing

1) Start StackStorm, follow the logs
2) Stop MongoDB
3) Notice that StackStorm services DGAF about any MongoDB connection issues

Bonus points go to st2api, which even keeps responding to HTTP requests as usual, just with empty results.

It turns out that services start reporting connection errors only when they are processing something and waiting for a response to a DB request. That can take tens of minutes, depending on the st2 cluster workload.

This lazy behavior leads to a situation where we think a service is working OK, while in fact it is just pretending and losing incoming requests because it has no DB connection. https://github.com/StackStorm/st2/issues/4777 and #4020 are somewhat related.

Expected behavior

Good behavior would be for the mongo client to verify the connection in a background loop and report any error in the logs ASAP.

Ideally, it would support a heartbeat setting in st2.conf and proactively check DB connection aliveness.

See also HeartbeatLogger and ConnectionPoolLogger in https://api.mongodb.com/python/current/api/pymongo/monitoring.html
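To illustrate the kind of background check meant here, below is a minimal standard-library sketch, not actual st2 or pymongo code. The `HeartbeatMonitor` class, its `ping` callable and the `interval` parameter are all hypothetical names made up for this example; the only idea taken from the issue is "check the connection in a loop and log as soon as it breaks".

```python
import logging
import threading

LOG = logging.getLogger("db_heartbeat")


class HeartbeatMonitor:
    """Hypothetical helper: calls `ping` periodically and logs state changes."""

    def __init__(self, ping, interval=10.0):
        self._ping = ping            # callable that raises when the DB is unreachable
        self._interval = interval    # seconds between checks (assumed setting)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.healthy = None          # unknown until the first check completes

    def check_once(self):
        """Run a single check, log transitions, and return the current health."""
        try:
            self._ping()
            ok = True
        except Exception as exc:
            ok = False
            # Log only on the transition, so a long outage does not flood the log.
            if self.healthy is not False:
                LOG.error("Database connection lost: %s", exc)
        if ok and self.healthy is False:
            LOG.info("Database connection restored")
        self.healthy = ok
        return ok

    def _run(self):
        # Event.wait() doubles as an interruptible sleep between checks.
        while not self._stop.is_set():
            self.check_once()
            self._stop.wait(self._interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

With pymongo, `ping` could presumably be something like `lambda: client.admin.command("ping")`. The interval here is only a stand-in for the heartbeat setting proposed above; pymongo's own `heartbeatFrequencyMS` client option and monitoring listeners may make a hand-rolled loop like this unnecessary.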

arm4b commented 4 years ago

Based on https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#whats-the-point-of-periodic-monitoring

What's the point of periodic monitoring?

Periodic monitoring accomplishes three objectives:

Update each server's type, tags, and round trip time. Read preferences and the mongos selection algorithm require this information remains up to date.

Discover new secondaries so that secondary reads are evenly spread.

Detect incremental changes to the replica set configuration, so that the client remains connected to the set even while it is migrated to a completely new set of hosts. If the application uses some servers very infrequently, monitoring can also proactively detect state changes (primary stepdown, server becoming unavailable) that would otherwise cause future errors.

and https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#why-close-connections-when-a-node-is-shutting-down

Why close connections when a node is shutting down?

When a server shuts down, it will return one of the "node is shutting down" errors for each attempted operation and eventually will close all connections. Keeping a connection to a server which is shutting down open would only produce errors on this connection - such a connection will never be usable for any operations.

Heartbeat checking is critical for working reliably with a MongoDB cluster, especially in an HA replica set configuration.

From https://api.mongodb.com/python/3.9.0/examples/high_availability.html#health-monitoring

Health Monitoring

When MongoClient is initialized it launches background threads to monitor the replica set for changes in:

Health: detect when a member goes down or comes up, or if a different member becomes primary

Configuration: detect when members are added or removed, and detect changes in members’ tags

Latency: track a moving average of each member’s ping time

It looks like this is already an essential part of the pymongo implementation. We just need to take advantage of it in ST2.

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst is a goldmine of a doc: it describes the desired client-side implementation for the best MongoDB cluster experience.

trstruth commented 4 years ago

We've also observed that the workflow engine is very sensitive to mongo going down. In a replica set, if the primary goes down and a secondary is elected to replace it, that brief period of "downtime" can cause workflows to fail with mongo connectivity errors.

Some resiliency built into those processes would be great, especially considering we are trying to build an HA deployment of st2.
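One common way to get that kind of resiliency is to retry DB operations with a backoff while a new primary is being elected. The sketch below is only an illustration of the idea, not how st2 or the workflow engine handles it; `TransientDBError`, `retry_on_db_error` and its parameters are hypothetical names invented for this example.

```python
import functools
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def retry_on_db_error(attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped call on TransientDBError with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientDBError:
                    if attempt == attempts:
                        raise  # out of attempts, surface the error
                    # Delay doubles each time: base, 2*base, 4*base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

With pymongo, the exception to catch would likely be `pymongo.errors.AutoReconnect`. Retrying is only safe for operations that are idempotent, so this would need care around workflow state updates.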
