Open arm4b opened 4 years ago
What's the point of periodic monitoring?
Periodic monitoring accomplishes three objectives:
- Update each server's type, tags, and round-trip time. Read preferences and the mongos selection algorithm require that this information remain up to date.
- Discover new secondaries so that secondary reads are evenly spread.
- Detect incremental changes to the replica set configuration, so that the client remains connected to the set even while it is migrated to a completely new set of hosts. If the application uses some servers very infrequently, monitoring can also proactively detect state changes (primary stepdown, server becoming unavailable) that would otherwise cause future errors.
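The periodic check behind these objectives can be sketched as a simple background loop per server. This is a minimal stdlib illustration (all names here are made up; the real logic lives inside the driver), not pymongo's actual implementation:

```python
import threading


def monitor_server(check, interval=10.0, on_change=None, stop=None):
    """Illustrative periodic monitor: call `check()` every `interval` seconds
    and report state transitions via `on_change(old_state, new_state)`."""
    state = None
    stop = stop or threading.Event()
    while not stop.is_set():
        try:
            new_state = check()        # e.g. send a hello/isMaster command
        except Exception:
            new_state = "unavailable"  # connection error => mark server down
        if new_state != state and on_change:
            # Proactively surface primary stepdowns, nodes going down, etc.,
            # instead of waiting for the next user operation to fail.
            on_change(state, new_state)
        state = new_state
        stop.wait(interval)
    return state
```

The key property is that state changes are detected on the monitor's schedule, not on the application's next DB request.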
Why close connections when a node is shutting down?
When a server shuts down, it returns one of the "node is shutting down" errors for each attempted operation and eventually closes all connections. Keeping open a connection to a server that is shutting down would only produce errors on that connection; it will never be usable for any operation again.
Heartbeat checking is critical for stable operation with a MongoDB cluster, especially in an HA replica set configuration.
From https://api.mongodb.com/python/3.9.0/examples/high_availability.html#health-monitoring
Health Monitoring

When MongoClient is initialized it launches background threads to monitor the replica set for changes in:
- Health: detect when a member goes down or comes up, or if a different member becomes primary
- Configuration: detect when members are added or removed, and detect changes in members’ tags
- Latency: track a moving average of each member’s ping time
It looks like this is already an essential part of the pymongo implementation. We just need to take advantage of it in ST2.
https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst is a goldmine of documentation describing the desired client-side implementation for the best MongoDB cluster experience.
We've also observed that the workflow engine is very sensitive to mongo going down. In a replica set, if the primary goes down and a secondary is elected to replace it, that brief period of "downtime" can cause workflows to fail with mongo connectivity errors.
Some resiliency built into those processes would be great, especially considering we are trying to build a HA deployment of st2.
Problem
When StackStorm services are already running and the MongoDB backend suddenly goes down, no service detects the connection error, reports it in the logs, or tries to reconnect proactively. They keep running and appear "alive" as if nothing happened.
Reproducing
1) Start StackStorm and follow the logs
2) Stop MongoDB
3) Notice that StackStorm services completely ignore the MongoDB connection issues
Bonus points go to `st2api`, which even responds normally (with empty results) to HTTP requests.

It turns out that services start to report connection errors only when they're processing something and expecting a response from a DB request. This can take tens of minutes, depending on the st2 cluster workload.

This lazy behavior leads to a situation where we think a service is working OK, while in fact it's just pretending and losing incoming requests because it has no DB connection. https://github.com/StackStorm/st2/issues/4777 and #4020 are somewhat related.
Expected behavior
Good behavior would be for the mongo client to verify the connection in a background loop and report any error in the logs ASAP.

Ideally it could support a `heartbeat` setting in `st2.conf` and proactively check that the DB connection is alive.

Note also https://api.mongodb.com/python/current/api/pymongo/monitoring.html, in particular `HeartbeatLogger` and `ConnectionPoolLogger`.