Open HariSekhon opened 5 years ago
Any ideas for what sort of healthchecks would be useful? As in, what would you consider a failure for which the healthcheck should return a not-OK status?
At minimum it should check that the TSD is properly initialized and ready to accept queries, and that the data backend is available. It should not do so by querying any particular metric: with HBase, that metric's region could be temporarily in transition while most other metrics are unaffected, and the failing check could cause a needless outage of the entire farm.
It should also return non-OK during a controlled shutdown, with a delay of 10+ seconds (preferably configurable) between setting non-OK and actually stopping accepting queries, so a load balancer can detect the failure, gracefully drop the TSD out of the pool, and stop sending it queries before any are lost.
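The drain-before-stop sequence described above can be sketched roughly like this. This is a minimal illustration only; the class and method names are hypothetical, not actual OpenTSDB APIs:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: flip the health flag to not-OK first, wait a
// configurable drain period so the load balancer sees the failing
// health check and removes this instance, then proceed with shutdown.
class GracefulShutdown {
    private final AtomicBoolean healthy = new AtomicBoolean(true);
    private final long drainMillis;

    GracefulShutdown(long drainMillis) {
        this.drainMillis = drainMillis;
    }

    /** What a /api/status handler would report. */
    boolean isHealthy() {
        return healthy.get();
    }

    /** Mark not-OK, wait for the LB to drain us, then really stop. */
    void shutdown() throws InterruptedException {
        healthy.set(false);          // health checks start failing now
        Thread.sleep(drainMillis);   // give the LB time to drop us from the pool
        // ... proceed with the real shutdown (stop RPCs, flush, etc.)
    }
}
```

The key point is the ordering: the status flips before query serving stops, so no in-flight queries are lost while the load balancer catches up.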
Any other diagnostics covering the normal running of the process could also feed into the status, but that question is best answered by the core developers.
So looking through the code, TSDB.ensureNecessaryTablesExist() seems like a reasonable first-pass candidate for OK-ness:

Verifies that the data and UID tables exist in HBase and optionally the tree and meta data tables if the user has enabled meta tracking or tree building.

And RpcManager.isInitialized() can check whether startup has finished, while RpcManager.shutdown() can set a flag for shutdown. In the current code the latter happens as the first stage of shutdown, so either the flag is set at the beginning of RPC shutdown, or RPC shutdown has already finished, in which case /api/status will fail anyway and that's fine.

Possibly HTTP queries won't even work while RpcManager.isInitialized() is false, but I'd have to dig deeper to figure that out. Perhaps it should just be: start in startup status, transition to ok the first time ensureNecessaryTablesExist() succeeds, and after that only ever transition to shutting-down.
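The one-way startup → ok → shutting-down lifecycle described above can be sketched as a small state machine. This is illustrative only and uses no OpenTSDB types; how it would actually hook into TSDB/RpcManager is an assumption:

```java
// Hypothetical sketch of the proposed one-way status lifecycle:
// STARTUP -> OK (first time the table check passes) -> SHUTTING_DOWN.
class StatusTracker {
    enum Status { STARTUP, OK, SHUTTING_DOWN }

    private volatile Status status = Status.STARTUP;

    Status status() { return status; }

    /** Call when the necessary-tables check succeeds for the first time. */
    void markReady() {
        if (status == Status.STARTUP) {
            status = Status.OK;  // the only valid transition out of STARTUP
        }
    }

    /** Call at the very beginning of RPC shutdown. */
    void markShuttingDown() {
        status = Status.SHUTTING_DOWN;  // terminal; never goes back to OK
    }
}
```

Making SHUTTING_DOWN terminal means a late or repeated table check can never flip a draining TSD back to OK, which matches the "only transition to shutting-down" intent.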
Digging further, checkNecessaryTablesExist() isn't sufficient; it seems like it would have the same issue @HariSekhon is trying to avoid. It uses HBaseClient.ensureTableExists, which says:

// Just "fault in" the first region of the table. Not the most optimal or
// useful thing to do but gets the job done for now. TODO(tsuna): Improve.

So plausibly I can use HBaseClient.locateRegions() and then probe a key (probeKey in HBaseClient) for each region.

I have a first pass implementation here: https://github.com/OpenTSDB/opentsdb/compare/next...itamarst:1584-status-api?expand=1
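The per-region probe idea could look roughly like the following. Note this is a simplified sketch: the real HBaseClient API is asynchronous (Deferred-based), so the blocking Client interface here is a hypothetical stand-in for illustration only:

```java
import java.util.List;

// Sketch: locate every region of the table and probe one key in each,
// so that a single unavailable or in-transition region is detected
// instead of only "faulting in" the first region.
class RegionProbe {
    /** Hypothetical stand-in for the subset of HBaseClient used here. */
    interface Client {
        List<byte[]> locateRegionStartKeys(byte[] table);
        boolean probeKey(byte[] table, byte[] key);
    }

    /** Returns true only if every region of the table answers its probe. */
    static boolean allRegionsHealthy(Client client, byte[] table) {
        for (byte[] startKey : client.locateRegionStartKeys(table)) {
            if (!client.probeKey(table, startKey)) {
                return false;  // one bad region fails the whole check
            }
        }
        return true;
    }
}
```

Whether a single bad region should fail the whole check is itself debatable given the original concern about one in-transition region taking every TSD out of the pool; the sketch just shows the mechanics.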
Testing it for the error mode is a little tricky without a full-fledged HBase cluster, though. If my HBase-in-Docker is completely down then the status API query times out due to OpenTSDB continuously trying to reconnect. Plausibly this is OK since a load balancer would interpret that as "being down".
But partial failure modes, such as HBase's root up but region servers down, are harder for me to test.
@HariSekhon do you think you could give this branch a try on one of the clusters you have access to?
@itamarst I don't work for that company any more and no longer do OpenTSDB, I'm off working on other things now, sorry!
Feature Request to add a diagnostic /api/status API endpoint.

This should run self-diagnostic checks and return an overall {"status": "ok"} or something similar (there can be other, more detailed fields too).

This is useful for load-balanced OpenTSDB farms like the ones I run at scale, or for OpenTSDB-on-Kubernetes health checks (a setup I've done recently), as a way of detecting which TSDs are completely healthy and should be in the service endpoint pool.
Currently I use /api/version to check that the TSDs are up, but this is too rudimentary in my opinion as it isn't really tied to any self-diagnostics. Other technologies I've worked with often provide /api/status or a similar REST API endpoint for health-check purposes.

The environment I'm currently working in historically did a query against a specific metric as its health check, but I advised against this because HBase can have partial data outages during any RegionServer issue, or even during RegionServer rolling restarts. We later suffered a completely unnecessary load balancer 503 outage because one region was temporarily unavailable / in transition, which caused the health checks to fail for all TSDs on the load balancer, since they all hit that down/in-transition region. That is why I promptly switched the health checks to /api/version, which tests the TSD instances themselves rather than any particular region on the backend data store.