OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net

Feature Request: HTTP /api/status REST API #1584

Open HariSekhon opened 5 years ago

HariSekhon commented 5 years ago

Feature Request to add a diagnostic /api/status API endpoint.

This should do self diagnostic checks and return an overall {"status": "ok"} or something similar (there can be other more detailed fields too).
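
For illustration only, the response could look something like this (every field besides `status` is a hypothetical example, not a proposed spec):

```json
{
  "status": "ok",
  "uptime_seconds": 86400,
  "checks": {
    "initialized": "ok",
    "hbase": "ok"
  }
}
```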

This is useful for load-balanced OpenTSDB farms like the ones I run at scale, and for OpenTSDB-on-Kubernetes health checks (a setup I've done recently), as a way of detecting which TSDs are fully healthy and should be in the service endpoint pool.

Currently I use /api/version to check that the TSDs are up, but in my opinion this is too rudimentary, as it isn't tied to any self-diagnostics. Other technologies I've worked with often provide /api/status or a similar REST endpoint for health check purposes.

The environment I'm currently working in historically used a query against a specific metric as its health check, but I advised against this because HBase can have partial data outages during any RegionServer issue, or even during RegionServer rolling restarts. We later had a completely unnecessary load balancer 503 outage: one region was temporarily unavailable / in transition, which caused the health checks to fail for every TSD behind the load balancer, since they all hit that down/in-transition region. That is why I promptly switched the health checks to /api/version, to test the TSD instances rather than any particular region on the backend data store.

itamarst commented 5 years ago

Any ideas for what sort of healthchecks would be useful? As in, what would you consider a failure for which the healthcheck should return a not-OK status?

HariSekhon commented 5 years ago

At a minimum it should check that the TSD is properly initialized and ready to accept queries, and that the data backend is available. It should not do this by querying any particular metric: in the case of HBase, that metric's region could be temporarily in transition without affecting most other metrics, and the failed check could cause a needless outage of the entire farm.

It should also return non-OK on a controlled shutdown, with a delay of 10+ seconds (preferably configurable) between setting non-OK and actually ceasing to accept queries, so that a load balancer can detect the change, drop the TSD out of the pool gracefully, and stop sending it queries before any are lost.

Any other diagnostics from the normal running of the process could also feed into the status, but that question is best answered by the core developers.
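
To make the shutdown semantics concrete, here is a minimal sketch of the flag-plus-drain behavior described above. All of these names are hypothetical, not existing OpenTSDB code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not existing OpenTSDB code.
public final class StatusFlag {
  private final AtomicBoolean healthy = new AtomicBoolean(false);

  /** Called once the TSD is initialized and the backend is reachable. */
  public void markHealthy() { healthy.set(true); }

  /** HTTP status for /api/status: 200 keeps the TSD in the pool, 503 drops it. */
  public int httpStatus() { return healthy.get() ? 200 : 503; }

  /**
   * Called at the start of a controlled shutdown.  Flips to non-OK first,
   * then keeps serving during the drain window so the load balancer has
   * time to see the 503 and remove this TSD before queries stop.
   */
  public void beginShutdown(final long drainSeconds) throws InterruptedException {
    healthy.set(false);
    TimeUnit.SECONDS.sleep(drainSeconds);  // e.g. 10+, ideally configurable
  }
}
```

The ordering is the point: flip to non-OK first, keep answering queries through the drain window, then stop accepting them.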

itamarst commented 5 years ago

So looking through the code, TSDB.checkNecessaryTablesExist() seems like a reasonable first-pass candidate for OK-ness:

Verifies that the data and UID tables exist in HBase and optionally the tree and meta data tables if the user has enabled meta tracking or tree building.
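
A minimal sketch of how a status handler could use that call, with a bounded wait so a hung HBase connection reads as unhealthy; the helper class and the 5-second timeout are my assumptions, only TSDB.checkNecessaryTablesExist() is real:

```java
import net.opentsdb.core.TSDB;

// Hypothetical helper, not existing OpenTSDB code.
final class TableCheck {
  /** Returns true if the necessary HBase tables appear reachable. */
  static boolean tablesLookHealthy(final TSDB tsdb) {
    try {
      // join() with a timeout so a hung HBase connection reads as
      // unhealthy instead of blocking the /api/status handler forever.
      tsdb.checkNecessaryTablesExist().join(5000 /* ms, assumed value */);
      return true;
    } catch (Exception e) {
      return false;
    }
  }
}
```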

itamarst commented 5 years ago

And RpcManager.isInitialized() can check whether startup has finished, and RpcManager.shutdown() can set a flag for shutdown. In the current code the latter happens as the first stage of shutdown, so either the flag is set at the beginning of RPC shutdown, or RPC shutdown has already finished, in which case /api/status will fail anyway and that's fine.

Possibly HTTP queries won't even work while RpcManager.isInitialized() is false, but I'd have to dig deeper to figure that out. Perhaps it should just be: start in a startup status, transition to ok the first time checkNecessaryTablesExist() succeeds, and after that only ever transition to shutting-down.
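
That "only ever move forward" lifecycle could be captured with one-way transitions, sketched here with hypothetical names (only RpcManager.isInitialized() and TSDB.checkNecessaryTablesExist() are real):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the lifecycle described above.
public final class LifecycleState {
  public enum State { STARTING, OK, SHUTTING_DOWN }

  private final AtomicReference<State> state =
      new AtomicReference<State>(State.STARTING);

  /** STARTING -> OK, e.g. the first time checkNecessaryTablesExist() succeeds. */
  public void markReady() {
    state.compareAndSet(State.STARTING, State.OK);
  }

  /** Any state -> SHUTTING_DOWN, e.g. at the start of RpcManager.shutdown(). */
  public void markShuttingDown() {
    state.set(State.SHUTTING_DOWN);
  }

  public State current() {
    return state.get();
  }
}
```

Because markReady() only succeeds from STARTING, the state can never move backwards out of SHUTTING_DOWN.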

itamarst commented 5 years ago

Digging further, checkNecessaryTablesExist() isn't sufficient; it seems like it'd have the same issue @HariSekhon is trying to avoid. It uses HBaseClient.ensureTableExists(), which says:

// Just "fault in" the first region of the table.  Not the most optimal or
// useful thing to do but gets the job done for now.  TODO(tsuna): Improve.
itamarst commented 5 years ago

So plausibly I can (a rough sketch follows the list):

  1. Get all regions and their starting keys from HBaseClient.locateRegions().
  2. Do a key lookup (à la the private probeKey in HBaseClient) for each region.
  3. If all regions are up, that's status "ok", otherwise it's status "partial".
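
A rough sketch of that loop under stated assumptions: the RegionLocation accessor name and the timeouts are guesses on my part, and since probeKey is private, a plain GetRequest on each region's start key stands in for it:

```java
import java.util.List;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.RegionLocation;

// Hypothetical sketch of the per-region probe, not existing code.
final class RegionProbe {
  static String probeRegions(final HBaseClient client, final byte[] table)
      throws Exception {
    // 1. All regions of the table and their start keys.
    final List<RegionLocation> regions =
        client.locateRegions(table).join(5000 /* ms, assumed */);
    int up = 0;
    for (final RegionLocation region : regions) {
      try {
        // 2. Touch one key per region; a timeout or region-offline error
        //    means that region is down or in transition.
        //    (startKey() is an assumed accessor name.)
        client.get(new GetRequest(table, region.startKey()))
              .join(5000 /* ms, assumed */);
        up++;
      } catch (Exception e) {
        // Region unavailable; keep probing the others.
      }
    }
    // 3. All regions up -> "ok"; some -> "partial"; none -> "down".
    return up == regions.size() ? "ok" : (up > 0 ? "partial" : "down");
  }
}
```
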
itamarst commented 5 years ago

I have a first pass implementation here: https://github.com/OpenTSDB/opentsdb/compare/next...itamarst:1584-status-api?expand=1

Testing the error mode is a little tricky without a full-fledged HBase cluster, though. If my HBase-in-Docker is completely down, the status API query times out because OpenTSDB keeps trying to reconnect. That's plausibly fine, since a load balancer would interpret the timeout as "down".

But a partial outage, with root HBase up but RegionServers down, is harder for me to test.

@HariSekhon do you think you could give this branch a try on one of the clusters you have access to?

HariSekhon commented 5 years ago

@itamarst I don't work for that company any more and no longer work with OpenTSDB; I'm off working on other things now, sorry!