caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

admin: /reverse_proxy/upstreams endpoint should provide info on active health checks #6135

Open ottenhoff opened 7 months ago

ottenhoff commented 7 months ago

Use case: I know how to health-check my upstreams separately, but I want to quickly understand which upstreams Caddy believes are healthy or unhealthy.

I am using active health checks, and this is what the endpoint currently returns:

$ curl http://localhost:2019/reverse_proxy/upstreams | jq
[
  {
    "address": "10.1.67.25:8156",
    "num_requests": 0,
    "fails": 0
  },
  {
    "address": "10.2.67.25:9286",
    "num_requests": 0,
    "fails": 0
  }
]

Other load balancers provide info on each upstream:

  1. number of active checks run since last restart
  2. number of active checks that failed
  3. Healthy/unhealthy flag
  4. "Last" check status (e.g., if you require 3 successive checks to become healthy then the last check would be an early indicator that your upstream is on its way to healthy)
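
The four items above can be sketched as a small per-upstream record. This is only an illustration of the requested information, not anything Caddy exposes today: the class name, field names, and the rule that a single failed check marks the upstream unhealthy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveCheckStats:
    """Hypothetical per-upstream counters; not part of Caddy's API."""

    passes_to_healthy: int = 3          # successive passes required to become healthy
    checks_run: int = 0                 # 1. active checks run since last restart
    checks_failed: int = 0              # 2. active checks that failed
    healthy: bool = False               # 3. healthy/unhealthy flag
    last_passed: Optional[bool] = None  # 4. result of the most recent check
    streak: int = 0                     # current run of successive passes

    def record(self, passed: bool) -> None:
        """Update the counters with the result of one active check."""
        self.checks_run += 1
        self.last_passed = passed
        if passed:
            self.streak += 1
            if self.streak >= self.passes_to_healthy:
                self.healthy = True
        else:
            self.checks_failed += 1
            self.streak = 0
            # Assumption: one failed check is enough to mark the upstream down.
            self.healthy = False
```

With `passes_to_healthy=3`, an upstream that has passed two checks in a row would still report `healthy=False` but `last_passed=True`, which is the early indicator described in item 4.
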

francislavoie commented 7 months ago

The problem is that health status is not actually in the upstream storage. Health is determined by comparing what is in storage against the proxy config, as explained at https://caddyserver.com/docs/api#get-reverse-proxyupstreams.

Basically, you could have more than one proxy configured, each with different settings for max_fails, so an upstream might be healthy for one proxy but not the other.

The admin API can only show what's actually in storage (one global storage pool server-wide); it doesn't know which proxy you're asking about.

ottenhoff commented 7 months ago

Right, for a passive check, I understand that the health of an upstream depends on the proxy config. What about an active check? I'd like to know how many active health checks are passing/failing.

francislavoie commented 7 months ago

The active health check state is stored on each individual proxy instance's reference to the upstream, not in the global store, so we don't have any way to access it from the admin API at this time.

Basically, the current data structures make it impossible to track this kind of information from the admin API. We'll need to rework things at some point to make it possible.

What you can do, though, is hook into events, which would let you react to a change in the health status of an upstream.

mholt commented 7 months ago

Yeah, passive "health status" (up/down) is a matter of interpreting the number of fails currently remembered. It's up to you to decide whether that constitutes down or not. You could simply use the same parameters as your config, at least for now.
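
A minimal sketch of that interpretation, assuming an upstream counts as down once its remembered `fails` reaches the proxy's `max_fails`. The function name is made up for the example, and the `fails` values in the sample are illustrative rather than real output; in practice the list would come from the `/reverse_proxy/upstreams` endpoint shown earlier.

```python
import json


def interpret_upstreams(upstreams, max_fails):
    """Map each upstream address to True (up) or False (down).

    Assumption: an upstream is considered down once its remembered
    fail count reaches max_fails, mirroring the proxy's own setting.
    """
    return {u["address"]: u["fails"] < max_fails for u in upstreams}


# Illustrative payload in the shape returned by /reverse_proxy/upstreams.
sample = json.loads("""
[
  {"address": "10.1.67.25:8156", "num_requests": 0, "fails": 0},
  {"address": "10.2.67.25:9286", "num_requests": 0, "fails": 2}
]
""")

# The same stored counts give different answers for different proxy configs.
strict = interpret_upstreams(sample, max_fails=1)
lenient = interpret_upstreams(sample, max_fails=3)
```

Here `strict` treats the second upstream as down while `lenient` treats it as up, which is why the admin API cannot report a single health flag from the global store alone.
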

I have some ideas on how to make this more usable with API endpoints, but with a lot on my plate right now, it would probably need a sponsorship to fund its development.