caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com

Reverse proxy startup health check behavior results in 503 errors #6410

Open elee1766 opened 3 months ago

elee1766 commented 3 months ago

Currently, a remote is marked unhealthy if no active health checks have been performed against it yet.

This causes the reverse proxy to return 503 before the first health check has completed, even if the remote is actually healthy, in the window between the config load completing and that first check.

There are a few possible solutions to the problem, but we have not decided which one is correct.

  1. Block/hold requests while all remotes have no health history yet (see the sketch after this list).
  2. Block Caddy's provisioning until one round of health checks has completed (#6407).
  3. Set the default active health state of a not-yet-checked remote to healthy.
  4. Make option 3 configurable, with a sane default.
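
For illustration, here is a minimal sketch of what option 1 could look like: a gate that holds incoming requests until the first round of active health checks has finished (or a safety timeout elapses). All names here are hypothetical; this is not Caddy's actual implementation.

```go
// Sketch only: a gate that delays proxying until the first active
// health-check round has completed. Hypothetical names, not Caddy's API.
package healthgate

import (
	"context"
	"errors"
	"sync"
	"time"
)

// firstCheckGate is closed once the first round of active health checks finishes.
type firstCheckGate struct {
	once sync.Once
	done chan struct{}
}

func newFirstCheckGate() *firstCheckGate {
	return &firstCheckGate{done: make(chan struct{})}
}

// markFirstRoundDone is called by the health checker after its first full pass.
func (g *firstCheckGate) markFirstRoundDone() {
	g.once.Do(func() { close(g.done) })
}

// wait holds a request until the first round is done, the request is
// canceled, or a safety timeout elapses, so requests never hang forever.
func (g *firstCheckGate) wait(ctx context.Context, timeout time.Duration) error {
	select {
	case <-g.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(timeout):
		return errors.New("no health-check results yet")
	}
}
```

The timeout matters so that requests are never held indefinitely if a backend never answers its first check; after it fires, the proxy can fall back to the current 503 behavior.
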
mholt commented 3 months ago

(As clarified in Slack, this is about active health checks.)

My vote is currently for no. 1.

2 - is a NO from me because blocking provisioning can make config reloads slow, and we strive to keep them fast and lightweight.

3 - is a NO from me because if the proxy is started before the backends, we can't assume that backends are healthy right away. IMO, active health checks should assume unhealthy unless proven otherwise by a passing health check (compared to passive health checks, which assume healthy until proved otherwise).

Number 1 is nice because it allows the server/config to start quickly, and requests don't have to fail (even if they are delayed briefly). We also don't serve bad status information. I imagine health checks, especially passing ones, complete very quickly, so the blocking should be nearly instantaneous, probably under a quarter of a second.

ottenhoff commented 3 months ago

Note that `health_passes 3` means that, after failing, an upstream node needs to pass three successive health checks to become healthy again.
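
Just to illustrate those semantics, here is a rough sketch of the "N successive passes" bookkeeping (hypothetical names, not Caddy's actual code):

```go
// Sketch of "require N consecutive passes before marking healthy again".
// Hypothetical names; not Caddy's actual implementation.
package healthcheck

// upstreamHealth tracks one backend's active health-check results.
type upstreamHealth struct {
	requiredPasses int  // e.g. health_passes 3
	passStreak     int  // consecutive successful checks so far
	healthy        bool
}

// recordCheck updates the state after one active health check. A single
// failure resets the streak here; a health_fails-style threshold would be
// handled symmetrically.
func (u *upstreamHealth) recordCheck(passed bool) {
	if !passed {
		u.passStreak = 0
		u.healthy = false
		return
	}
	u.passStreak++
	if u.passStreak >= u.requiredPasses {
		u.healthy = true
	}
}
```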

I'm okay with option 1 as long as the current behavior remains, where a health check is fired immediately and the block is near-instantaneous.

I believe other load balancers, such as the paid Nginx (NGINX Plus), assume that all listed upstreams are healthy after a reload/restart and don't take them out of the mix until their health checks fail.

elee1766 commented 3 months ago

> I believe other load balancers, such as the paid Nginx (NGINX Plus), assume that all listed upstreams are healthy after a reload/restart and don't take them out of the mix until their health checks fail.

Basically correct. During my investigation I found that NGINX Plus and Traefik set the initial state of a backend to healthy when no health checks have been made against it yet. However, they do preserve history across restarts for the same hosts (as does Caddy, I believe).

mholt commented 3 months ago

> Basically correct. During my investigation I found that NGINX Plus and Traefik set the initial state of a backend to healthy when no health checks have been made against it yet.

I didn't think about what other servers do when we implemented health checks, but this is surprising to me... it feels wrong for active health checks to assume a healthy backend without checking first. Marking them as healthy when you don't actually know seems... misleading?

> However, they do preserve history across restarts for the same hosts (as does Caddy, I believe).

Caddy preserves the health status across config reloads, but if the process quits, that in-memory state is cleared. We don't persist it to storage as of yet.
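
For what it's worth, persisting that status would conceptually just mean writing the health map somewhere durable at shutdown and reading it back at startup. A minimal sketch using a plain JSON file (deliberately ignoring Caddy's actual storage APIs; all names hypothetical):

```go
// Sketch only: persisting a health snapshot across process restarts by
// writing it to a JSON file. Hypothetical; not Caddy's storage API.
package healthpersist

import (
	"encoding/json"
	"os"
)

// snapshot maps an upstream address to whether it was healthy at shutdown.
type snapshot map[string]bool

// save writes the snapshot to disk, e.g. on graceful shutdown.
func save(path string, s snapshot) error {
	b, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0o600)
}

// load reads a previous snapshot at startup; a missing file just means
// there is no history yet.
func load(path string) (snapshot, error) {
	b, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return snapshot{}, nil
	}
	if err != nil {
		return nil, err
	}
	var s snapshot
	if err := json.Unmarshal(b, &s); err != nil {
		return nil, err
	}
	return s, nil
}
```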

elee1766 commented 3 months ago

> Marking them as healthy when you don't actually know seems... misleading?

I think the argument can be made that marking them unhealthy is equally misleading. The remote is in superposition: since it has not been observed, it's in a third, distinct state that is currently handled as the unhealthy case. It seems existing implementations tip the scale slightly in favor of treating that unknown state as healthy; my guess is that this is in order to get a faster time to first response.
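
A tiny sketch of that "third state" idea, with the treatment of the unknown state made configurable (roughly options 3 and 4 above); names are hypothetical, not Caddy's types:

```go
// Sketch of modeling "never checked" as its own state instead of folding it
// into unhealthy. Hypothetical names; not Caddy's actual types.
package healthstate

// healthState distinguishes "never checked" from healthy/unhealthy.
type healthState int

const (
	healthUnknown   healthState = iota // no active check has run yet
	healthHealthy
	healthUnhealthy
)

type upstream struct {
	state healthState
}

// usable reports whether the upstream may receive traffic.
// treatUnknownAsHealthy corresponds to making option 3 configurable (option 4).
func (u *upstream) usable(treatUnknownAsHealthy bool) bool {
	switch u.state {
	case healthHealthy:
		return true
	case healthUnknown:
		return treatUnknownAsHealthy
	default:
		return false
	}
}
```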