elee1766 opened 3 months ago
(As clarified in Slack, this is about active Health checks.)
My vote is currently for no. 1.
2 is a NO from me because blocking provisioning can make config reloads slow, and we strive to keep them fast and lightweight.
3 is a NO from me because if the proxy is started before the backends, we can't assume that backends are healthy right away. IMO, active health checks should assume unhealthy unless proven otherwise by a passing health check (compared to passive health checks, which assume healthy until proven otherwise).
Number 1 is nice because it allows the server/config to start quickly, and the requests don't have to fail (even if they are delayed briefly). We also don't serve bad status information. I imagine health checks -- especially passing ones -- happen very quickly, so the blocking will be nearly instantaneous, probably less than a quarter of a second.
Note that `health_passes 3` means that after failing, an upstream node needs to pass three successive health checks to become healthy again.
I'm okay with 1 as long as the current behavior remains where a health check is immediately fired and the block is near instantaneous.
I believe other load balancers like Nginx (paid) assume that all listed upstreams are healthy after a reload/restart and don't take them out of the mix until the health checks fail.
> I believe other loadbalancers like Nginx (paid) assumes that all listed upstreams are healthy after a reload/restart and doesn't take them out of the mix until the health checks fail.
basically correct. during investigation i found that nginx plus and traefik treat a backend's initial state as healthy when no health checks have been made to it yet. However, they do preserve history across restarts to the same hosts (as does caddy, i believe)
> basically correct. during investigation i found that nginx plus and traefik set the initial state of the backend when no health checks have been made to them as healthy.
I didn't think about what other servers do when we implemented health checks, but this is surprising to me... it feels wrong for active health checks to assume a healthy backend without checking first. Marking them as healthy when you don't actually know seems... misleading?
> However, they do preserve history across restarts to the same hosts (as does caddy, i believe)
Caddy preserves the health status across reloads but if the process quits then the memory is cleared. We don't persist it to storage as of yet.
> Marking them as healthy when you don't actually know seems... misleading?
i think the argument can be made that marking them unhealthy is equally misleading. the remote is in superposition: since it has not been observed, it's a third distinct state that is currently handled as the unhealthy case. it seems existing implementations tip the scale slightly in favor of the healthy superposition; my guess is that it's in order to have a faster time to first response.
Currently, a remote is marked unhealthy if no active health checks to the remote have been done.
This causes the reverse proxy to return 503 before a health check is completed, even if the remote is truly healthy, in the window between the config load completing and the first health check.
There are a few possible solutions to the problem, but we have not decided which is correct.