and the health checker start includes the initial request.
the issues are twofold.
the goroutine returns before the first round of h.doActiveHealthCheckForAllHosts
doActiveHealthCheckForAllHosts will return before any health checkers finish.
so the fixes are to
make sure provision doesn't return until activeHealthChecker has started, and make the goroutine start within activeHealthChecker
make sure that doActiveHealthCheckForAllHosts doesn't return until all checkers have finished.
of course, this "fix" also makes it so that if a health check hangs - caddy will stay in provisioning state until it times out. i think this is the correct behavior (caddy should fail health check until it has health checked all reverse proxies at least once), but maybe others disagree?
this change has the additional advantage of making it more difficult for configuration errors to stampede a remote
maybe there though needs to be better timeout guarantees on the underlying healthchecks though before this can be pushed? perhaps some people are relying on this weird behavior as well, i am not sure.
https://github.com/caddyserver/caddy/blob/master/caddytest/integration/reverseproxy_test.go#L359 there is a todo here:
TODO: for some reason this test seems particularly flaky, getting 503 when it should be 200, unless we wait
the reason is because health check hasn't succeeded, so there needs to be a wait to wait until the test starts.
the health checkers are started in goroutine, https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/healthchecks.go#L254-L264
and the health checker start includes the initial request. the issues are twofold.
so the fixes are to
of course, this "fix" also makes it so that if a health check hangs - caddy will stay in provisioning state until it times out. i think this is the correct behavior (caddy should fail health check until it has health checked all reverse proxies at least once), but maybe others disagree?
this change has the additional advantage of making it more difficult for configuration errors to stampede a remote
maybe there though needs to be better timeout guarantees on the underlying healthchecks though before this can be pushed? perhaps some people are relying on this weird behavior as well, i am not sure.