Closed: arschles closed this pull request 8 years ago.
Is this still WIP? The WIP in the title and the "awaiting review" tag are giving me mixed signals.
@krancour sorry about that. Yes, still in progress. I changed the label to indicate so.
Tested myself. To test yourself, spin up a deis-dev cluster and run `kd delete rc deis-builder && kd create -f manifests/deis-builder-rc.yaml && kd get pod -w`. Once you see the pod come up, wait for it to go into `Running` with `1/1` containers up.
@krancour this is now ready for review
Why not modify the fetcher to serve the health checks? Do we need another HTTP server to handle them? The fetcher already has a health check handler.
Anyway, we are thinking of moving away from the fetcher, so please ignore my previous comment.
If the k8s API is not accessible, how is that the builder's fault? It also doesn't make sense to me to check whether the API is accessible, since the builder won't start in the first place if it isn't. I do agree with checking whether S3 storage is accessible.

This is overall amazing code, but I feel the design is more complex than it needs to be. If we are standing up a separate health server, why do we need an entire class to atomically set a value? We could just write one more handler in the health server that sets the circuit state value, have sshd/server.go call that handler before this line (https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73), and have the health check handler poll for the circuit state change until the timeout is over.
Also, liveness probes shouldn't be checking S3 API accessibility.
The server already handles the ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping.
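For context, answering a ping-style global request in a `golang.org/x/crypto/ssh` server is roughly this small. The request type name and the surrounding loop below are illustrative assumptions, not the builder's actual code:

```go
package sshd

import "golang.org/x/crypto/ssh"

// handleRequests answers out-of-band SSH requests; a "ping" request gets an
// affirmative reply so a caller can tell the server is alive, and anything
// else is rejected.
func handleRequests(reqs <-chan *ssh.Request) {
	for req := range reqs {
		switch req.Type {
		case "ping":
			if req.WantReply {
				req.Reply(true, nil)
			}
		default:
			if req.WantReply {
				req.Reply(false, nil)
			}
		}
	}
}
```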
> If the k8s API is not accessible, how is that the builder's fault?
It's not the builder's fault, but the builder shouldn't be able to accept any request to do work when object storage, a dependency without which it can't do its job, is unavailable to it.
> If we are standing up a separate health server, why do we need an entire class to atomically set a value?
We don't necessarily need this `Circuit` type, but it presents the simplest abstraction between the SSH server and the health check server that I could find. The basic requirement is that the SSH server can communicate its status to the health check server in a concurrency-safe manner.
I tried a few other implementations, including a `sync.WaitGroup` implementation and a channel-based implementation, but both ended up being more code with more complexity.
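For illustration only, here is a minimal sketch of what such a concurrency-safe on/off switch could look like in Go, using `sync/atomic`; the names and exact shape of the real `Circuit` type in this PR may differ:

```go
package sshd

import "sync/atomic"

// CircuitState represents whether the circuit is open (not ready) or closed (ready).
type CircuitState uint32

const (
	OpenState   CircuitState = iota // SSH server not yet ready
	ClosedState                     // SSH server ready to serve
)

// Circuit is a concurrency-safe on/off switch. The SSH server closes it once
// it is listening, and the health check server reads its state on each probe.
type Circuit struct {
	state uint32
}

// Close marks the circuit closed (ready to serve).
func (c *Circuit) Close() {
	atomic.StoreUint32(&c.state, uint32(ClosedState))
}

// State returns the current state of the circuit.
func (c *Circuit) State() CircuitState {
	return CircuitState(atomic.LoadUint32(&c.state))
}
```

A single atomic word is enough for this requirement because the only signal needed is a one-way transition from "not ready" to "ready", with no coordination between multiple writers.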
> We could just write one more handler in the health server that sets the circuit state value, have sshd/server.go call that handler before this line (https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73)
I'm not 100% sure I understand what you mean here, but the closing of the circuit happens right above that line, and that is the logic that effectively communicates to the health check server that the SSH server is ready.
IMO it doesn't make as much sense to call an HTTP endpoint as it does to make a function call from within the same process.
> and have the health check handler poll for the circuit state change until the timeout is over.
The health check server essentially does poll the circuit state because Kubernetes polls the health check endpoint. I opted not to implement any polling logic in the health check endpoint in favor of just reusing the Kubernetes logic.
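Concretely, that means the handler can just read the circuit state once per request and return 503 until the SSH server is ready. A rough sketch, building on the `Circuit` sketch above (the handler name and wiring are assumptions, not necessarily the PR's actual code):

```go
package sshd

import "net/http"

// healthzHandler returns 200 only after the SSH server has closed the circuit;
// until then it returns 503 and relies on Kubernetes to poll the endpoint again.
func healthzHandler(c *Circuit) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if c.State() != ClosedState {
			http.Error(w, "SSH server not yet started", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```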
> The server already handles the ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping.
I think implementing the health check logic in the SSH ping handler is suboptimal. Here's my thinking: the ping handler is exercised by end users doing `git push`es. By putting the health check logic as-is into the ping (and I believe the health check logic we have now is 100% necessary), we allow any end user of the builder to (possibly unwillingly or unknowingly) exercise our object store and the Kubernetes API arbitrarily.

@bacongobbler are you referring to the `AbsPath("/healthz")` bug?
Sorry, that comment was in regard to refactoring pkg/log to use the standard log package, as in this commit. GitHub doesn't really show well that I was commenting on that commit.
Ah, got it. I see the tiny text where it says you commented on the commit now.
Filed https://github.com/deis/pkg/issues/19 to track
Code LGTM
Gonna hold off on merging this until the discussion at https://github.com/deis/builder/pull/149#issuecomment-183126063 is complete
FWIW, here is my view on health checking logic:
Neither probe is about establishing "fault"; the purpose of the probes is to indicate that the system believes itself to be in a position to fulfill its service contract. That's it. If it can't reach a database, cache, API endpoint, etc., it should fail its check.
That said, I am a little concerned about probes being dependent on the state of Kubernetes itself, so k8s API service availability is sort of a gray area. Originally I was opposed to failing healthz based on k8s API server problems, but I've changed my mind. Here's the use case that caused me to change my mind:
In this case, the builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.
A liveness probe should not fail if k8s is down or s3 is unreachable, and here is why: a failing liveness probe will cause the pod to restart when those external components fail, which in turn means the user cannot even do a `git push`, but instead gets an unreachable-hostname error if there are no builders left to service the request.
As a user (developer), I want to do a `git push` that fails gracefully with a good error message if dependent services are down, along the lines of "unable to process the git push right now, X and Y are down".

The builder is alive and can give good information back to the user; making the liveness probe depend on third-party things is bad UX, since the whole builder becomes unavailable at that point. I'm proposing we do the deeper health check during `git push` and in the readiness probe, but not in the liveness probe. They serve different purposes, and devs / operators have different informational needs for each.
I'd like to hear from the other side how we'd handle the scenario of the builder being down: how do we tell the developers (and operators) what is going on, and how do they effectively debug things, when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?
To chime in my thoughts as well:
> In this case, the builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.
This is great for an administrator, but not for a user. In the current implementation the user has no view into why the builder is down. If we were to go down this route, a status page (similar to http://status.docker.com/) would be invaluable to a user to see what systems are down, bridging the communication gap between users and administrators.
> I'd like to hear from the other side how we'd handle the scenario of the builder being down: how do we tell the developers (and operators) what is going on, and how do they effectively debug things, when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?
^^
Another solution proposed offline was to fail a `git push` gracefully, but internally perform continuous health checks on upstream services like k8s/minio, logging whenever they report as down. That way we can pick that up in https://github.com/deis/monitor with alerting (not implemented in telegraf AFAIK, but not something we couldn't do ourselves).
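A rough sketch of that idea, with a hypothetical periodic checker whose failures are only logged; the package, names, and signatures here are made up for illustration:

```go
package healthwatch

import (
	"log"
	"time"
)

// upstreamCheck is a hypothetical probe for a single dependency
// (e.g. the k8s API or minio); it returns an error when the dependency
// is unreachable.
type upstreamCheck struct {
	name  string
	check func() error
}

// watchUpstreams runs every check on a fixed interval and logs failures, so
// an external system such as deis/monitor can alert on the log stream rather
// than a liveness probe killing the pod.
func watchUpstreams(interval time.Duration, checks []upstreamCheck) {
	for range time.Tick(interval) {
		for _, c := range checks {
			if err := c.check(); err != nil {
				log.Printf("upstream %s is unhealthy: %v", c.name, err)
			}
		}
	}
}
```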
OK, after extensive internal discussion, the consensus is as follows:

- For the `v2-beta` release, the logic in this PR will suffice as-is
- As a follow-up, make the `git push` logic respond with an informative error message during downstream dependency failures (#183)
Fixes #92
Fixes #89
Ref deis/deis#4809
When this is done, merge deis/charts#93
PR Overview

- A new health check server (serving at `/healthz`), intended to be used by the `livenessProbe` and `readinessProbe`
- Two new `interface`s - for S3 interaction and Kubernetes API interaction - so that those two components can be mocked and the server can be more easily unit tested
- The `/healthz` endpoint does all three checks concurrently, and only fails if one or more of them don't complete before a hard-coded timeout, or any of them fail before the timeout (see the sketch after the TODO list below)
- A `(github.com/deis/builder/pkg/sshd).Circuit` type, which acts as a simple, concurrency-safe on/off switch for communication across concurrency barriers. A `Circuit` is used to allow the SSH server to tell the health check server when it's ready to serve
- `boot.go` now serves the SSH server and the health check server concurrently, but kills the process with an error code if either fails
- Updates to the `deis-builder-rc.yaml` manifest
- The `Makefile` now only looks for non-vendored packages when `make test` is run, which makes all runs faster

Still TODO:

- `healthZHandler`
- `TestOpenCloseConcurrent`
- `make test` output: `pkg/sshd/server_test.go:106: not enough arguments in call to Serve`
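As referenced in the overview list above, here is a sketch of the "run all checks concurrently, fail on the first error or on timeout" pattern; the check signatures are assumptions, not the PR's actual interfaces:

```go
package sshd

import (
	"errors"
	"time"
)

// runChecks runs every check in its own goroutine and returns nil only if
// all of them succeed before the timeout; the first error, or hitting the
// timeout, fails the whole health check.
func runChecks(timeout time.Duration, checks ...func() error) error {
	results := make(chan error, len(checks))
	for _, check := range checks {
		go func(c func() error) { results <- c() }(check)
	}
	deadline := time.After(timeout)
	for i := 0; i < len(checks); i++ {
		select {
		case err := <-results:
			if err != nil {
				return err
			}
		case <-deadline:
			return errors.New("health checks timed out")
		}
	}
	return nil
}
```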