deis / builder

Git server and application builder for Deis Workflow
https://deis.com
MIT License

fix(*): add readiness & liveness probes #149

Closed arschles closed 8 years ago

arschles commented 8 years ago

Fixes #92 Fixes #89 Ref deis/deis#4809 When this is done, merge deis/charts#93

PR Overview

krancour commented 8 years ago

Is this still WIP? The WIP in the title and the "awaiting review" tag are giving me mixed signals.

arschles commented 8 years ago

@krancour sorry about that. Yes, still in progress. I changed the label to indicate so.

arschles commented 8 years ago

Tested myself. To test it yourself, spin up a deis-dev cluster and run `kd delete rc deis-builder && kd create -f manifests/deis-builder-rc.yaml && kd get pod -w`. Once you see the pod come up, wait for it to go into `Running` with `1/1` containers ready.

arschles commented 8 years ago

@krancour this is now ready for review

smothiki commented 8 years ago

Why not modify fetcher to serve the health checks? Do we need another HTTP server to handle them? Fetcher already has a health check handler.

smothiki commented 8 years ago

Anyway, we are thinking of moving away from fetcher, so please ignore my previous comment.

smothiki commented 8 years ago

If the k8s API is not accessible, how is that the builder's fault? It also doesn't make sense to me to check whether the API is accessible, since the builder won't start in the first place if it isn't. I do agree with checking whether S3 storage is accessible.

This is overall amazing code, but I feel the design is more complex than it needs to be. If we are standing up a separate health server, why do we need an entire class to atomically set a value? We could just write one more handler in the health server that sets the circuit state value; sshd/server.go would call that handler before this line https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73, and the health check handler would poll for the change in circuit state until the timeout is over.

smothiki commented 8 years ago

Also, liveness probes shouldn't be checking S3 API accessibility.

smothiki commented 8 years ago

The server already handles a ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping.

arschles commented 8 years ago

> If the k8s API is not accessible, how is that the builder's fault?

It's not the builder's fault, but the builder shouldn't be able to accept any request to do work when object storage, a dependency without which it can't do its job, is unavailable to it.

> If we are standing up a separate health server, why do we need an entire class to atomically set a value?

We don't necessarily need this Circuit type, but it's the simplest abstraction between the SSH server and the health check server that I could find. The basic requirement is that the SSH server can communicate its status to the health check server in a concurrency-safe manner.

I tried a few other implementations, including a sync.WaitGroup-based one and a channel-based one, but both ended up being more code with more complexity.
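For reference, here's a minimal sketch of what such a Circuit type can look like, assuming a simple open/closed state backed by sync/atomic (the names are illustrative and may not match the actual pkg/sshd code exactly):

```go
package sshd

import "sync/atomic"

// State is the position of the readiness circuit.
type State uint32

const (
	// OpenState means the SSH server is not (yet) ready to accept connections.
	OpenState State = iota
	// ClosedState means the SSH server is up and listening.
	ClosedState
)

// Circuit is a concurrency-safe flag: the SSH server flips it once its
// listener is up, and the health check server reads it on every probe.
type Circuit struct {
	state uint32
}

// NewCircuit returns a Circuit that starts in the open (not ready) state.
func NewCircuit() *Circuit {
	return &Circuit{state: uint32(OpenState)}
}

// Close marks the SSH server as ready.
func (c *Circuit) Close() {
	atomic.StoreUint32(&c.state, uint32(ClosedState))
}

// State returns the current state of the circuit.
func (c *Circuit) State() State {
	return State(atomic.LoadUint32(&c.state))
}
```

The SSH server only ever writes (Close) and the health check server only ever reads (State), so a single atomic word is enough; no mutex, WaitGroup, or channel plumbing is required.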

> We could just write one more handler in the health server that sets the circuit state value; sshd/server.go would call that handler before this line https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73

I'm not 100% sure I understand what you mean here, but the closure of the circuit happens right above that line, here, and that is the logic that effectively communicates to the health check server that the SSH server is ready.

IMO it doesn't make as much sense to call an HTTP endpoint as it does to make a function call from within the same process.

> the health check handler would poll for the change in circuit state until the timeout is over

The health check server essentially does poll the circuit state, because Kubernetes polls the health check endpoint. I opted not to implement any polling logic in the endpoint itself in favor of just reusing that Kubernetes behavior.
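For illustration, the readiness handler can be as small as the sketch below, which assumes the Circuit sketched earlier (the handler name and wiring are hypothetical, not necessarily what this PR does):

```go
package healthsrv

import "net/http"

// healthZHandler reports 200 when isReady returns true and 503 otherwise.
// With the Circuit sketch above, isReady would be something like:
//   func() bool { return circuit.State() == sshd.ClosedState }
// Kubernetes polls this endpoint on its own schedule, so the handler itself
// needs no polling or timeout logic.
func healthZHandler(isReady func() bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isReady() {
			http.Error(w, "SSH server not yet started", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```

Wiring it up would then be a single `mux.Handle("/healthz", healthZHandler(...))` call, with the probe in the RC manifest pointing at that path.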

> The server already handles a ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping

I think implementing the health check logic in the SSH ping handler is suboptimal. Here's my thinking:

arschles commented 8 years ago

@bacongobbler are you referring to the AbsPath("/healthz") bug?

bacongobbler commented 8 years ago

Sorry, that comment was regarding refactoring out pkg/log in favor of the standard log package, as in this commit. GitHub doesn't really show that I was commenting on that commit.

arschles commented 8 years ago

Ah, got it. I see the tiny text where it says you commented on the commit now.

Filed https://github.com/deis/pkg/issues/19 to track

aledbf commented 8 years ago

Code LGTM

arschles commented 8 years ago

Gonna hold off on merging this until the discussion at https://github.com/deis/builder/pull/149#issuecomment-183126063 is complete

technosophos commented 8 years ago

FWIW, here is my view on health checking logic:

Neither probe is about establishing "fault"; the purpose of the probes is to indicate that the system believes itself to be in a position to fulfill its service contract. That's it. If it can't reach a database, cache, API endpoint, etc., it should fail its check.

That said, I am a little concerned with probes being dependent on the state of Kubernetes itself. So the k8s API service availability is sort of a gray area. Originally, I was opposed to failing healthz based on k8s API server problems, but I've changed my mind. Here's the use case that caused me to change my mind:

In this case, builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.

helgi commented 8 years ago

A liveness probe should not fail if k8s is down or s3 is unreachable, and here is why:

A liveness probe will cause the pod to restart when those external components fail, which in turn means the user can't even do a git push; instead they get an unreachable hostname if there are no builders to service the request.

As a user (developer), I want a git push that fails gracefully if dependent services are down, with a good error message along the lines of "unable to process the git push right now, X and Y are down".

The builder is alive and can give good information back to the user; making the liveness probe depend on third-party things is bad UX, since the whole builder becomes unavailable at that point. I'm proposing we do the deeper health check during git push and in the readiness probe, but not in the liveness probe. They serve different purposes and have different informational needs for devs and operators.

I'd like to hear from the opposite side: how would we handle the scenario of the builder being down, tell developers (and operators) what is going on, and let them effectively debug things when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?

bacongobbler commented 8 years ago

To chime in my thoughts as well:

> In this case, builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.

This is great for an administrator, but not for a user. In the current implementation the user has no insight into why the builder is down. If we were to go down this route, a status page (similar to http://status.docker.com/) would be invaluable for users to see which systems are down, bridging the communication gap between users and administrators.

> I'd like to hear from the opposite side: how would we handle the scenario of the builder being down, tell developers (and operators) what is going on, and let them effectively debug things when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?

^^

Another solution, proposed offline, was to gracefully fail a git push but internally and continuously perform health checks on upstream services like k8s/minio, logging when they report as down. That way we can pick that up in https://github.com/deis/monitor with alerting (not implemented in telegraf AFAIK, but not something we couldn't do ourselves).
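As a rough sketch of that idea (the names here are hypothetical; nothing like this exists in the codebase yet), the builder would keep serving pushes while a background loop checks the upstreams and logs failures for monitor to alert on:

```go
package healthsrv

import (
	"context"
	"log"
	"time"
)

// watchUpstreams periodically runs each named check (object storage, k8s API,
// and so on) and logs failures so an external system such as deis/monitor can
// alert on them. It never kills the process, so a git push can still fail
// gracefully with a useful error message instead of hitting a dead builder.
func watchUpstreams(ctx context.Context, interval time.Duration, checks map[string]func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for name, check := range checks {
				if err := check(); err != nil {
					log.Printf("upstream %q is unhealthy: %v", name, err)
				}
			}
		}
	}
}
```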

arschles commented 8 years ago

Ok, after extensive internal discussion, consensus is as follows: