deis / builder

Git server and application builder for Deis Workflow
https://deis.com
MIT License

fix(*): add readiness & liveness probes #149

Closed arschles closed 8 years ago

arschles commented 8 years ago

Fixes #92 Fixes #89 Ref deis/deis#4809 When this is done, merge deis/charts#93

PR Overview

krancour commented 8 years ago

Is this still WIP? The WIP in the title and the "awaiting review" tag are giving me mixed signals.

arschles commented 8 years ago

@krancour sorry about that. Yes, still in progress. I changed the label to indicate so.

arschles commented 8 years ago

Tested myself. To test it yourself, spin up a deis-dev cluster and run `kd delete rc deis-builder && kd create -f manifests/deis-builder-rc.yaml && kd get pod -w`. Once you see the pod come up, wait for it to go into `Running` with `1/1` containers ready.

arschles commented 8 years ago

@krancour this is now ready for review

smothiki commented 8 years ago

Why not modify fetcher to serve the health checks? Do we need another HTTP server to handle them? Fetcher already has a health check handler.

smothiki commented 8 years ago

Anyway, we are thinking of moving away from fetcher, so please ignore my previous comment.

smothiki commented 8 years ago

If the k8s API is not accessible, how is that the builder's fault? It also doesn't make sense to me to check whether the API is accessible, since the builder won't start in the first place if it isn't. I do agree with checking whether S3 storage is accessible.

This is overall amazing code, but I feel the design is more complex than it needs to be. If we are standing up a separate health server, why do we need an entire class to atomically set a value? We could just write one more handler in the health server that sets the circuit state value; sshd/server.go would call that handler before this line https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73, and the health check handler would poll for the change in circuit state until the timeout is over.

smothiki commented 8 years ago

Also, liveness probes shouldn't be checking S3 API accessibility.

smothiki commented 8 years ago

The server already handles a ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping.

arschles commented 8 years ago

> If the k8s API is not accessible, how is that the builder's fault?

It's not the builder's fault, but the builder shouldn't be able to accept any request to do work when object storage, a dependency without which it can't do its job, is unavailable to it.

> If we are standing up a separate health server, why do we need an entire class to atomically set a value?

We don't necessarily need this Circuit type, but it's the simplest abstraction between the SSH server and the health check server that I could find. The basic requirement is that the SSH server can communicate its status to the health check server in a concurrency-safe manner.

I tried a few other implementations, including a sync.WaitGroup-based one and a channel-based one, but both ended up being more code with more complexity.
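For reference, here's a minimal sketch of what such a Circuit type can look like, assuming a simple open/closed state backed by sync/atomic (the names are illustrative and may not match the actual pkg/sshd code exactly):

```go
package sshd

import "sync/atomic"

// State is the position of the readiness circuit.
type State uint32

const (
	// OpenState means the SSH server is not (yet) ready to accept connections.
	OpenState State = iota
	// ClosedState means the SSH server is up and listening.
	ClosedState
)

// Circuit is a concurrency-safe flag: the SSH server flips it once its
// listener is up, and the health check server reads it on every probe.
type Circuit struct {
	state uint32
}

// NewCircuit returns a Circuit that starts in the open (not ready) state.
func NewCircuit() *Circuit {
	return &Circuit{state: uint32(OpenState)}
}

// Close marks the SSH server as ready.
func (c *Circuit) Close() {
	atomic.StoreUint32(&c.state, uint32(ClosedState))
}

// State returns the current state of the circuit.
func (c *Circuit) State() State {
	return State(atomic.LoadUint32(&c.state))
}
```

The SSH server only ever writes (Close) and the health check server only ever reads (State), so a single atomic word is enough; no mutex, WaitGroup, or channel plumbing is required.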

> We could just write one more handler in the health server that sets the circuit state value; sshd/server.go would call that handler before this line https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L73

I'm not 100% sure I understand what you mean here, but the closure of the circuit happens right above that line, here, and that is the logic that effectively communicates to the health check server that the SSH server is ready.

IMO it doesn't make as much sense to call an HTTP endpoint as it does to make a function call from within the same process.

> the health check handler would poll for the change in circuit state until the timeout is over

The health check server essentially does poll the circuit state, because Kubernetes polls the health check endpoint. I opted not to implement any polling logic in the endpoint itself in favor of just reusing that Kubernetes behavior.
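For illustration, the readiness handler can be as small as the sketch below, which assumes the Circuit sketched earlier (the handler name and wiring are hypothetical, not necessarily what this PR does):

```go
package healthsrv

import "net/http"

// healthZHandler reports 200 when isReady returns true and 503 otherwise.
// With the Circuit sketch above, isReady would be something like:
//   func() bool { return circuit.State() == sshd.ClosedState }
// Kubernetes polls this endpoint on its own schedule, so the handler itself
// needs no polling or timeout logic.
func healthZHandler(isReady func() bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isReady() {
			http.Error(w, "SSH server not yet started", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```

Wiring it up would then be a single `mux.Handle("/healthz", healthZHandler(...))` call, with the probe in the RC manifest pointing at that path.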

> The server already handles a ping request, but the ping method wasn't implemented: https://github.com/deis/builder/blob/master/pkg/sshd/server.go#L202. I think the liveness probe should call ping

I think implementing the health check logic in the SSH ping handler is suboptimal. Here's my thinking:

arschles commented 8 years ago

@bacongobbler are you referring to the AbsPath("/healthz") bug?

bacongobbler commented 8 years ago

Sorry, that comment was regarding refactoring out pkg/log in favor of the standard log package, as in this commit. GitHub doesn't really show that I was commenting on that commit.

arschles commented 8 years ago

Ah, got it. I see the tiny text where it says you commented on the commit now.

Filed https://github.com/deis/pkg/issues/19 to track

aledbf commented 8 years ago

Code LGTM

arschles commented 8 years ago

Gonna hold off on merging this until the discussion at https://github.com/deis/builder/pull/149#issuecomment-183126063 is complete

technosophos commented 8 years ago

FWIW, here is my view on health checking logic:

Neither probe is about establishing "fault"; the purpose of the probes is to indicate that the system believes itself to be in a position to fulfill its service contract. That's it. If it can't reach a database, cache, API endpoint, etc., it should fail its check.

That said, I am a little concerned with probes being dependent on the state of Kubernetes itself. So the k8s API service availability is sort of a gray area. Originally, I was opposed to failing healthz based on k8s API server problems, but I've changed my mind. Here's the use case that caused me to change my mind:

In this case, builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.

helgi commented 8 years ago

A liveness probe should not fail if k8s is down or s3 is unreachable, and here is why:

A liveness probe will cause the pod to restart when those external components fail, which in turn means the user can't even do a git push; instead they get an unreachable hostname if there are no builders to service the request.

As a user (developer), I want a git push that fails gracefully if dependent services are down, with a good error message along the lines of "unable to process the git push right now, X and Y are down".

The builder is alive and can give good information back to the user; making the liveness probe depend on third-party things is bad UX, since the whole builder becomes unavailable at that point. I'm proposing we do the deeper health check during git push and in the readiness probe, but not in the liveness probe. They serve different purposes and have different informational needs for devs and operators.

I'd like to hear from the opposite side: how would we handle the scenario of the builder being down, tell developers (and operators) what is going on, and let them effectively debug things when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?

bacongobbler commented 8 years ago

To chime in my thoughts as well:

> In this case, builder SHOULD fail because it can no longer fulfill its service contract. If a policy was changed, k8s itself will restart the pod and it will come up successfully. Otherwise, it will go into crash backoff, which is exactly what ought to happen.

This is great for an administrator, but not for a user. In the current implementation the user has no insight into why the builder is down. If we were to go down this route, a status page (similar to http://status.docker.com/) would be invaluable for users to see which systems are down, bridging the communication gap between users and administrators.

> I'd like to hear from the opposite side: how would we handle the scenario of the builder being down, tell developers (and operators) what is going on, and let them effectively debug things when the builder is left in a "running but not ready" state until s3 / k8s / etc. comes back up?

^^

Another solution, proposed offline, was to gracefully fail a git push but internally and continuously perform health checks on upstream services like k8s/minio, logging when they report as down. That way we can pick that up in https://github.com/deis/monitor with alerting (not implemented in telegraf AFAIK, but not something we couldn't do ourselves).
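As a rough sketch of that idea (the names here are hypothetical; nothing like this exists in the codebase yet), the builder would keep serving pushes while a background loop checks the upstreams and logs failures for monitor to alert on:

```go
package healthsrv

import (
	"context"
	"log"
	"time"
)

// watchUpstreams periodically runs each named check (object storage, k8s API,
// and so on) and logs failures so an external system such as deis/monitor can
// alert on them. It never kills the process, so a git push can still fail
// gracefully with a useful error message instead of hitting a dead builder.
func watchUpstreams(ctx context.Context, interval time.Duration, checks map[string]func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for name, check := range checks {
				if err := check(); err != nil {
					log.Printf("upstream %q is unhealthy: %v", name, err)
				}
			}
		}
	}
}
```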

arschles commented 8 years ago

Ok, after extensive internal discussion, consensus is as follows: