Add a /cluster_health endpoint

bloomberg / goldpinger

Debugging tool for Kubernetes which tests and displays connectivity between nodes in the cluster.

Apache License 2.0

Add a /cluster_health endpoint #101

Closed seeker89 closed 3 years ago

seeker89 commented 3 years ago

This adds a new endpoint that returns 200 OK if:

all peers report OK on /check call
all peers called the same set of peers

It also returns some basic details that show where to start looking when OK is false.

The actual implementation is in ./pkg/goldpinger/client.go; the rest is due to updated swagger codegen.
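For readers skimming the thread, the two conditions above can be sketched roughly as follows. This is not the code from ./pkg/goldpinger/client.go: the `peerReport` type and the `clusterHealthy` and `peerSetKey` functions are hypothetical names invented for illustration, and the sketch assumes each peer's /check result can be reduced to an OK flag plus the list of peers it called.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// peerReport is a hypothetical, simplified view of one peer's /check result:
// whether it reported OK and which peers it called.
type peerReport struct {
	OK     bool
	Called []string
}

// peerSetKey returns an order-independent key for a set of peer names.
func peerSetKey(peers []string) string {
	sorted := append([]string(nil), peers...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// clusterHealthy applies the two conditions from the description:
// every peer reports OK, and every peer called the same set of peers.
// It also returns the names of the peers that did not report OK.
func clusterHealthy(reports map[string]peerReport) (bool, []string) {
	ok := true // start healthy and flip to false on the first problem
	var unhealthy []string

	// Iterate in sorted order so the result is deterministic.
	names := make([]string, 0, len(reports))
	for name := range reports {
		names = append(names, name)
	}
	sort.Strings(names)

	wantKey, first := "", true
	for _, name := range names {
		r := reports[name]
		if !r.OK {
			ok = false
			unhealthy = append(unhealthy, name)
		}
		key := peerSetKey(r.Called)
		if first {
			wantKey, first = key, false
		} else if key != wantKey {
			ok = false // this peer called a different set of peers
		}
	}
	return ok, unhealthy
}

func main() {
	reports := map[string]peerReport{
		"node-a": {OK: true, Called: []string{"node-a", "node-b", "node-c"}},
		"node-b": {OK: true, Called: []string{"node-a", "node-b", "node-c"}},
		"node-c": {OK: false, Called: []string{"node-a", "node-b"}},
	}
	ok, unhealthy := clusterHealthy(reports)
	fmt.Println(ok, unhealthy)
}
```

Returning the list of unhealthy peers alongside the flag is one way to give the caller somewhere to start when OK is false.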

skamboj commented 3 years ago

Other than some minor nitpicking, this looks good - looking forward to taking it for a spin now :)

skamboj commented 3 years ago

So I tried to take this for a spin and got an unhealthy cluster with no clue about what is broken (goldpinger and the cluster seem healthy otherwise):

$ http_proxy= curl -v http://goldpinger.sk1../cluster_health
*   Trying 10.x.x.x...
* TCP_NODELAY set
* Connected to goldpinger.sk1... (10.x.x.x) port 80 (#0)
> GET /cluster_health HTTP/1.1
> Host: goldpinger.sk1...
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 418 I'm a teapot
< Server: nginx/1.17.10
< Date: Tue, 16 Mar 2021 17:13:08 GMT
< Content-Type: application/json
< Content-Length: 227
< Connection: keep-alive
<
{"OK":false,"duration-ns":22428592,"generated-at":"2021-03-16T17:13:08.479Z","nodesHealthy":["10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x"],"nodesTotal":6,"nodesUnhealthy":null}
* Connection #0 to host goldpinger.sk1... left intact

(Hosts and IP addresses masked)
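To make the inconsistency in that response easier to spot programmatically, here is a minimal sketch that decodes the body and flags the case where OK is false but no node is listed as unhealthy. The JSON keys are taken from the response above; the Go type `clusterHealthResponse` and the function `unexplainedFailure` are hypothetical names for this example, not goldpinger's generated models.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterHealthResponse mirrors the JSON keys visible in the response above.
// The Go type and field names are assumptions made for this sketch.
type clusterHealthResponse struct {
	OK             bool     `json:"OK"`
	DurationNs     int64    `json:"duration-ns"`
	GeneratedAt    string   `json:"generated-at"`
	NodesHealthy   []string `json:"nodesHealthy"`
	NodesTotal     int      `json:"nodesTotal"`
	NodesUnhealthy []string `json:"nodesUnhealthy"`
}

// unexplainedFailure reports whether the body says OK is false while
// listing every node as healthy and none as unhealthy.
func unexplainedFailure(body []byte) (bool, error) {
	var r clusterHealthResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, err
	}
	allHealthy := len(r.NodesHealthy) == r.NodesTotal
	return !r.OK && len(r.NodesUnhealthy) == 0 && allHealthy, nil
}

func main() {
	// Same shape as the masked response in this comment.
	body := []byte(`{"OK":false,"duration-ns":22428592,` +
		`"generated-at":"2021-03-16T17:13:08.479Z",` +
		`"nodesHealthy":["10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x","10.x.x.x"],` +
		`"nodesTotal":6,"nodesUnhealthy":null}`)

	odd, err := unexplainedFailure(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("unexplained failure:", odd)
}
```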

seeker89 commented 3 years ago

Thanks. I forgot to set the default to true 🤦

I also simplified it a little.

erhudy commented 3 years ago

One thought before this is merged: can we also expose this as a Prometheus metric? This would make it really easy to hook up a simple alert where Goldpinger is telling us something is amiss with a cluster, and then we can jump over to the cluster in question and do a more in-depth analysis.
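As a rough illustration of the idea, a 0/1 gauge in the Prometheus text exposition format would be enough to alert on. The sketch below uses only the Go standard library rather than the Prometheus client library, and the metric name `goldpinger_cluster_health_example`, along with `renderHealthMetric` and `metricsHandler`, are hypothetical names for this example, not what goldpinger exposes.

```go
package main

import (
	"fmt"
	"net/http"
)

// metricName is a made-up name for this sketch.
const metricName = "goldpinger_cluster_health_example"

// renderHealthMetric renders a single gauge in the Prometheus text
// exposition format: 1 when the cluster check is OK, 0 otherwise.
func renderHealthMetric(ok bool) string {
	value := 0
	if ok {
		value = 1
	}
	return fmt.Sprintf(
		"# HELP %s 1 if the last cluster health check was OK, 0 otherwise.\n"+
			"# TYPE %s gauge\n"+
			"%s %d\n",
		metricName, metricName, metricName, value)
}

// metricsHandler serves the gauge, calling check on every scrape.
// check stands in for whatever produces the cluster health result.
func metricsHandler(check func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderHealthMetric(check()))
	}
}

func main() {
	// Print what a scrape would see for an unhealthy cluster.
	fmt.Print(renderHealthMetric(false))
}
```

An alert could then fire whenever the gauge stays at 0 for some period, which matches the "something is amiss, go look at that cluster" workflow described here.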

seeker89 commented 3 years ago

@skamboj thanks, sorry I was trying to wing it from the UI :)

@erhudy any other wishes before this goes in?

erhudy commented 3 years ago

ENGAGE