alexliesenfeld / health

A simple and flexible health check library for Go.
MIT License
774 stars 38 forks source link

Add prometheus metric #57

Open tliefheid opened 1 year ago

alexliesenfeld commented 1 year ago

Thanks! I'll look into it soon!

tliefheid commented 1 year ago

not sure how you imagine this to work. Cause who will register the metrics (gauges/counters) cause that is how prometheus works. if you expect it so send back some details, then it is still up to the user to set the gauges and counters. imho prometheus is one of the most commonly used metrics tools out there. i'm happy to update this, don't get me wrong. i'm not that familiar as you with the whole concept of the interfaces you have. just getting some more background info and context from you :)

akhy commented 1 year ago

Agree, adding a bunch of extra deps to go.mod is not a good idea for this small lib. I also like @alexliesenfeld 's idea to make the metrics lib implement ResultWriter interface in a separate project.

About the metrics format, I don't feel mapping discrete up/down/unknown to 0/1/2 value is right. I'm still not sure but I think I'm going to make my own format which is more or less like this:

# if service1 is up
app_name_health{name="service1", status="up"} 1
app_name_health{name="service1", status="down"} 0
app_name_health{name="service1", status="unknown"} 0

# if service1 is down
app_name_health{name="service1", status="up"} 0
app_name_health{name="service1", status="down"} 1
app_name_health{name="service1", status="unknown"} 0

# if service1 is unknown
app_name_health{name="service1", status="up"} 0
app_name_health{name="service1", status="down"} 0
app_name_health{name="service1", status="unknown"} 1

This way, we can create a fine grained and unambiguous prom query/aggregation.

tliefheid commented 1 year ago

a format like that will result in more timeseries overall (3 for each check), also you need to reset all the gauges before you update the correct one, it's a bit more complex to do.

i get that you don't want to inject into the lib itself, but i'm not sure how to do this

alexliesenfeld commented 1 year ago

not sure how you imagine this to work. Cause who will register the metrics (gauges/counters) cause that is how prometheus works. if you expect it so send back some details, then it is still up to the user to set the gauges and counters. imho prometheus is one of the most commonly used metrics tools out there. i'm happy to update this, don't get me wrong. i'm not that familiar as you with the whole concept of the interfaces you have. just getting some more background info and context from you :)

Sure, I can provide some more details. What I was referring to was that there is a ResultWriter interface that the library uses to write a check result into a specific format. For example, the default JSONResultWriter takes the current check status and transforms it into a JSON HTTP response.

https://github.com/alexliesenfeld/health/blob/09f6eabefbfdb6f4ee9dfaf4caf122720aa55da6/handler.go#L49-L64

The same way, we probably could have a PrometheusResultWriter that takes the overall check result and transforms it into a format that can be processed by Prometheus.

The user can then configure the library to use the PrometheusResultWriter rather than the default one by configuring it similar to this example:

https://github.com/alexliesenfeld/health/blob/09f6eabefbfdb6f4ee9dfaf4caf122720aa55da6/examples/showcase/main.go#L105-L109

@TomL-dev What do you think?

akhy commented 1 year ago

Just taken a look at WithResultWriter. Turns out it doesn't seem appropriate for my use case, because it's directly produce HTTP responses. I want to have it easily mixed along with metrics from another library.

So I created a lib myself: https://github.com/chickenzord/go-health-prometheus

Please let me know what you think, thanks

alexliesenfeld commented 1 year ago

Looks nice! As of now it seems to only expose the aggregated/overall system status. I wonder if it would help if health.CheckState would actually contain all the component check results as well. AFAICS, this way you could expose components individually as well.

akhy commented 1 year ago

@alexliesenfeld it's already showing per-component healths...

actually I haven't found a way to get the overall status 😅 for the aggregated/overall, instead of only up/down/unknown, I'd love to add one more degraded status if any non-critical component is down.. but I think that should be out of the scope of this lib because the real implementation can be different for each use cases 🤔

tliefheid commented 1 year ago

@alexliesenfeld it's already showing per-component healths...

actually I haven't found a way to get the overall status 😅 for the aggregated/overall, instead of only up/down/unknown, I'd love to add one more degraded status if any non-critical component is down.. but I think that should be out of the scope of this lib because the real implementation can be different for each use cases 🤔

you can solve this by creating prom queries and generate alerts if the result (count or something) drops below a x percentage

fr3fou commented 1 year ago

Any updates on this? :D One more thing I think would be useful – keeping track of how long each probe takes to respond