concourse / hush-house

Concourse k8s-based environment
https://hush-house.pivotal.io

metrics: capture error metrics from logs for `ci` environment #78

cirocosta closed this issue 4 years ago

cirocosta commented 4 years ago

Hey,

Since the early days of hush-house, we've been capturing worker error rates using Stackdriver's user-defined log-based metrics (see Overview of logs-based metrics).

That has served us very well: it puts zero load on our Prometheus server, since those metrics all come from Stackdriver, and it shifts our approach from searching for "when did errors start?" to a much more direct "just look at the dashboard".

Now that we've added `ci` to the hush-house GKE cluster, which is also hooked up with Stackdriver for logs, it'd be great to leverage the same capabilities there.

The only problem we face is that the existing metric has hush-house hardcoded in its log filter.

It might be possible to use a wildcard there (on the Stackdriver side) and perform the filtering at the client (Grafana), but I've never personally verified that.
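To make the two options concrete, here is a rough Python sketch of how the filter could be parameterized rather than hardcoded. It is only an illustration: the helper name `build_error_filter`, the choice of `resource.type="k8s_container"`, the `severity>=ERROR` clause, and the idea that each environment maps to a Kubernetes namespace are all assumptions, since the actual filter of the existing metric isn't shown here.

```python
from typing import Optional


def build_error_filter(namespace: Optional[str] = None) -> str:
    """Build a Stackdriver Logging filter string for error logs.

    Hypothetical helper: the clauses below are assumed for illustration
    and are not the filter used by the existing hush-house metric.
    """
    clauses = [
        'resource.type="k8s_container"',  # assumed resource type
        "severity>=ERROR",                # assumed definition of "error"
    ]
    if namespace is not None:
        # Per-environment variant: pin the metric to one namespace.
        clauses.append(f'resource.labels.namespace_name="{namespace}"')
    # With namespace=None the namespace clause is omitted, which is the
    # "wildcard" variant: every environment's errors are counted and any
    # narrowing would have to happen on the client side.
    return " AND ".join(clauses)


# One filter per environment, or a single one covering all of them.
per_env = {env: build_error_filter(env) for env in ("hush-house", "ci")}
all_envs = build_error_filter()
```

With the per-environment variant there would be one log-based metric per deployment; with the wildcard variant there would be a single metric, and the dashboard would need some way to tell the environments apart (for example a label on the metric), which is the part that remains unverified.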

Thanks!

cirocosta commented 4 years ago

This is actually done!

https://github.com/concourse/hush-house/pull/114