Open pathcl opened 4 months ago
Hi @pathcl, with https://github.com/cloudbase/garm/pull/217 I've also introduced metrics for the runner package (documentation: https://github.com/cloudbase/garm/blob/main/doc/config_metrics.md#runner-metrics).
We are already running a patched version of v0.1.4 in which we cherry-picked some of the changes we wanted on our side (#217 is among them). Feel free to build our patched garm version yourself and give it a try; all of the patches are already part of the main branch of garm itself.
Out of curiosity: do you want more metrics than these, or is this exactly what you are looking for?
PromQL query:
(
  sum by (operation, provider) (
    rate(
      garm_runner_errors_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
    )
  )
  or
  sum by (operation, provider) (
    garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"} * 0
  )
)
/
sum by (operation, provider) (
  rate(
    garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
  )
)
* 100
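For readers less familiar with the `or ... * 0` idiom, here is a rough Python sketch of the arithmetic the query above performs. It assumes the per-(operation, provider) rates have already been computed. The function name and the label values in the usage note are hypothetical and not taken from garm.

```python
def error_percentage(error_rates, operation_rates):
    """Error rate as a percentage of operation rate, per (operation, provider).

    Both arguments map a (operation, provider) tuple to a per-second rate.
    """
    result = {}
    for labels, op_rate in operation_rates.items():
        # Equivalent of `or (operations * 0)`: a label set that has no
        # error series falls back to an error rate of 0.
        err_rate = error_rates.get(labels, 0.0)
        if op_rate == 0:
            # Assumption of this sketch: skip idle label sets.
            # PromQL itself would return NaN for 0/0 here.
            continue
        result[labels] = err_rate / op_rate * 100
    return result
```

With `error_rates = {("CreateInstance", "lxd"): 0.5}` and `operation_rates = {("CreateInstance", "lxd"): 2.0, ("DeleteInstance", "lxd"): 1.0}`, the first label set yields 25.0 and the second yields 0.0. Without the fallback, the second label set would be missing from the result entirely, which is what the `or` branch in the query prevents.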
We'd like to understand more about runners and providers.
We have metrics for the GitHub API calls, but none for provider calls. Currently we cannot see whether a runner failed to reach the idle state and is being recreated over and over because of the bootstrap timeout.
Let's try to add metrics for provider calls.
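As an illustration of what such metrics could look like, here is a minimal Python sketch of the general pattern: every provider call increments an operations counter, and a failed call also increments an errors counter, both keyed by operation and provider. This is not garm's implementation (garm is written in Go); the names `operations_total`, `errors_total`, and `instrumented` are hypothetical, and plain in-memory counters stand in for a real metrics library.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for two labelled Prometheus counters.
operations_total = Counter()
errors_total = Counter()


def instrumented(operation, provider, call, *args, **kwargs):
    """Run a provider call and record it under (operation, provider)."""
    key = (operation, provider)
    operations_total[key] += 1
    try:
        return call(*args, **kwargs)
    except Exception:
        # Count the failure, then re-raise so the caller still sees it.
        errors_total[key] += 1
        raise
```

A runner that is recreated repeatedly after a bootstrap timeout would then show up as a steadily growing count for the create and delete operations of one provider, even if the individual calls succeed.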