cloudbase / garm

GitHub Actions Runner Manager
Apache License 2.0

add metrics for providers #273

Open pathcl opened 4 months ago

pathcl commented 4 months ago

We'd like to understand more about runners and providers.

We have metrics for the GH API calls, but no metrics for provider calls. We currently can't see whether a runner failed to reach the idle state and is being recreated over and over because of the bootstrap timeout.

Let's try to add metrics for provider calls.

bavarianbidi commented 4 months ago

Hi @pathcl, with https://github.com/cloudbase/garm/pull/217 I've also introduced metrics for the runner package (documentation: https://github.com/cloudbase/garm/blob/main/doc/config_metrics.md#runner-metrics).

We are already running a patched version of v0.1.4 with some cherry-picked changes we wanted on our side, and #217 is among them. Feel free to build our patched garm version yourself and give it a try; all of the patches are already part of the main branch of garm itself.

Out of curiosity: do you need more metrics than these, or is this exactly what you are looking for?


PromQL query:

    (
        sum by (operation, provider) (
          rate(
            garm_runner_errors_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
          )
        )
      or
        sum by (operation, provider) (
            garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}
          *
            0
        )
    )
  /
    sum by (operation, provider) (
      rate(
        garm_runner_operations_total{app_kubernetes_io_instance="garm-prod",app_kubernetes_io_name="garm"}[5m]
      )
    )
*
  100
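
For anyone unfamiliar with the `or (... * 0)` idiom in the query above, here is a rough Python sketch of the same arithmetic on plain dictionaries keyed by `(operation, provider)`. The function name, the operation and provider labels, and all rate values are made up for illustration; they are not taken from garm. The sketch also simplifies one thing: where PromQL would return `NaN` or `+Inf` for a zero operation rate, it skips the entry instead.

```python
def error_percentage(error_rates, operation_rates):
    """Error percentage per (operation, provider), mirroring the query's arithmetic.

    Both arguments are dicts mapping (operation, provider) to a per-second rate.
    """
    result = {}
    # Only label sets present on the operations side can produce a result,
    # just as the division in the query needs a matching series on both sides.
    for key, op_rate in operation_rates.items():
        if op_rate == 0:
            # Simplification: PromQL would yield NaN or +Inf here.
            continue
        # Equivalent of `or (operations * 0)`: if no error series exists
        # for this label set, treat the error rate as 0 instead of dropping it.
        err_rate = error_rates.get(key, 0.0)
        result[key] = err_rate / op_rate * 100
    return result


# Hypothetical sample data, not real garm output.
operation_rates = {
    ("CreateInstance", "example-provider"): 0.5,
    ("DeleteInstance", "example-provider"): 0.25,
}
error_rates = {
    ("CreateInstance", "example-provider"): 0.125,
}

percentages = error_percentage(error_rates, operation_rates)
for (operation, provider), pct in sorted(percentages.items()):
    print(f"{provider} {operation}: {pct:.1f}% errors")
```

Without the zero fallback, `DeleteInstance` would disappear from the result, because no error series exists for it yet; with the fallback it shows up as 0%, which keeps the panel from having gaps for healthy operations.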