actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

High cardinality of metrics #3153

Open zetaab opened 11 months ago

zetaab commented 11 months ago

Controller Version

0.7.0

Deployment Method

Helm

To Reproduce

  1. Enable metrics on the controller by using:

metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

  2. Run a build environment with enough traffic that it builds a lot from different repos and PRs.

Describe the bug

The listeners expose metrics, which is great. However, the cardinality of these metrics simply isn't going to work. There needs to be a way to disable high-cardinality labels on the metrics.

(Two screenshots from 2023-12-14 showing the new time series in Prometheus.)

In just one hour we got over 15k new time series in our Prometheus, which is going to explode if we keep these metrics enabled for even 12 hours.
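
For reference, a quick way to watch how many gha_* series a listener produces is a PromQL count over the metric names (illustrative queries, not part of the controller):

count({__name__=~"gha_.*"})
count by (__name__) ({__name__=~"gha_.*"})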

Describe the expected behavior

The expected behaviour is that high-cardinality labels are removed from the metrics. Also, histogram buckets should be configurable.

Let's take the worst metric, gha_job_execution_duration_seconds_bucket:

It has a job_workflow_ref label which is almost always unique, which means it constantly creates new time series in Prometheus, and that is really expensive. It's also worth asking whether job_name is even needed. I would like to be able to disable both of these labels. The number of default buckets is also going to explode our Prometheus.
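
To illustrate, this is roughly what the exposition looks like (only a few of the labels shown, values made up); because job_workflow_ref embeds the ref, every PR run produces a brand-new set of buckets:

gha_job_execution_duration_seconds_bucket{job_name="build",job_workflow_ref="org/repo/.github/workflows/ci.yaml@refs/pull/101/merge",runner_name="runner-abc12",le="60"} 1
gha_job_execution_duration_seconds_bucket{job_name="build",job_workflow_ref="org/repo/.github/workflows/ci.yaml@refs/pull/102/merge",runner_name="runner-def34",le="60"} 1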

Additional Context

replicaCount: 2

flags:
  logLevel: "debug"

metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"

Controller Logs

not relevant

Runner Pod Logs

not relevant
zetaab commented 11 months ago

I have now disabled the high-cardinality metrics with relabelings. However, I would still like to see job_name and job_workflow_ref removed from all of these metrics, or at least the possibility to configure that. They might work in an environment with around 10 builds per day, but we have more than a thousand per hour.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: listeners
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: gha-runner-scale-set
  podMetricsEndpoints:
  - port: metrics
    relabelings:
    - action: drop
      regex: 'gha_job_(execution|startup)_duration_seconds'
    - action: drop
      regex: 'gha_completed_jobs_total|gha_started_jobs_total'
thomassandslyst commented 6 months ago

The labels on both the gha_job_execution_duration_seconds and gha_job_startup_duration_seconds metrics mean that a new set of buckets is created for every run of every job, so every bucket will only ever contain a 0 or a 1. You cannot get meaningful information out of these metrics.

Prometheus is unable to aggregate metrics before applying rate() on them to produce histograms, so with the current layout of these metrics it is impossible to produce a histogram of startup or execution durations.

Information should be put into buckets based on job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
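
For example, this is the kind of query such a layout would support (a sketch, assuming the labels are reduced to organization, repository, and job_name):

histogram_quantile(0.95, sum by (le, organization, repository, job_name) (rate(gha_job_execution_duration_seconds_bucket[5m])))

With the current labels each underlying series only ever sees a single observation, so the inner rate() has nothing to work with.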

notz commented 5 months ago

Also, you're very likely to get scrape errors after many builds:

"http://10.2.0.23:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216

zetaab commented 5 months ago

@nikola-jokic is there any chance this could be picked up for development?

realmunk commented 4 months ago

For my use case, these are definitely important metrics to expose. I'd love to have them so I can monitor the workflows properly, but the current setup makes that close to impossible.

If we remove the highly unique labels and add a label for the name, that would probably solve everything.

th-le commented 3 months ago

Hi, any updates on this? We're experiencing the same issue and had to turn off the metrics altogether because our OpenTelemetry agents couldn't handle the data.
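
A possible middle ground, instead of turning metrics off entirely, is to drop only the offending series in the collector. A sketch of an OpenTelemetry Collector filter processor, assuming the classic include/exclude match syntax:

processors:
  filter/gha:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - gha_job_execution_duration_seconds.*
          - gha_job_startup_duration_seconds.*

The collector still has to receive the full scrape, so this helps with storage and downstream load, not with the scrape size itself.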

alecrajeev commented 1 month ago

This may have been because I am using grafana-alloy to do the scrapes, but in order to remove the metrics I had to use this relabeling config:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gha-runner-gha-rs-listeners
  labels:
    release: grafana-alloy
spec:
  selector:
    matchLabels:
      "app.kubernetes.io/component": runner-scale-set-listener
  podMetricsEndpoints:
    - port: metrics
      metricRelabelings:
        - action: drop
          regex: "gha_job_(execution|startup)_duration_seconds"
          sourceLabels: ["__name__"]
        - action: drop
          regex: "gha_completed_jobs_total|gha_started_jobs_total"
          sourceLabels: ["__name__"]

This uses metricRelabelings from: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmetricsendpoint

The positive is that these relabeling rules are applied before ingestion:

metricRelabelings configures the relabeling rules to apply to the samples before ingestion.

mikespharss commented 1 month ago

We're suffering from this as well, with Datadog scraping the metrics endpoint. The returned payload grows without bound the longer the listener runs, given how many jobs we run constantly. We had to significantly increase the memory allocated to our Datadog agent pods just so they had enough memory to load and process the returned metrics without being OOM-killed, and even then we still hit the agent's cap on ingested metrics within a few hours of the listener pod starting.

Even aside from the cardinality problem, I notice the memory usage of the listener pod increases steadily over time, possibly because it keeps track of all these metrics. I can see the memory usage of one listener pod grow from ~20 MB to over 700 MB over a couple of weeks.

mikespharss commented 1 month ago

Digging into this a bit more, it seems like this is related to the deliberate decision on the part of prometheus/client_golang to never expire old metrics. (I had hoped maybe there would be an optional flag to expire these similar to statsd's config.deleteCounters but that doesn't appear to be an option.)

That suggests the only options are 1) don't include high-cardinality labels, or 2) explicitly implement a way to Unregister these metrics once they are no longer being updated. I suppose there is also 3) constantly restarting the controller, but that's untenable as we'd have to restart it every few hours.
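
For option 2, a minimal sketch of what this could look like, assuming the listener keeps references to its *prometheus.HistogramVec collectors (the metric definition and interval below are illustrative, not the controller's actual code). MetricVec.Reset() is blunter than Unregister, but it drops all stale child series:

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Illustrative collector; the real listener defines its own metrics.
var jobExecutionDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "gha_job_execution_duration_seconds",
        Help:    "Time a job took to execute.",
        Buckets: prometheus.DefBuckets, // ideally configurable, as requested above
    },
    []string{"organization", "repository", "job_name", "job_workflow_ref"},
)

func init() {
    prometheus.MustRegister(jobExecutionDuration)
}

// expireMetrics periodically clears every child series of the vector so the
// exposition stops growing without bound. Scrapers see this the same way they
// see a process restart: counters and histograms start over from zero.
func expireMetrics(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        jobExecutionDuration.Reset()
    }
}

A real implementation would presumably gate this behind a flag and keep the interval well above the scrape interval so samples aren't lost between scrapes.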

yvesjans commented 1 month ago

We're facing the same issue using only ephemeral runners.

https://github.com/actions/actions-runner-controller/pull/3556 would fix the runner_name and runner_id cardinality, but it would be nice to be able to customize this, as we don't even care about job_workflow_ref or the branch; we just want to monitor the count/duration of repo+job over time :-)

I have modified the Datadog openmetrics annotation to ignore certain labels:

"exclude_labels": [
  "runner_id",
  "runner_name",
  "enterprise",
  "organization",
  "job_result",
  "job_workflow_ref",
  "event_name"
]

Unfortunately this doesn't fix the maximum metric scrape count of 2000, and the only way to work around that is to restart the listener pods...