zetaab opened this issue 11 months ago
I have now disabled the high-cardinality metrics with relabelings. However, I would like to see job_name and job_workflow_ref removed from all of these metrics, or at least the possibility to configure that. These metrics might work in an environment with something like 10 builds per day, but we have more than a thousand per hour.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: listeners
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: gha-runner-scale-set
  podMetricsEndpoints:
    - port: metrics
      relabelings:
        - action: drop
          regex: 'gha_job_(execution|startup)_duration_seconds'
        - action: drop
          regex: 'gha_completed_jobs_total|gha_started_jobs_total'
The labels on both gha_job_execution_duration_seconds and gha_job_startup_duration_seconds mean that a new set of buckets is created for every run of every job, so every bucket will only ever contain a 0 or a 1. You cannot get meaningful information out of these metrics.
Prometheus cannot aggregate metrics before applying rate() to them to produce histograms, so with the current layout of these metrics it is impossible to produce a histogram of startup or execution durations.
Durations should be put into buckets based on job_name, organisation, and repository only. Highly unique labels such as runner_id, runner_name, and job_workflow_ref should be removed.
It is also very likely that you will get scrape errors after many builds:
"http://10.2.0.23:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216
@nikola-jokic is there any chance that this could end up under development?
For my use case these are definitely important metrics to expose. I'd love to use them to monitor our workflows properly, but the current setup makes that close to impossible.
If we removed the highly unique labels and added a label for the name, that would probably solve everything.
Hi, any updates on this? We're experiencing the same issue and had to turn off the metrics altogether because our OpenTelemetry agents couldn't handle the data.
This may have been because I am using grafana-alloy to do the scrapes, but in order to remove the metrics I had to use this relabeling config:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gha-runner-gha-rs-listeners
  labels:
    release: grafana-alloy
spec:
  selector:
    matchLabels:
      "app.kubernetes.io/component": runner-scale-set-listener
  podMetricsEndpoints:
    - port: metrics
      metricRelabelings:
        - action: drop
          regex: "gha_job_(execution|startup)_duration_seconds"
          sourceLabels: ["__name__"]
        - action: drop
          regex: "gha_completed_jobs_total|gha_started_jobs_total"
          sourceLabels: ["__name__"]
This uses the metricRelabelings field from https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmetricsendpoint. The upside is that these relabeling rules happen before ingestion; per the docs, metricRelabelings "configures the relabeling rules to apply to the samples before ingestion."
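If you would rather keep the metrics and only shed the worst labels, the same stage also supports labeldrop. A sketch only, not tested against this chart:

      # alongside the drop rules in the PodMonitor above
      metricRelabelings:
        # Strip the highly unique labels before the samples are ingested.
        - action: labeldrop
          regex: "runner_id|runner_name|job_workflow_ref"

The caveat is that if two scraped series become identical once those labels are gone, Prometheus rejects them as duplicates, so this only works while the remaining labels still distinguish every series within a single scrape.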
We're suffering from this as well, with Datadog scraping the metrics endpoint. With the number of jobs we run constantly, the returned payload just grows indefinitely for as long as the listener runs. We had to significantly increase the memory allocated to our Datadog agent pods just so they had enough memory to load and process the returned metrics without being OOM-killed, and even then we still hit the cap on the number of metrics the agent will ingest within a few hours of the listener pod starting.
Even aside from the cardinality problem, I notice that the memory usage of the listener pod increases steadily over time, possibly because it keeps track of all these metrics. I have seen the memory usage of one listener pod grow from ~20 MB to over 700 MB over a couple of weeks.
Digging into this a bit more, it seems to be related to the deliberate decision on the part of prometheus/client_golang to never expire old metrics. (I had hoped there would be an optional flag to expire them, similar to statsd's config.deleteCounters, but that doesn't appear to be an option.)
This suggests the only options are 1) don't include the high-cardinality tags, or 2) explicitly implement a way to Unregister these metrics once they are no longer being updated. I suppose there is also 3) constantly restart the controller, but that's untenable as we'd have to restart it every few hours.
We're facing the same issue using only ephemeral runners.
https://github.com/actions/actions-runner-controller/pull/3556 would fix the runner_name and runner_id cardinality, but it would be nice to be able to customize this: we don't even care about job_workflow_ref or the branch, we just want to monitor the count/duration of repo+job over time :-)
I have modified the Datadog openmetrics annotation to ignore certain labels:
"exclude_labels": [
"runner_id",
"runner_name",
"enterprise",
"organization",
"job_result",
"job_workflow_ref",
"event_name"
]
Unfortunately this doesn't get around the maximum metric scrape count of 2000, and the only way to deal with that is to restart the listener pods...
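For context, here is roughly where that fragment lives. This is only a sketch: the container name ("listener"), port, and path are assumptions based on the listener settings quoted further down, and if I remember right the 2000 limit is the agent's default max_returned_metrics, which can be raised in the same instance, although that only delays the problem while the series set keeps growing:

metadata:
  annotations:
    # Datadog autodiscovery annotation for the listener container (name assumed).
    ad.datadoghq.com/listener.checks: |
      {
        "openmetrics": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "namespace": "gha",
              "metrics": [".*"],
              "exclude_labels": ["runner_id", "runner_name", "job_workflow_ref"],
              "max_returned_metrics": 20000
            }
          ]
        }
      }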
Checks
Controller Version
0.7.0
Deployment Method
Helm
Checks
To Reproduce
metrics:
  controllerManagerAddr: ":8080"
  listenerAddr: ":8080"
  listenerEndpoint: "/metrics"
Describe the bug
The listeners expose metrics, which is great. However, the cardinality of these metrics is simply not going to work. There needs to be a way to disable high-cardinality labels on the metrics.
In just one hour we got over 15k new time series in our Prometheus, which is going to explode if we keep these metrics enabled for even 12 hours.
Describe the expected behavior
The expected behaviour is that high-cardinality labels are REMOVED from the metrics. The buckets should also be configurable for histogram metrics.
Let's take the worst metric, gha_job_execution_duration_seconds_bucket. It has job_workflow_ref, which is almost always unique, so it constantly creates new time series in Prometheus, which is really expensive. It is also worth asking whether even job_name is needed; I would like to be able to disable both of these labels. The number of default buckets is also going to explode our Prometheus.
Additional Context
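To make the request concrete, here is a purely hypothetical values.yaml shape for the listener metrics; none of these keys exist today, the names and defaults are made up for illustration only:

listenerMetrics:
  # Hypothetical: labels to strip from all gha_* metrics before exposing them.
  excludeLabels:
    - job_workflow_ref
    - job_name
    - runner_id
    - runner_name
  # Hypothetical: override the default histogram buckets (seconds).
  histogramBuckets:
    gha_job_startup_duration_seconds: [1, 5, 15, 30, 60, 120, 300]
    gha_job_execution_duration_seconds: [60, 300, 900, 1800, 3600]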
Controller Logs
Runner Pod Logs