Open rajatvig opened 2 months ago
Hi @rajatvig, thanks for raising the issue. There was one change to the Prometheus backend (#288) which may explain the issue you're seeing - all metrics now have a cluster label, where previously they may not have. This could break queries if the label doesn't match or isn't ignored appropriately. Unfortunately the change was necessary to fix a panic.
Can you share the exact PromQL query?
I did see that PR merged but wasn't able to tie it back to the issue we are seeing. We are not yet running clustered agents.
The full PromQL we use is
100 * (sum(buildkite_queues_running_jobs_count{queue="queue"} + buildkite_queues_scheduled_jobs_count{queue="queue"}) or vector(0))
That gives us a count of running and scheduled jobs, which helps us determine how many agents we need to run. While the buildkite_queues_scheduled_jobs_count metric was fine, the buildkite_queues_running_jobs_count metric did not go to 0 when there were no builds running.
I see, interesting. The metric being stuck could be related to #296, which removed a well-intended but heavy-handed gauge reset. Is the metric stuck for all queues, or a particular queue? Is it stuck for queues that were deleted?
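To illustrate the kind of behaviour being discussed, here is a minimal Go sketch of how a labelled gauge can get "stuck". It is not the buildkite-agent-metrics code and does not claim to show what #296 or #305 changed. The `queueGauge` type and the `updateNaive` / `updatePruning` functions are hypothetical names invented for this example, and the sketch assumes the exporter only writes values for queues that appear in the latest poll.

```go
package main

import "fmt"

// queueGauge is a hypothetical stand-in for a labelled gauge:
// it holds one exported value per queue name.
type queueGauge struct {
	values map[string]float64
}

func newQueueGauge() *queueGauge {
	return &queueGauge{values: make(map[string]float64)}
}

// Set records the latest value for a queue.
func (g *queueGauge) Set(queue string, v float64) { g.values[queue] = v }

// Get returns the exported value and whether a series exists for the queue.
func (g *queueGauge) Get(queue string) (float64, bool) {
	v, ok := g.values[queue]
	return v, ok
}

// updateNaive only writes queues present in the latest poll, so a queue
// that stops being reported keeps its last value indefinitely.
func updateNaive(g *queueGauge, latest map[string]float64) {
	for q, v := range latest {
		g.Set(q, v)
	}
}

// updatePruning first deletes series for queues missing from the latest
// poll, then writes the reported values.
func updatePruning(g *queueGauge, latest map[string]float64) {
	for q := range g.values {
		if _, ok := latest[q]; !ok {
			delete(g.values, q)
		}
	}
	for q, v := range latest {
		g.Set(q, v)
	}
}

func main() {
	naive, pruning := newQueueGauge(), newQueueGauge()

	// Assumed sequence of polls: two running jobs, then one, then the
	// queue is no longer reported at all.
	polls := []map[string]float64{
		{"test": 2},
		{"test": 1},
		{},
	}
	for _, p := range polls {
		updateNaive(naive, p)
		updatePruning(pruning, p)
	}

	v, ok := naive.Get("test")
	fmt.Println("naive:", v, ok) // naive: 1 true
	v, ok = pruning.Get("test")
	fmt.Println("pruning:", v, ok) // pruning: 0 false
}
```

Under these assumptions the naive updater leaves the series at its last non-zero value, while the pruning updater makes it absent, in which case a query ending in `or vector(0)` would fall back to 0.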
It was stuck for queues that were deleted, i.e. no builds were running.
Sounds like #305 should fix it - I'll optimistically close this as fixed, please give v5.9.9 a try and feel free to re-open if you see the same issue.
I just gave 5.9.9 a try and am still seeing similar behaviour. I set up 2 jobs on the test queue and the metric buildkite_queues_running_jobs_count{queue="test"} went to 2 and then to 1 but did not go to 0 or absent like earlier.
Issue Details
After an upgrade from 5.9.4 to 5.9.8, we noticed that the metrics for running builds are not updated after the builds complete. This changes our scaling behaviour, as the metric calculation we use sums running and scheduled builds for a queue to decide if there are enough agents running. The metric that stays at the same value is buildkite_queues_running_jobs_count.
Setup
We are running unclustered agents and running the agent metrics binary to export metrics to Prometheus.