buildkite / buildkite-agent-metrics

A command-line tool (and Lambda) for collecting Buildkite agent metrics
MIT License

Regression in the 5.9.8 Release #304

Open rajatvig opened 2 months ago

rajatvig commented 2 months ago

Issue Details

After upgrading from 5.9.4 to 5.9.8, we noticed that the metrics for running builds are not updated after the builds complete. This changes our scaling behaviour, because the calculation we use sums the running and scheduled builds for a queue to decide whether enough agents are running. The metric that stays stuck at the same value is buildkite_queues_running_jobs_count.

Setup

We are running unclustered agents and running the agent metrics binary to export metrics to Prometheus.

DrJosh9000 commented 1 month ago

Hi @rajatvig, thanks for raising the issue. There was one change to the Prometheus backend (#288) which may explain the issue you're seeing: all metrics now have a cluster label, where previously they may not have had one. This could break queries if the label doesn't match or isn't ignored appropriately. Unfortunately the change was necessary to fix a panic.

Can you share the exact PromQL query?

rajatvig commented 1 month ago

I did see that PR merged but wasn't able to tie it back to the issue we are seeing. We are not yet running clustered agents.

The full PromQL query we use is:

100 * (sum(buildkite_queues_running_jobs_count{queue="queue"} + buildkite_queues_scheduled_jobs_count{queue="queue"}) or vector(0))

That gives us a count of running and scheduled jobs, which helps us determine how many agents we need to run. While the buildkite_queues_scheduled_jobs_count metric was fine, the buildkite_queues_running_jobs_count metric did not go to 0 when there were no builds running.

DrJosh9000 commented 1 month ago

I see, interesting. The metric being stuck could be related to #296, which removed a well-intended but heavy-handed gauge reset. Is the metric stuck for all queues, or a particular queue? Is it stuck for queues that were deleted?

rajatvig commented 1 month ago

It was stuck for queues that were deleted, i.e. no builds were running.

DrJosh9000 commented 1 month ago

Sounds like #305 should fix it. I'll optimistically close this as fixed; please give v5.9.9 a try and feel free to re-open if you see the same issue.

rajatvig commented 1 month ago

I just gave 5.9.9 a try and am still seeing similar behaviour. I set up 2 jobs on the test queue, and the metric buildkite_queues_running_jobs_count{queue="test"} went to 2 and then to 1, but did not go to 0 or become absent as it did before.