Managing a large agent estate

petemounce commented 5 years ago

Currently, https://buildkite.com/organizations/improbable/agents is what is available to manage agents.

I currently have low hundreds of agents, 8 per node, on a single platform. I'm soon going to introduce Windows, 1 per node, then macOS.

I tag agents with:

- environment (staging/production, maybe testing later)
- capable_of_building (this will have low dozens cardinality)
- distribution (currently ubuntu, will have windows and macOS variants)
- distribution_major_version (currently 16, will have 18, and some windows/macOS variants)
- gcp*
- os_family (currently debian, will have some windows/macOS variants)
- platform (currently linux, will have some windows/macOS variants)
- queue (a string that looks like v-0bd35eaddf77f8e2-------1534253424, which is a watermark of the source control revision the agent node image was built from; will have tens to hundreds of these)
- arbitrary software=version pairs, for things like docker, node, etc

I would find this page more valuable if

- I could query with more than string matching. I think at the moment it does a string-contains search?
  - I want to be able to combine tags in a query with AND/OR. AND is probably the most useful, since I think that's the only option when targeting steps to agents (?).
- I could see the results in something more useful than a long list that I then have to scroll or find-on-page through.
  - I had vaguely thought of a treeview, but I really didn't come up with something that's general enough for a product
  - maybe seeing a set of bar charts or histograms, which dynamically reduce as filters are applied?
  - what I usually want to find out is:
    - are there any agents that can satisfy the steps I see scheduled? If not, something's up with our autoscaler
    - which agents can satisfy what I want to write a step to do?
    - are agents systemically failing in europe-west1-d?
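To make the request above concrete, here is a minimal sketch of AND-style tag filtering plus a per-tag histogram that shrinks as filters are applied. It is only an illustration: the agent records, the agent names and the helper names (`filter_agents`, `tag_histogram`) are hypothetical, and the tag names are taken from the list earlier in this issue.

```python
from collections import Counter

# Hypothetical sample fleet: each agent carries a dict of tag name -> value.
AGENTS = [
    {"name": "agent-1", "tags": {"platform": "linux", "environment": "production",
                                 "distribution_major_version": "16"}},
    {"name": "agent-2", "tags": {"platform": "linux", "environment": "production",
                                 "distribution_major_version": "18"}},
    {"name": "agent-3", "tags": {"platform": "linux", "environment": "staging",
                                 "distribution_major_version": "16"}},
    {"name": "agent-4", "tags": {"platform": "windows", "environment": "production",
                                 "distribution_major_version": "10"}},
]


def matches_all(agent_tags, required):
    """AND semantics: every required tag must be present with the same value."""
    return all(agent_tags.get(key) == value for key, value in required.items())


def filter_agents(agents, required):
    """Return the agents whose tags satisfy all of the required key/value pairs."""
    return [agent for agent in agents if matches_all(agent["tags"], required)]


def tag_histogram(agents, tag):
    """Count agents per value of one tag (agents missing the tag count under None)."""
    return Counter(agent["tags"].get(tag) for agent in agents)


# Narrow the fleet first, then summarise what is left by another tag.
prod_linux = filter_agents(AGENTS, {"platform": "linux", "environment": "production"})
by_version = tag_histogram(prod_linux, "distribution_major_version")
```

An empty result from `filter_agents` for the tags a scheduled step targets would correspond to the "something's up with our autoscaler" case.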

lox commented 5 years ago

This is really interesting context @petemounce, thanks for this.

I wonder if this might be a good use-case for the new GraphQL console and saved queries? We expose a lot of this information programmatically and GraphQL would be a good way to query it. Would be keen to talk through how to make that more usable for you.
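As a rough sketch of what pulling agent tags over GraphQL could look like from a script: the snippet below only builds the HTTP request and parses `key=value` tag strings, it does not send anything. The endpoint URL and the field names in the query (`organization`, `agents`, `metaData`) are assumptions that should be checked against the current schema in the GraphQL console, and `build_agents_request` and `parse_tags` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the GraphQL endpoint; verify against the Buildkite API docs.
GRAPHQL_URL = "https://graphql.buildkite.com/v1"


def build_agents_request(org_slug, token, first=100):
    """Build (but do not send) a POST request listing agents and their tags.

    Assumption: the query shape and field names below match the schema.
    """
    query = (
        "query { organization(slug: %s) { agents(first: %d) "
        "{ edges { node { name metaData } } } } }"
        % (json.dumps(org_slug), int(first))  # json.dumps quotes the slug safely
    )
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(GRAPHQL_URL, data=body, headers=headers, method="POST")


def parse_tags(meta_data):
    """Turn a list of 'key=value' strings into a dict, skipping malformed items."""
    return dict(item.split("=", 1) for item in meta_data if "=" in item)
```

Sending the request with `urllib.request.urlopen` and feeding each agent's tag list through `parse_tags` would give dictionaries that can be filtered or counted locally.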

At present the Agent UI starts to decline in usefulness at around the 100+ agent point. @ticky made some solid progress on the last round of UX improvements on that page, so she might have some ideas about next steps too.

petemounce commented 5 years ago

That sounds possible but also quite low-level. I think I'm after something that allows me to visualise the characteristics of my agent fleet.

petemounce commented 5 years ago

One thing that just occurred to me is that just as I want to find and stomp (ok, "improve") flaky tests, I want to find and stomp flaky agents (which, hopefully for me with isolated queues per agent image, means queue, not individual agents).

Making it easier to see the failure modes (which steps of which pipelines) grouped by agent and agent-tags would make that significantly easier.
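A minimal sketch of that grouping, assuming job results are available as simple records. The record shape, the sample values and the helper names (`failure_rates`, `failing_steps`) are all hypothetical and only illustrate grouping by an agent tag such as queue.

```python
from collections import defaultdict

# Hypothetical job records: one per finished job, tagged with the agent's queue.
JOBS = [
    {"pipeline": "web", "step": "test", "agent": "agent-1", "queue": "v-aaa", "passed": False},
    {"pipeline": "web", "step": "build", "agent": "agent-1", "queue": "v-aaa", "passed": True},
    {"pipeline": "web", "step": "test", "agent": "agent-2", "queue": "v-bbb", "passed": True},
    {"pipeline": "api", "step": "lint", "agent": "agent-3", "queue": "v-bbb", "passed": True},
]


def failure_rates(jobs, key):
    """Fraction of failed jobs per value of `key` (for example 'queue' or 'agent')."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for job in jobs:
        totals[job[key]] += 1
        if not job["passed"]:
            failures[job[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


def failing_steps(jobs, key):
    """Which (pipeline, step) pairs failed, grouped by value of `key`."""
    grouped = defaultdict(set)
    for job in jobs:
        if not job["passed"]:
            grouped[job[key]].add((job["pipeline"], job["step"]))
    return dict(grouped)


rates_by_queue = failure_rates(JOBS, "queue")
steps_by_queue = failing_steps(JOBS, "queue")
```

Grouping by `"agent"` instead of `"queue"` uses the same functions, which makes it easy to check whether a problem belongs to one machine or to a whole agent image.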

buildkite / feedback

Managing a large agent estate #447