buildkite / feedback

Got feedback? Please let us know!
https://buildkite.com

Feature request: Build agent telemetry #331

Open gh2k opened 6 years ago

gh2k commented 6 years ago

I'd like to know the load on my build agent so that I can see how hard it is working on a build, and therefore if it has enough resources. I can do this by logging in to the build agent, but I prefer to keep them as disposable as possible. It would be helpful to see load average, CPU and/or memory usage on the agent on its status page.

toolmantim commented 6 years ago

Thanks @gh2k for taking the time to submit this. What platform/setup are you using for your agents?

We've investigated adding this before, but we've chosen to keep it out of the agent itself to help ensure that the agent stays cross-platform and up to date, and because a lot of teams have their own preferred tools for system monitoring. If you use the Elastic CI Stack, for example, you'll get all this in CloudWatch.

(related: #233)

gh2k commented 6 years ago

I'm using a hodgepodge of Windows and Linux agents running in-house on some VMware boxes where we have spare capacity. Quite a few are on free ESXi installs, so we don't have them hooked up to a vCenter. As such, it takes some poking around to find the agent that's running the build.

CPU%, mem-free and network-usage-in-mbps would be ideal metrics to have, and they ought to be available on all platforms. Automatically displaying the internal IPs would help too, to save my init scripts from having to fish them out and add them to the metadata when starting the agent. All of this would help me find problems with the agent I'm interested in without having to hunt around quite so much.
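For anyone wanting the init-script workaround described above, here is a minimal sketch of how such a script might look up the internal IP and format it as key=value metadata. The function names (`internal_ip`, `agent_tags`) and the probe address are hypothetical choices for illustration, not part of the Buildkite agent.

```python
import socket


def internal_ip(probe_host="10.255.255.255", probe_port=1):
    """Best-effort guess at this machine's primary internal IPv4 address.

    The probe address is an arbitrary assumption; it does not need to be
    reachable.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only asks the OS
        # which local interface it would route through.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (e.g. an isolated host): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


def agent_tags(extra=None):
    """Build a comma-separated key=value string describing this host."""
    tags = {"hostname": socket.gethostname(), "ip": internal_ip()}
    tags.update(extra or {})
    return ",".join(f"{key}={value}" for key, value in sorted(tags.items()))


if __name__ == "__main__":
    print(agent_tags())
```

The printed string could then be handed to the agent's metadata/tags option at startup. The exact option name depends on the agent version, so check the documentation for the version you run.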

petemounce commented 6 years ago

I used a handy Ansible module to install the Prometheus node exporter; this then gives me system- and process-level metrics with some limited context (e.g. process name, disk ID, things like that) that I can then query and alert on.

That works on Linux and macOS. There's a Windows one too.
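As a rough illustration of what the exporter output looks like to a script, the sketch below parses unlabelled samples from the Prometheus text exposition format. It assumes the exporter is listening on its usual default of port 9100 at `/metrics`, and that the Linux metric names `node_load1` and `node_memory_MemAvailable_bytes` are present; names differ between platforms and exporter versions. `parse_metrics` and `fetch_metrics` are hypothetical helpers, and in practice you would normally query Prometheus itself rather than scrape by hand.

```python
import urllib.request

SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemAvailable_bytes 2.147483648e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""


def parse_metrics(text):
    """Return {name: value} for unlabelled samples in exposition text.

    Labelled samples (those with {...}) are skipped to keep the sketch
    short; a real client library handles them properly.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled samples.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # not a numeric sample; ignore it
    return samples


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Scrape an exporter endpoint (the URL is an assumed default)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print("load1:", metrics["node_load1"])
    print("mem available (bytes):", metrics["node_memory_MemAvailable_bytes"])
```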

However, I'd love to see job-level and agent-level statistics, especially since we run more than one agent on each of our nodes, so having that is essential for graceful autoscaling (i.e. something that doesn't brutalise in-flight jobs).

I guess what I'm saying is that I don't mind that Buildkite doesn't have the node-level stuff, since I can get it easily via an existing Prometheus setup. Personally, I would prefer that Buildkite concentrate on docs illustrating how one could achieve insight into and observability of the agents, rather than provide a monitoring system (by which I mean metrics storage, charting, querying) as part of the package.