buildkite / agent

The Buildkite Agent is an open-source toolkit written in Go for securely running build jobs on any device or network
https://buildkite.com/
MIT License

Can we use datadog to track the performance of specific jobs over time? #869

Closed: yob closed this issue 3 years ago

yob commented 5 years ago

We've just started exploring the datadog integration added in #832.

It was super easy to start getting the data to appear. Here are some early charts for buildkite.jobs.duration.success, broken out by pipeline:

[Screenshot: buildkite.jobs.duration.success charts, broken out by pipeline (2018-12-11)]

The first use case I thought I'd try was "how will the cucumber job on pipeline x change over time", and I was a bit sad when I discovered the job name isn't set as a tag, so I can't break out a chart at that level (current tags are agent_name, org, pipeline, branch, exit_code).

There's probably a good reason for that (maybe avoiding too many custom tags to keep the cost down?), but I thought I'd flag it as something we'd find super useful.
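For illustration only (this is not the agent's code), here's a minimal sketch of what the requested breakdown could look like if the agent added a step/job-name tag, using the datadog-go DogStatsD client. The "step" tag and the sample tag values are assumptions; the metric name is the one from the charts above.

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	// Connect to a local DogStatsD agent on its default port.
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The agent's existing metric, plus a hypothetical "step" tag so charts
	// could be broken out per step (the agent does not emit this tag today).
	tags := []string{
		"pipeline:x",   // placeholder values
		"branch:master",
		"exit_code:0",
		"step:cucumber", // the tag requested in this issue; the name is an assumption
	}
	if err := client.Timing("buildkite.jobs.duration.success", 14*time.Minute, tags, 1); err != nil {
		log.Println("failed to send metric:", err)
	}
}
```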

lox commented 5 years ago

Thanks for trying out the new integration @yob! We're trying to keep the number of tags down, but tracking performance of specific steps is basically the primary thing we are hoping to achieve in the medium term.

The tricky bit is the relationship between steps and jobs, especially when parallelism is involved. For instance in our main pipeline we have an Rspec step with a parallelism of 40. We're trying to figure out how to represent these sorts of setups in metrics 🤔

yob commented 5 years ago

> tracking performance of specific steps is basically the primary thing we are hoping to achieve in the medium term.

That's good to hear!

My ideal would be to replicate something like the metrics we had at The Conversation.

We tracked the overall build time, which was helpful for keeping the total time a developer had to wait to a minimum. Presumably that'd be hard to replicate with agent-level metrics though.

We also tracked per-job metrics, to identify regressions that were specific to a job. We didn't have parallel jobs though, so I hadn't really considered the complexity they would add :thinking:

> The tricky bit is the relationship between steps and jobs, especially when parallelism is involved. For instance in our main pipeline we have an Rspec step with a parallelism of 40. We're trying to figure out how to represent these sorts of setups in metrics

I feel like it might be fine to track the 40 jobs as a single number (like your current metrics, with mean/median/95th percentile). It wouldn't help identify a single spec with a speed regression, but something like the postgres issue we mention in the TC blog post (linked above) should be pretty obvious in the mean or median over 40 jobs.

nickbrowne commented 5 years ago

We just ran into this issue as well; it makes the datadog metrics emitted by the buildkite agent hard to use in practice, at least with the way we've set up our pipelines.

For example, the pipeline for our largest app looks something like this:

Then, split the following steps across as many agents as are available

So we end up with graphs like this (this is two builds interleaved, making it extra confusing):

[Screenshot: per-step duration graphs with two builds interleaved (2019-07-16)]

By rolling up and smoothing that data you can get something somewhat reasonable, but that also makes it very hard to pinpoint problems, and it's not easy to set up monitors that look for outliers, trigger in a timely manner, and let us know that a particular machine or step is misbehaving.

I think ideally we also want to be able to differentiate each metric by which step it came from. We currently do that with some custom code that emits our own metrics to statsd, but we had hoped we could move to something simpler and more standard :slightly_smiling_face:
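As a rough sketch of that sort of custom code (not nickbrowne's actual implementation), a tiny helper could be run from a post-command hook to emit a per-step timing. The BUILDKITE_* variables are standard job environment variables; the metric name, the statsd address, and passing the duration in as an argument are assumptions. Note the parallel job index (BUILDKITE_PARALLEL_JOB) is deliberately not used as a tag, so parallel jobs of the same step roll up into one series, along the lines yob suggested above.

```go
// A hypothetical helper, run from a Buildkite post-command hook, e.g.:
//   go run emit_step_metric.go "$ELAPSED_SECONDS"
// How the duration is measured is left to the hook.
package main

import (
	"log"
	"os"
	"strconv"
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: emit_step_metric <duration-seconds>")
	}
	secs, err := strconv.ParseFloat(os.Args[1], 64)
	if err != nil {
		log.Fatalf("bad duration %q: %v", os.Args[1], err)
	}

	client, err := statsd.New("127.0.0.1:8125") // local DogStatsD agent
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Tag by pipeline and step label so each step gets its own series.
	// BUILDKITE_PARALLEL_JOB is intentionally not a tag, so the parallel
	// jobs of a single step aggregate into one series.
	tags := []string{
		"pipeline:" + os.Getenv("BUILDKITE_PIPELINE_SLUG"),
		"branch:" + os.Getenv("BUILDKITE_BRANCH"),
		"step:" + os.Getenv("BUILDKITE_LABEL"),
		"exit_code:" + os.Getenv("BUILDKITE_COMMAND_EXIT_STATUS"),
	}
	// "custom.buildkite.step.duration" is a made-up metric name.
	err = client.Timing("custom.buildkite.step.duration",
		time.Duration(secs*float64(time.Second)), tags, 1)
	if err != nil {
		log.Println("failed to send metric:", err)
	}
}
```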

yob commented 5 years ago

Here's where we've ended up for now: a sinatra app that accepts buildkite webhooks and converts them to datadog traces.

It doesn't handle the parallel jobs issue raised by @lox above, but otherwise it's working really well. The nifty bit is the waterfall charts on datadog; they really help identify where the bottlenecks in a build are.

[Image: build waterfall chart in datadog]

lox commented 5 years ago

That is awesome @yob. How are you breaking up the jobs into the sub-sections?

yob commented 5 years ago

I'm not really. I record a span for the overall build duration, and child spans for all completed jobs.

In the image embedded in the readme, all the jruby jobs (near the end of the build) are parallel jobs from a single step.
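As an illustration of that structure (the actual app is a Sinatra/Ruby service; this is just a hedged sketch in Go using dd-trace-go), the parent span covers the build and each completed job becomes a child span with explicit start and finish times, so parallel jobs show up as sibling bars in the waterfall. The types and sample data below stand in for whatever a build.finished webhook payload provides.

```go
package main

import (
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// Placeholder types standing in for data parsed from a build.finished webhook.
type job struct {
	Name       string
	StartedAt  time.Time
	FinishedAt time.Time
}

type build struct {
	Pipeline   string
	StartedAt  time.Time
	FinishedAt time.Time
	Jobs       []job
}

func emitTrace(b build) {
	// Parent span covering the whole build, with explicit timestamps so the
	// trace reflects when the build actually ran rather than "now".
	buildSpan := tracer.StartSpan("buildkite.build",
		tracer.ResourceName(b.Pipeline),
		tracer.StartTime(b.StartedAt),
	)

	// One child span per completed job; parallel jobs simply appear as
	// sibling spans in the waterfall.
	for _, j := range b.Jobs {
		jobSpan := tracer.StartSpan("buildkite.job",
			tracer.ResourceName(j.Name),
			tracer.ChildOf(buildSpan.Context()),
			tracer.StartTime(j.StartedAt),
		)
		jobSpan.Finish(tracer.FinishTime(j.FinishedAt))
	}

	buildSpan.Finish(tracer.FinishTime(b.FinishedAt))
}

func main() {
	tracer.Start(tracer.WithServiceName("buildkite-builds"))
	defer tracer.Stop()

	// In the real app this data would come from a Buildkite webhook.
	now := time.Now()
	emitTrace(build{
		Pipeline:   "my-pipeline",
		StartedAt:  now.Add(-10 * time.Minute),
		FinishedAt: now,
		Jobs: []job{
			{Name: "rspec", StartedAt: now.Add(-10 * time.Minute), FinishedAt: now.Add(-2 * time.Minute)},
			{Name: "lint", StartedAt: now.Add(-10 * time.Minute), FinishedAt: now.Add(-8 * time.Minute)},
		},
	})
}
```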

nickbrowne commented 5 years ago

That's a neat idea, using tracing with spans instead of just metrics; we'll definitely be switching to that at some point. Looks like you can do decent monitoring even on individual spans too :+1:

yob commented 5 years ago

I can't take any credit for the idea - I stole it from @petemounce :)

goodspark commented 4 years ago

FWIW, tracing support was added in PR #1273.

petemounce commented 4 years ago

I'm sad that we didn't start developing in the open, but yeah - https://buildkite-community.slack.com/archives/C02T53V9H/p1545224770171700 and then https://improbable.io/blog/continuous-integration-at-improbable. Good times :)

petemounce commented 4 years ago

@nickbrowne on the off chance you might not have seen it, take a look at the docker-ecr-cache-buildkite-plugin. It will let you replace the 30s docker build with a probably-much-faster docker pull of a cached image.

Counterintuitively, keep the bottleneck step to avoid parallel steps racing to build the same container if it changes.

keithduncan commented 3 years ago

Going to close this thanks to the amazing work of @goodspark in #1273 :bow: Thanks again!