fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

CI is slow and prone to failures #21233

Open lukeheath opened 2 months ago

lukeheath commented 2 months ago

Fleet version: main


💥  Actual behavior

Over time our CI pipeline has built up quite a few workflows, and we haven't done much optimizing. Some low-hanging fruit:

1) The goreleaser-fleet.yaml workflow failed today due to running out of disk space. I upgraded the runner to 4-core and it passed, but we noticed that we're still using Ubuntu 20.04, and we should update to the latest LTS. We just didn't want to do that in the middle of publishing a release without testing first.
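
For reference, a minimal sketch of what that runner change could look like in goreleaser-fleet.yaml. The job name and the larger-runner label here are placeholders (larger-runner labels are configured per org), not the actual workflow contents:

```yaml
# Sketch only: job name and runner labels below are placeholders.
jobs:
  goreleaser:
    # Larger hosted runners come with more disk as well as more cores.
    runs-on: ubuntu-latest-4-cores
    # Or pin an explicit newer image instead of ubuntu-20.04:
    # runs-on: ubuntu-24.04
```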

2) Go tests (and all tests, generally) take a long time to complete, which means a lot of waiting. Would it be worthwhile to parallelize them, or upgrade the runners for some of them? It may be worth an investment here.
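
A hedged sketch of the "parallelize" option, in case it helps scope this: shard Go packages across a job matrix. The shard count, flags, and file names are illustrative, not tuned for our suite:

```yaml
# Illustrative only: shard count, flags, and file names are assumptions.
jobs:
  go-tests:
    runs-on: ubuntu-22.04
    strategy:
      fail-fast: false
      matrix:
        shard: [0, 1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Run this shard's packages
        run: |
          # Deterministically assign every package to one of four shards.
          go list ./... | awk "NR % 4 == ${{ matrix.shard }}" > packages.txt
          go test -race -coverprofile=coverage-${{ matrix.shard }}.out $(cat packages.txt)
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage-${{ matrix.shard }}.out
```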

🕯️ More info (optional)

N/A

lukeheath commented 2 months ago

@sharon-fdm Heads up, I'm assigning this to @iansltx to take a look at after his current task. We've been seeing a fair number of failures and general slowness in CI for a while, and Ian has experience with these processes, so he can look for ways to improve overall speed and reliability, which would be a huge force multiplier for us given how often we wait for CI to finish. It's also a good opportunity for him to ramp up on our build processes.

sharon-fdm commented 2 months ago

@lukeheath, sounds good. Will give Ian a chance to look at our environment too. 👍

lukeheath commented 2 months ago

@iansltx Please put a story point estimate (best guess, I know it's super vague) using this scale.

Tip: If you don't want to use the ZenHub UI, there are browser extensions for ZenHub that will inject some of their UI elements (like estimates) into the GitHub UI.

iansltx commented 2 months ago

Re: pointing, will do. Gut says 8 or 13 due to discovery requirements, but I'll get a better read here in a bit.

Re: ZenHub, it looks like progress is being made there, so fingers crossed I'll be able to interact with this issue in there sooner rather than later.

iansltx commented 2 months ago

Looking at this a bit more closely, the scope here is tighter than I initially thought (so 8 points looks a little more likely). Setting up this checklist in the order I'm currently thinking about things.

First order of business is trying builds with Ubuntu 22.04 and 24.04 bases and seeing if anything breaks.
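
If it helps, one low-effort way to try both at once is an OS matrix on a representative workflow (purely a sketch; the job name is made up):

```yaml
# Sketch: smoke-test both candidate images before switching workflows over.
jobs:
  build-smoke-test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-22.04, ubuntu-24.04]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # ...existing build steps unchanged...
```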

Next up, the question is whether it's worth the money to get Blacksmith or Depot for CI rather than GitHub-hosted runners. Given that we're already using bigger (paid, right?) runners despite being a public repo, this likely gets us more perf for less money, though I don't expect these will work for every job.
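
For either provider, the switch is typically just the `runs-on` label once their GitHub App is installed. The labels below are placeholders from memory, so check the provider docs for the exact names:

```yaml
# Placeholder labels; each provider documents its own runner label scheme.
jobs:
  go-tests:
    # runs-on: ubuntu-latest-4-cores   # example GitHub larger-runner label (org-specific)
    runs-on: depot-ubuntu-22.04-4      # e.g. a 4-vCPU Depot runner; Blacksmith uses its own labels
```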

Beyond that, there are a few ways we could go to optimize end-to-end CI times for a given type of change. The things that stick out to me are:

  1. We're running various CI steps on changes that are irrelevant to those steps (e.g. handbook changes triggering stuff to do with golang code); see the path-filter sketch after this list.
  2. We should be able to make use of caching for JS deps in particular. Given that we have a Yarn lock file, caching node_modules based on its hash should get us a perf uplift anywhere we run yarn install (also sketched after this list), though as I recall GitHub's caching isn't exactly the fastest thing in the world.
  3. DB migrations take about 6 seconds to run. That only matters if we're doing them a bunch per test suite. TBD whether we are.
  4. We already have two concurrent jobs for tests when they run on a schedule. The "run the tests" part of one of them is >33 minutes, so my guess is that we could split it further or parallelize in-container. The concern there is getting proper code coverage numbers back out if no single run sees the full picture (see the coverage sketch at the end of this comment).
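
Sketch for items 1 and 2 above. The path globs, workflow shape, and Node version are assumptions, and note that setup-node's built-in cache option stores Yarn's package cache keyed on yarn.lock rather than node_modules itself:

```yaml
# Illustrative only: path globs, job names, and versions are assumptions.
on:
  pull_request:
    paths-ignore:
      - 'handbook/**'
      - 'docs/**'
      - '**/*.md'

jobs:
  frontend-tests:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20   # placeholder; pin to whatever we actually use
          cache: yarn        # caches Yarn's cache, keyed on yarn.lock
      - run: yarn install --frozen-lockfile
      - run: yarn test
```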

Let me know how much of my thinking above is directionally correct.
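
On the coverage concern in item 4: if each shard writes a text profile (as in the sharding sketch earlier in this thread), one common approach is to merge them in a follow-up job before reporting. Sketch below, assuming all shards use the same -covermode:

```yaml
# Sketch: merge per-shard coverage profiles, assuming a consistent -covermode.
jobs:
  coverage:
    needs: go-tests
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: coverage-*
          merge-multiple: true
      - name: Merge shard profiles
        run: |
          # Text profiles concatenate cleanly once the duplicate "mode:" lines are dropped.
          head -1 "$(ls coverage-*.out | head -1)" > merged-coverage.out
          tail -q -n +2 coverage-*.out >> merged-coverage.out
      # Then upload merged-coverage.out to whatever coverage service we report to.
```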

getvictor commented 2 months ago

FYI, I'm working on identifying slow Go CI tests.

lukeheath commented 1 month ago

I'm self-assigning for now to determine how/when we want to action this.

jacobshandling commented 1 week ago

Flaky frontend tests involving tooltips can also be addressed here. There is also some local frontend test flakiness around date-fns, observed by both myself and @iansltx, though it doesn't seem to occur in GitHub CI, so it should be the lowest priority.