fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Packaging CI test on macOS is extremely unreliable #22206

Open iansltx opened 1 week ago

iansltx commented 1 week ago

Fleet version: All

Web browser and operating system: macOS in CI


💥  Actual behavior

The Test packaging workflow job on macOS has significant reliability issues. It is responsible for the vast majority of the ~25% cancel/fail rate that we see on that workflow: >90% of cancels and >80% of failures are solely attributable to unreliable setup of Docker/Colima on macOS.

This is a problem because a large percentage of PRs currently run packaging tests as part of the test suite, even when the work is unrelated to packaging.

🧑‍💻  Steps to reproduce

  1. Open a PR that touches these files
  2. See the macOS packaging job either get cancelled or fail a large % of the time

🛠️ To fix

Run this job less

As an initial mitigation, we could drop the number of paths that trigger packaging testing, as the current paths list casts a wide net. The code path for packaging currently touches the following files:

Additionally, we should keep the following because they're related:

The above path specificity changes should drastically reduce the number of times this workflow is included in builds, so fewer PRs (and people) are affected by the flakiness.

There are also currently no path specificity rules on pushes to main, patch-*, and prepare-* branches, which means that even content changes on main will trigger this check. If we applied even the existing (overly wide) set of paths to pushes to these primary branches, we'd cut out a bunch of redundant builds.
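
As a rough sketch of what path-filtered pushes could look like (not the actual workflow file), the trigger block might be shaped like the fragment below. The `paths` entries are hypothetical placeholders, since the real list would come from the packaging code path described above. The `branches` and `paths` filters themselves are standard GitHub Actions workflow syntax.

```yaml
# Sketch of a workflow trigger section; assumes a GitHub Actions workflow file.
# The paths entries are hypothetical placeholders, NOT the real packaging paths.
on:
  push:
    branches:
      - main
      - patch-*
      - prepare-*
    paths:
      - 'path/to/packaging/code/**'   # placeholder
  pull_request:
    paths:
      - 'path/to/packaging/code/**'   # placeholder
```

With both `branches` and `paths` set on `push`, the workflow only runs when both filters match, so a content-only push to main would skip this check.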

Don't run this job

There was also discussion of dropping macOS from the test matrix entirely, as a high enough % of devs are on macOS that we'll run into any packaging issues naturally, if they crop up, while we're testing other things.

Make this job more reliable

We're running this on macOS 13, so maybe a newer macOS version would be better.

We could also try switching container runtimes again to see if one is more reliable than the other at this juncture, particularly with a newer OS version.

sharon-fdm commented 5 days ago

@iansltx, since we consider this to be a bug, I am removing the engineering-initiated label.