Fleet version: All
Web browser and operating system: macOS in CI
💥 Actual behavior
The Test packaging workflow job on macOS has significant reliability issues. It is responsible for the vast majority of the ~25% cancel/fail rate that we see on that workflow: >90% of cancels and >80% of failures are attributable solely to unreliable setup of Docker/Colima on macOS.
This is problematic because a large percentage of PRs currently run packaging tests as part of the test suite, even for unrelated work.
🧑💻 Steps to reproduce
Watch the macOS packaging job either get cancelled or fail a large % of the time.
🛠️ To fix
Run this job less
As an initial mitigation, we could reduce the number of paths that trigger packaging testing, as the current paths list casts a wide net. The code path for packaging currently touches the following files:
cmd/fleetctl/package.go
orbit/pkg/packaging/**.go
orbit/pkg/constant/**.go
orbit/pkg/update/**.go
ee/fleetctl/local_wix.go
pkg/** (can probably tighten this down, as this is still a rather hot directory)
Additionally, we should keep the following because they're related:
'tools/fleetctl-docker/**'
'tools/wix-docker/**'
'tools/bomutils-docker/**'
'.github/workflows/test-packaging.yml'
The above path specificity changes should drastically reduce the number of times this workflow is included in builds, so fewer PRs/people are affected by the unreliability.
There are also currently no path specificity rules on pushes to main, patch-*, and prepare-* branches, which means that even content changes on main will trigger this check. If we applied even the existing (overly wide) set of paths to pushes to primary branches, we'd cut out a bunch of redundant builds.
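To get a rough sense of how much the narrower list would help, one option is to replay the changed files from recent PRs or pushes against the proposed paths. The sketch below is only an approximation of how workflow path filters behave: it assumes `**` matches any characters including `/` and `*` matches any characters except `/`, and it ignores other glob syntax and negated patterns. The `pattern_to_regex` and `would_trigger` helpers are hypothetical names, not part of any existing tooling, and the file names in the usage note are made up for illustration.

```python
import re

# Proposed trigger paths, copied from the lists above.
PROPOSED_PATHS = [
    "cmd/fleetctl/package.go",
    "orbit/pkg/packaging/**.go",
    "orbit/pkg/constant/**.go",
    "orbit/pkg/update/**.go",
    "ee/fleetctl/local_wix.go",
    "pkg/**",
    "tools/fleetctl-docker/**",
    "tools/wix-docker/**",
    "tools/bomutils-docker/**",
    ".github/workflows/test-packaging.yml",
]


def pattern_to_regex(pattern):
    """Translate a path filter into a compiled regex (approximation).

    Assumption: '**' matches any characters including '/', and a single
    '*' matches any characters except '/'. Nothing else is special.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def would_trigger(changed_files, patterns=PROPOSED_PATHS):
    """Return True if any changed file matches any trigger pattern."""
    regexes = [pattern_to_regex(p) for p in patterns]
    return any(r.fullmatch(f) for f in changed_files for r in regexes)
```

For example, `would_trigger(["docs/README.md"])` is false under these assumptions, while `would_trigger(["pkg/file/file.go"])` is true because of the still-broad `pkg/**` entry. Running this over the changed-file lists of recent PRs would show how much tightening `pkg/**` is worth.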
Don't run this job
There was also discussion of dropping macOS from the test matrix entirely, as a high enough % of devs are on macOS that we'll run into any packaging issues naturally if they crop up as we're testing other things.
Make this job more reliable
We're running this on macOS 13, so a newer macOS version might be more reliable.
We could also try switching container runtimes again to see if one is more reliable than the other at this juncture, particularly with a newer OS version.