Simplify the GenerateBuildMatrix command algorithm and the resulting build matrix

lbussell commented 1 year ago

The GenerateBuildMatrix command has been the source of some build breaks lately. It has become very complex and we should re-evaluate whether or not it is accomplishing its goals.

We want a few things out of our build matrices:

1. The build should be as fast as possible.
2. The build should be resilient to transient failures from things like tests or the dotnet install script (failing due to flaky internet in most cases).

Addressing [1], let's analyze a recent internal nightly build: https://dev.azure.com/dnceng/internal/_build/results?buildId=2262218&view=results [internal link].

The Build stage alone lasted 1h 17m. It doesn't show how long it took to acquire each agent, but no individual build lasted longer than 10m43s once it received an agent. Upwards of 30 minutes can be spent waiting on agents. See the agent queue here: https://dev.azure.com/dnceng/internal/_settings/agentqueues?queueId=349&view=jobs [internal link].

Now let's look at another build, a merge from main to nightly that rebuilt all Dockerfiles: https://dev.azure.com/dnceng/internal/_build/results?buildId=2241780&view=logs&j=edd22d22-c099-5806-e9e3-0a3334d9bc8d

Looking at one of our larger (in size) distros, Linux_amd64 src-runtime-deps-8.0-jammy-graph, the build itself took 5m29s. Of that, only 1m11s was spent on actually building Dockerfiles.

In these examples, we spend far more "wall-clock" time waiting for build agents and performing other boilerplate tasks that aren't related to actually building images. I hypothesize that we would save time and compute by building more images in each leg.
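To make the overhead concrete, here is a rough back-of-the-envelope sketch using the 5m29s and 1m11s figures quoted above. The `estimated_compute` model and the leg counts (40 and 10) are assumptions made up for illustration, not measurements from our pipelines; the model simply charges every leg the same fixed overhead.

```python
# Figures quoted above for Linux_amd64 src-runtime-deps-8.0-jammy-graph.
TOTAL_LEG_SECONDS = 5 * 60 + 29       # 5m29s for the whole leg
DOCKER_BUILD_SECONDS = 1 * 60 + 11    # 1m11s spent actually building Dockerfiles


def overhead_seconds(total: int, building: int) -> int:
    """Time in a leg that is not spent building Dockerfiles."""
    return total - building


def estimated_compute(legs: int, total_build_seconds: int, per_leg_overhead: int) -> int:
    """Hypothetical model: each leg pays the same fixed overhead once."""
    return legs * per_leg_overhead + total_build_seconds


overhead = overhead_seconds(TOTAL_LEG_SECONDS, DOCKER_BUILD_SECONDS)
build_share = DOCKER_BUILD_SECONDS / TOTAL_LEG_SECONDS

print(f"overhead per leg: {overhead}s")
print(f"share of leg spent building: {build_share:.0%}")

# Made-up scenario: the same Dockerfile work spread over 40 legs vs. 10 legs.
total_build = 40 * DOCKER_BUILD_SECONDS
print("40 legs:", estimated_compute(40, total_build, overhead), "agent-seconds")
print("10 legs:", estimated_compute(10, total_build, overhead), "agent-seconds")
```

This ignores agent queue time and the fact that fewer, larger legs run less in parallel, so it only speaks to total compute, not wall-clock time. That trade-off is what an experiment would need to measure.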

Currently, we have two build matrix generation "pathways": one for internal builds, called PlatformDependencyGraph, which builds Dockerfiles solely based on their FROM instructions, and another for tests and PR builds, called PlatformVersionedOs: https://github.com/dotnet/docker-tools/blob/0ccb9b854f816b67aae99bae0acb4775ef019a47/src/Microsoft.DotNet.ImageBuilder/src/Commands/GenerateBuildMatrixCommand.cs#L438-L445

If building more images in each leg is more time and cost-efficient, we should always include each image's build and test dependencies in the same build leg. This would eliminate the need for two matrix generation pathways since they would both accomplish the goal of being efficient and testable.
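As a minimal sketch of what "each image plus its build and test dependencies in one leg" could look like, the snippet below groups images into connected components of a dependency graph. This is not the existing GenerateBuildMatrixCommand logic; `group_into_legs`, the input shape, and the image paths are all hypothetical.

```python
from collections import defaultdict


def group_into_legs(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group images so each one shares a leg with everything it depends on.

    `dependencies` maps an image to the in-repo images it needs, whether
    through a FROM instruction or as a test dependency (assumed input shape).
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for image, deps in dependencies.items():
        find(image)  # register images that have no dependencies
        for dep in deps:
            union(image, dep)

    legs: dict[str, list[str]] = defaultdict(list)
    for image in list(parent):
        legs[find(image)].append(image)
    return sorted(sorted(members) for members in legs.values())


# Hypothetical image paths, for illustration only.
images = {
    "runtime-deps/8.0/jammy": [],
    "runtime/8.0/jammy": ["runtime-deps/8.0/jammy"],
    "aspnet/8.0/jammy": ["runtime/8.0/jammy"],
    "sdk/8.0/jammy": ["aspnet/8.0/jammy"],
    "runtime-deps/8.0/alpine3.19": [],
    "runtime/8.0/alpine3.19": ["runtime-deps/8.0/alpine3.19"],
}

for leg in group_into_legs(images):
    print(leg)
```

With a single grouping rule like this, the same matrix could in principle serve both internal builds and PR/test builds, since every leg already contains what it needs to build and test its images.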

For [2], we currently run builds and tests in separate pipeline stages internally. This allows us to retry tests by themselves if they fail or if we need last-minute test changes. We could keep this structure and just use the same matrix generation algorithm for both build stages, or we could experiment with incorporating the retry/image pull logic in the build stage.

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

lbussell commented 7 months ago

We should consider grouping more images together in our matrix generation (for example, by OS only, not by .NET version). The hypothesis is that this would save a considerable amount of compute across jobs, since each job carries fixed overhead from cleanup and compliance tasks and there would be fewer jobs overall. We should run an experiment to validate this.
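A quick sketch of how the grouping key changes the number of legs. The `Platform` type, `build_matrix` helper, and the version/OS lists are hypothetical and only meant to illustrate the idea, not to mirror the actual ImageBuilder types.

```python
from collections import defaultdict
from typing import Callable, Hashable, NamedTuple


class Platform(NamedTuple):
    dotnet_version: str
    os_version: str
    arch: str


def build_matrix(
    platforms: list[Platform], key: Callable[[Platform], Hashable]
) -> dict[Hashable, list[Platform]]:
    """Group platforms into legs; each distinct key value becomes one leg."""
    legs: dict[Hashable, list[Platform]] = defaultdict(list)
    for platform in platforms:
        legs[key(platform)].append(platform)
    return dict(legs)


# Made-up platform list for illustration.
PLATFORMS = [
    Platform(version, os_version, "amd64")
    for version in ("6.0", "8.0", "9.0")
    for os_version in ("jammy", "alpine3.19")
]

by_version_and_os = build_matrix(
    PLATFORMS, lambda p: (p.dotnet_version, p.os_version, p.arch)
)
by_os_only = build_matrix(PLATFORMS, lambda p: (p.os_version, p.arch))

print("legs when grouping by .NET version and OS:", len(by_version_and_os))
print("legs when grouping by OS only:", len(by_os_only))
```

Fewer legs means fewer copies of the per-job cleanup and compliance work, at the cost of longer individual legs. The experiment would tell us where the break-even point is.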

dotnet / docker-tools

Simplify the GenerateBuildMatrix command algorithm and the resulting build matrix #1178