Allow build and test stages to be combined for official builds

mthalman commented 4 years ago

When the building and testing of Docker images get split out into separate stages, it introduces overhead because the agents used in the test stage need to pull down the images that were built. The PR builds do not use separate stages because that would require access to secrets for accessing the ACR to store the images. Evidence shows that the PR builds run significantly faster as a result of this difference.

.NET Core Multi-stage: 50 mins Single stage: 30 mins

.NET Fx Multi-stage: 2 hr 15 mins Single stage: 1 hr 10 mins

There can be so much variability in the performance of pulling images. It's completely unpredictable. We're really at the mercy of this unpredictable performance, especially for the very large .NET Fx images.

The proposal is for this to be the default way in which builds are executed but still allow for a multi-stage to be optionally used. There are scenarios where it is still beneficial to run multiple stages, such as when pre-building images prior to release tic-toc and then testing/publishing them during the release.

MichaelSimons commented 4 years ago

For .NET Core, we know the long pole of the builds are the Windows ARM legs because of the use of RPI devices. When ARM support goes out of support, what is the projected difference?

We are working to increase the number of shared layers specifically in the SDK images. This will translate to decrease in the number of layers to pull which will help this problem.

There are scenarios in which the multiple stages are beneficial and save time. The most used one is during .NET Core release. The images can get built concurrently as the NuGet packages are published. In the past this has been beneficial when NuGet is slow to index. It is also beneficial when NuGet package publishing misses packages. When this happens, the images don't need to get rebuilt again after the publishing issue is addressed, rather the tests can get rerun. Support for only running build/test can be maintained without stages but it gets complicated and difficult to maintain which is partially why stages were utilized.

mthalman commented 4 years ago

Without Windows ARM, the estimate would be 27 mins for multi-stage and 24 mins for single stage. So the benefit really goes to .NET Fx in that case.

Regarding your scenario for when multiple stages are beneficial, I fully recognize that which is why I noted that we should still allow for multiple stages as an option.

MichaelSimons commented 4 years ago

Regarding the .NET FX time difference, out of curiosity I took a look at the most recent full build (internal link) I found. The build took 2h 20m. Drilling in, the tests took 1h 4m. Of this, the 4.8 1903 leg took 1h by itself. The next longest leg was 23m. In this long leg 58m was spent in two test cases of which 57m was pulling the SDK and WCF images with several retries. I took a look at the second longest test leg and it is a similar story. I am left wondering what causes these pull issues.

Is this a result of pulling images immediately after pushing? Is this an issue with ACR? Is this an issue with our build? The agents are Azure machines so this should be the fastest/most reliable type of ACR pull.
Is it possible MCR is hitting these types of issues when it ingest our images? Solving https://github.com/dotnet/docker-tools/issues/434 would give us these insights?

I'd like to understand this problem. It seems like this is something we should potentially be addressing. If this didn't happen the 2h 15m build would have taken substantially less time. I conjecture so much less time that it eliminates the need to consider combining the build and test stages.

MichaelSimons commented 4 years ago

If it is decided to do this work, https://github.com/dotnet/docker-tools/issues/185 should be closed because it is essentially the opposite issue.

MichaelSimons commented 4 years ago

We are going to need the ability to build independent of running tests and then be able to run the tests at a later time. This is going to add too much complexity.

mthalman commented 2 years ago

I'd like to revisit this issue because of the increased agent provisioning wait time introduced by the migration to the Arcade pools. Wait time for Linux agents averages around 10 minutes. Due to job dependencies in the test stage between the matrix generation job and the actual test jobs, this means the test stage has a 20 minute overhead of waiting for agents.

By combining build and test into a single job, we can avoid this extra test stage and eliminate the 20 minute overhead. But it actually goes further than that. The Post-Build stage could also be eliminated as a separate job and just be combined into the Publish stage. That would eliminate another 10 minutes of overhead. And since the tests would no longer need to pull the image, there'd be a reduction in execution time for that as well. This all means that combining build and test into a single job would reduce overhead by 30+ minutes.

Even if that provisioning time was to be halved by Eng Services making agent infrastructure changes so that the overhead was 15+ minutes, this would still be a worthwhile reduction in time.

Regarding the above comment:

We are going to need the ability to build independent of running tests and then be able to run the tests at a later time.

I don't see it as adding that much complexity. We could still allow independent running of tests through the use of some pipeline variable that would cause all the build steps to be skipped. Similarly, a variable could be used to only build and not test.

mthalman commented 2 years ago

[Triage] - It'd be worthwhile to do a more in-depth analysis on the implementation cost and the scenarios that we care about:

Tic-toc scenario: only build the images in one pipeline run, then test/publish in a second pipeline run.
What is the impact when a build or test failure occurs?

dotnet / docker-tools

Allow build and test stages to be combined for official builds #479