Closed — richlander closed this issue 3 years ago
Thanks so much, Richard, for collecting all these numbers. I must admit I'm still struggling to properly understand the results, though. In particular, I find it fascinating that tiered compilation can make some runs faster and others slower, even for presumably the same payload. Similarly, the comparison between composite and non-composite sheds interesting light on the platform differences. These are the initial theories I'm drawing from your results:
In most cases TC doesn't seem to improve R2R performance, which is kind of expected; after all, Bing also runs with TC off. There are just two outliers to this general rule: in "Desktop WSL2 -- i9" we see a systemic regression with TC disabled in both composite and non-composite mode, and non-composite "macOS -- M1" performs worse with TC disabled. It may be interesting to drill deeper into these to get a better understanding of the runtime behavior in the presence / absence of TC. It might also be interesting to understand how the platforms you used for testing differ in terms of CPU extensions, as the difference between general-purpose code and code targeting a particular CPU extension set is generally believed to be a major factor in the tiered re-compilation perf delta.
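How the test machines differ in CPU extensions can be checked directly; a minimal sketch, assuming the reading is taken inside the Linux containers used for these tests (on x64 the relevant `/proc/cpuinfo` line is `flags`, on arm64 it is `Features`):

```shell
# Dump the CPU extension sets visible to the runtime (SSE/AVX on x64, NEON/asimd on arm64).
# /proc/cpuinfo is available inside the Linux containers even on macOS/Windows hosts.
grep -E -m1 '^(flags|Features)' /proc/cpuinfo | tr ' \t' '\n\n' | grep -E 'sse|avx|neon|asimd' | sort -u
```

Comparing that output across the i9, i7, Xeon, and M1 machines would show which extension sets the JIT can target on each.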
Your initial numbers seem to indicate that composite is better than non-composite on Windows, while on Unix it's the opposite. I wonder whether that may give us more insight into the platform differences. From my own experience, bulk file operations (like `git clean -xdf`) are typically much faster on Unix than on Windows. I'm speculating that subtle OS differences in filesystem management, and/or the manual OS loader implementation we're using on Unix, may affect whether it's more performant to load and run a single big file or a multitude of small files.
In the "macOS -- Intel" run, the perf difference in non-composite mode between TC enabled and disabled is huge (37 seconds), vastly bigger than in any of the other runs. Would you be willing to double-check that this particular reading wasn't affected by a testing glitch? Otherwise this would be super interesting to drill into, as the perf difference is almost 9%.
Tomas
I'm with you on the hard-to-interpret results.
One clarification is that the Windows and macOS results are largely still Linux, since they are running in Linux containers. This is particularly true on Windows, running Docker in WSL2.
I re-ran the test script on three of the OSes. I'm not a perf measurement expert, but this time I made sure to close as many other apps as possible and not to touch the machine at all while the tests were running. The numbers appear to be closer now.
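A crude way to sanity-check that kind of noise without extra tooling is to loop the workload and print per-run wall time; a sketch in which `sh -c ':'` is a placeholder for the actual test script:

```shell
# Run the workload a few times and print per-run wall time (ms),
# so outliers caused by background activity are easy to spot.
results=""
for i in 1 2 3; do
  t0=$(date +%s%N)   # nanosecond timestamp (GNU date)
  sh -c ':'          # placeholder: substitute the real container build/test script
  t1=$(date +%s%N)
  ms=$(( (t1 - t0) / 1000000 ))
  results="$results run$i=${ms}ms"
  echo "run $i: ${ms} ms"
done
```

If the runs are within a few percent of each other, the machine was quiet enough; a single outlier suggests background interference rather than a real runtime difference.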
There is some network traffic with this test. Perhaps we can change this test to avoid any network traffic. Or just pick another scenario. @davidwrighton only suggested this scenario to validate correctness of the build in this composite environment. I opted to test performance at the same time.
Thanks, Richard, for validating composite. It's encouraging that you didn't hit any functional issues.
> There is some network traffic with this test. Perhaps we can change this test to avoid any network traffic.
I assume the network traffic is `git clone`? Can the test just measure the pure compile times without the clone?

> assume the network traffic is `git clone`?

No. I called `git clone` before calling `time`. If you look at the script, you'll see that. I was thinking about `dotnet restore`. I could potentially re-run the tests with restore and build being separate. Before I do that, we should validate that the rest of this test makes sense as a performance test.
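A sketch of what that separation could look like; the script name, `--depth 1`, and `--no-restore` are my assumptions here, not what the original script did:

```shell
# Write a revised test script that front-loads the network-bound steps
# (clone + restore) so that `time` measures only the offline build.
cat > build-roslyn.sh <<'EOF'
#!/bin/sh
set -e
git clone --depth 1 https://github.com/dotnet/roslyn   # network: clone the repo
cd roslyn
dotnet restore                                         # network: NuGet packages
time dotnet build --no-restore                         # measured: pure compile time
EOF
chmod +x build-roslyn.sh
```

With `--no-restore`, `dotnet build` skips the implicit restore, so the timed portion should be free of network traffic.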
I filed this SDK issue about a coherent SDK in order to produce containers with composite images (due to the way we layer the images). I then did some data collection to quantify the severity of the problem. We concluded that most of the problem was (potentially) with Roslyn dependencies. @davidwrighton encouraged me to do some testing with containers using composite and suggested building Roslyn as a good test case (a big repo of C#). That's what I'm sharing here. I'm posting it to share (primarily) with my team so we can discuss next steps.
I did some quick surgery on our runtime container images to remove the SHA checks, and then I could just use the ARGs/ENVs to set the right values. I found the build versions @ https://github.com/dotnet/runtime/blob/main/docs/project/dogfooding.md. I just downloaded the files I wanted with `curl -Lv` to see the actual URLs. I then wrote two scripts, to build and then test the container images, respectively.
I tested building the Roslyn repo four different ways, to help ferret out the performance differences:
TC = Tiered Compilation.
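The four configurations can be driven with the runtime's `DOTNET_TieredCompilation` knob (`COMPlus_TieredCompilation` on older runtimes); a sketch in which the `roslyn-build:*` image tags are hypothetical and the commands are echoed rather than executed:

```shell
# Enumerate the four configurations: {TC on, TC off} x {composite, non-composite}.
# Drop the leading `echo` to actually run the containers.
n=0
for tc in 1 0; do
  for image in roslyn-build:composite roslyn-build:noncomposite; do
    echo docker run --rm -e DOTNET_TieredCompilation=$tc "$image"
    n=$((n + 1))
  done
done
```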
Summary:
Here are the results from a few different machines I had available, of very different capabilities. I changed the scripts a bit as I went, so the info differs slightly between runs, but not in any relevant way.
Desktop WSL2 -- i9
Second run.
Synology Linux -- Xeon
Desktop WSL2 -- i7
macOS -- Intel
Second run.
macOS -- M1
Second run.