Closed — richlander closed this issue 3 years ago
Thanks so much, Richard, for collecting all these numbers. I must admit I'm still struggling to properly understand the results, though. In particular, I find it fascinating that tiered compilation can make some runs faster and others slower, even for presumably the same payload. Similarly, the comparison between composite and non-composite sheds interesting light on the platform differences. These are the initial theories I'm drawing from your results:
In most cases TC doesn't seem to improve R2R performance, which is kind of expected; after all, Bing also runs with TC off. There are just two outliers to this general rule: in "Desktop WSL2 -- i9" we see a systemic regression with TC disabled in both composite and non-composite mode, and non-composite "macOS -- M1" performs worse with TC disabled. It may be interesting to drill deeper into these to get a better understanding of the runtime behavior in the presence / absence of TC. It might also be interesting to understand how the platforms you used for testing differ in terms of CPU extensions, as the difference between general-purpose code and code targeting a particular CPU extension set is generally believed to be a major factor in the tiered re-compilation perf delta.
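How the test machines differ in CPU extensions can be checked directly; a minimal sketch, assuming the reading is taken inside the Linux containers used for these tests (on x64 the relevant `/proc/cpuinfo` line is `flags`, on arm64 it is `Features`):

```shell
# Dump the CPU extension sets visible to the runtime (SSE/AVX on x64, NEON/asimd on arm64).
# /proc/cpuinfo is available inside the Linux containers even on macOS/Windows hosts.
grep -E -m1 '^(flags|Features)' /proc/cpuinfo | tr ' \t' '\n\n' | grep -E 'sse|avx|neon|asimd' | sort -u
```

Comparing that output across the i9, i7, Xeon, and M1 machines would show which extension sets the JIT can target on each.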
Your initial numbers seem to indicate that composite is better than non-composite on Windows, while on Unix it's the opposite. I wonder whether that may give us more insight into the platform differences. From my own experience, bulk file operations (like `git clean -xdf`) are typically much faster on Unix than on Windows. I'm speculating that subtle OS differences in filesystem management, and/or the manual OS loader implementation we're using on Unix, may affect whether it's more performant to load and run a single big file or a multitude of small files.
In the "macOS -- Intel" run, the perf difference in non-composite mode between TC enabled and disabled is huge (37 seconds), vastly bigger than in any of the other runs. Would you be willing to double-check that this particular reading wasn't affected by a testing glitch? Otherwise this would be super interesting to drill into, as the perf difference is almost 9%.
Tomas
I'm with you on the hard-to-interpret results.
One clarification is that the Windows and macOS results are largely still Linux, since they are running in Linux containers. This is particularly true on Windows, running Docker in WSL2.
I re-ran the test script on three of the OSes. I'm not a perf measurement expert, but this time I made sure to close as many other apps as possible and not to touch the machine at all while the tests were running. The numbers appear to be closer now.
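A crude way to sanity-check that kind of noise without extra tooling is to loop the workload and print per-run wall time; a sketch in which `sh -c ':'` is a placeholder for the actual test script:

```shell
# Run the workload a few times and print per-run wall time (ms),
# so outliers caused by background activity are easy to spot.
results=""
for i in 1 2 3; do
  t0=$(date +%s%N)   # nanosecond timestamp (GNU date)
  sh -c ':'          # placeholder: substitute the real container build/test script
  t1=$(date +%s%N)
  ms=$(( (t1 - t0) / 1000000 ))
  results="$results run$i=${ms}ms"
  echo "run $i: ${ms} ms"
done
```

If the runs are within a few percent of each other, the machine was quiet enough; a single outlier suggests background interference rather than a real runtime difference.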
There is some network traffic with this test. Perhaps we can change this test to avoid any network traffic. Or just pick another scenario. @davidwrighton only suggested this scenario to validate correctness of the build in this composite environment. I opted to test performance at the same time.
Thanks, Richard, for validating composite. It's encouraging that you didn't hit any functional issues.
> There is some network traffic with this test. Perhaps we can change this test to avoid any network traffic.
I assume the network traffic is `git clone`? Can the test just measure the pure compile times without the clone?

> assume the network traffic is `git clone`?

No. I called `git clone` before calling `time`. If you look at the script, you'll see that. I was thinking about `dotnet restore`. I could potentially re-run the tests with restore and build being separate. Before I do that, we should validate that the rest of this test makes sense as a performance test.
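A sketch of what that separation could look like; the script name, `--depth 1`, and `--no-restore` are my assumptions here, not what the original script did:

```shell
# Write a revised test script that front-loads the network-bound steps
# (clone + restore) so that `time` measures only the offline build.
cat > build-roslyn.sh <<'EOF'
#!/bin/sh
set -e
git clone --depth 1 https://github.com/dotnet/roslyn   # network: clone the repo
cd roslyn
dotnet restore                                         # network: NuGet packages
time dotnet build --no-restore                         # measured: pure compile time
EOF
chmod +x build-roslyn.sh
```

With `--no-restore`, `dotnet build` skips the implicit restore, so the timed portion should be free of network traffic.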
I filed this SDK issue about a coherent SDK in order to produce containers with composite images (due to the way we layer the images). I then did some data collection to quantify the severity of the problem. We concluded that most of the problem was (potentially) with Roslyn dependencies. @davidwrighton encouraged me to do some testing with containers using composite and suggested building Roslyn as a good test case (a big repo of C#). That's what I'm sharing here. I'm posting it to share (primarily) with my team so we can discuss next steps.
I did some quick surgery on our runtime container images to remove the SHA checks, and then I could just use the ARGs/ENVs to set the right values. I found the build versions @ https://github.com/dotnet/runtime/blob/main/docs/project/dogfooding.md. I just downloaded the files I wanted with `curl -Lv` to see the actual URLs. I then wrote two scripts, to build and then test the container images, respectively.
I tested building the Roslyn repo four different ways, to help ferret out the performance differences:
TC = Tiered Compilation.
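The four configurations can be driven with the runtime's `DOTNET_TieredCompilation` knob (`COMPlus_TieredCompilation` on older runtimes); a sketch in which the `roslyn-build:*` image tags are hypothetical and the commands are echoed rather than executed:

```shell
# Enumerate the four configurations: {TC on, TC off} x {composite, non-composite}.
# Drop the leading `echo` to actually run the containers.
n=0
for tc in 1 0; do
  for image in roslyn-build:composite roslyn-build:noncomposite; do
    echo docker run --rm -e DOTNET_TieredCompilation=$tc "$image"
    n=$((n + 1))
  done
done
```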
Summary:
Here are the results from a few different machines I had available, of very different capabilities. I changed the scripts a bit as I went, so the info differs slightly between runs, but not in any relevant way.
Desktop WSL2 -- i9
Second run.
Synology Linux -- Xeon
Desktop WSL2 -- i7
macOS -- Intel
Second run.
macOS -- M1
Second run.