MichalPetryka opened this issue 2 years ago
Right. There is a general consideration that most "native" compilers (and most shipping apps) target the "lowest common machine" for portability.
For x86/x64 this is x86-x64-v1, which is a machine from ~2004. For Arm64 this is armv8.0-a.
Some apps do target and ship for higher baselines, but this is always an explicit action by the project doing the compilation and has, especially for AVX/AVX2, seen negative backlash from the end customers.
A JIT, on the other hand, is always compiling temporary code for "your hardware" and by default will take advantage of whatever your hardware is. Since this generated assembly is in memory only and uncached, portability is not a concern and so JITs have an "advantage" in that they can always be "host native".
This effectively boils down to the fact that while -march=native for AOT vs JIT is an "interesting" comparison, it is hardly representative of real-world shipping apps.
AOT targeting -march=x86-x64-v1 and the JIT targeting -march=native is a realistic comparison of the perf that most end users will actually see.
It is, therefore, likely better for BDN to match this experience by default and for AOT targeting a higher baseline to be an explicit opt-in by the project author (much as it would have to be for any .NET projects they build/deploy outside BDN).
Further to our Discord conversation, it would also be good to have the NAOT default, since the opportunistic inclusion of all non-VEX instruction sets on top of the baseline SSE2 can have an unpredictable performance impact compared to the same set of ISAs being set as required.
In fact, I'd argue that NAOT default should be the BDN default and that any changes should be explicit just as they are with NAOT.
The ideal fix for this would be to expose the target machine as a first-class concept that is visible to the user.
Even if you're doing perf measurements with a JIT-based runtime, you might get wildly different perf characteristics based on the machine the code is deployed to. It would be useful to be able to e.g. test with AVX disabled (COMPlus_EnableAVX=0) with CoreCLR.
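For illustration, a minimal sketch of how such a comparison might be set up with BDN's Job API, assuming the usual WithEnvironmentVariable helper (exact method names may vary between BDN versions):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Two jobs on the same machine: one using whatever the host CPU supports,
// one with AVX turned off via the COMPlus_EnableAVX knob, so the difference
// can be measured directly.
var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithId("HostNative"))
    .AddJob(Job.Default
        .WithEnvironmentVariable("COMPlus_EnableAVX", "0")
        .WithId("NoAvx"));

// Then run as usual, e.g. BenchmarkRunner.Run<YourBenchmarks>(config);
// (YourBenchmarks is a placeholder for the benchmark class being measured.)
```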
NativeAOT testing currently matches the "use whatever is available on the machine" model.
you might get wildly different perf characteristics based on the machine the code is deployed to.
Yes, although in practice this isn't the case. All major cloud providers are on AVX2+ hardware, and the Steam Hardware Survey reports that 88.96% of users have the same.
The scenarios where you're running on older hardware tend to be rare, and so having the JIT target the "current machine" matches the actual default deployment scenario for the JIT.
Targeting "current machine" for NAOT, however, does the exact opposite. This is not the default deployment scenario for NAOT and it will lead to incorrect and inconsistent perf results.
Our goal, first and foremost, should be to best match the perf characteristics users will see using the default build/publish pipeline.
Providing additional configuration knobs allowing explicit opt-in for other configurations (such as disabling AVX for JIT, or enabling AVX for AOT) is then desirable, ideally using the same mechanism (or a near mirror of the mechanism) that the build/publish commands would require.
I don't know if it's so far off.
BenchmarkDotNet is tuned to measuring the best possible case - that's why it does all the warmup with JIT by default. With tiered PGO you'll get the best possible code for the specific benchmarked scenario. It's not what I would call the real-world version of the code either - it is a lab result - but that's the nature of micro benchmarks.
If we have space for it somewhere in the logs, it would be good for it to capture the HW intrinsic config that was used (both for JIT-based and AOT-based solutions) - the JIT one helps with reproducibility. The AOT one helps with discoverability.
BenchmarkDotNet is tuned to measuring the best possible case - that's why it does all the warmup with JIT by default. With tiered PGO you'll get the best possible code for the specific benchmarked scenario
It's not tuned for the "best possible case". It's tuned for reducing noise and measuring the "final throughput" that your app would actually hit. For most user-interactive applications, websites, or other scenarios, the app will actually achieve "final throughput" for any hot methods. Cold methods typically don't matter, and if they do, BDN provides the relevant switches and knobs to test them.
NAOT targeting "current machine" as the core scenario, however, isn't representative of any case a user would see by default and explicitly causes users to miss out on important perf metrics that their actual deployment will hit, particularly around hardware intrinsics due to checks like if (Sse41.IsSupported) being emitted as dynamic checks. There are many examples of how these dynamic checks can introduce "worst case" scenarios through common coding patterns that are branch-free with the JIT.
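For example, a minimal sketch (not taken from any specific library code) of the kind of pattern in question:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Example
{
    // With the JIT, Sse41.IsSupported is a compile-time constant, so this
    // method compiles down to a single branch-free path. With the default
    // NativeAOT instruction set (SSE2 baseline plus opportunistic checks),
    // the guard can instead be emitted as a dynamic check.
    public static bool AllZero(Vector128<byte> value)
    {
        if (Sse41.IsSupported)
        {
            // PTEST: a single instruction when SSE4.1 is known to be present.
            return Sse41.TestZ(value, value);
        }

        // SSE2 fallback: compare against zero and inspect the resulting mask.
        return Sse2.MoveMask(Sse2.CompareEqual(value, Vector128<byte>.Zero)) == 0xFFFF;
    }
}
```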
This makes it a very big difference and ends up hurting end users more than it helps them, which is the opposite of what any of our tools should be aiming for.
We should always aim to give users correct and representative results for normal/typical usage scenarios. We should then provide the relevant knobs and switches so they can test and validate other scenarios where that is also relevant.
that your app would actually hit
The app would hit it only if it repeatedly calls into e.g. the same framework API with the exact same parameters and never anything else. Dynamic PGO will adapt the framework code to expect those parameters. This is not how APIs are used in the real world; it is a synthetic benchmark that measures the best possible case and skews the results, possibly even in a misleading way.
march=native is the common way to run native benchmarks - e.g. benchmarksgame also doesn't run x64 with SSE2 only even though that would be the real world. Micro benchmarks are not real-world benchmarks. I don't see why we would want to make .NET AOT look worse than Rust or C++ in benchmarks by default.
The purpose of benchmarks isn't to make ourselves look good. They are a tool meant to help users best productize their own apps and to track potential regressions.
Other languages favoring "lying" about their results shouldn't matter to us, we should do better, be better, and provide the best features/defaults for users to achieve real world success.
=====
The app would hit it only if it repeatedly calls into e.g. the same framework API with the exact same parameters and never anything else. Dynamic PGO will adapt the framework code to expect those parameters
Tiering will ensure that it is correctly optimized.
Dynamic PGO will train branches and a few other minor things. For real world scenarios it will lead users towards success. For benchmarks, it is a pit of failure for many cases and we'll need to ensure devs understand how it will impact their results. They can still have success and can write benchmarks that measure success.
This is different from NAOT targeting the current hardware, which is a very niche edge case for real world apps and which will definitely lead users towards a pit of failure.
Other languages favoring "lying" about their results shouldn't matter to us, we should do better, be better, and provide the best features/defaults for users to achieve real world success.
This is aligning with the industry standard.
Then the industry standard is wrong. It will not help the community achieve success for their actual deployments.
Instead, it will directly lead them into actual pits of failure by lulling them into a false sense of security, providing incorrect data that doesn't match what their actual customers will see when running their apps. It won't even be something that could be hand-waved as close enough.
Again, .NET should do and be better.
We should pave the way towards success for the community and our customers. That often involves aligning with standards, but it can just as much be rectifying pits of failure and not aligning with practices where it can and will provide bad or incorrect data.
It's strange to compare BDN, which is a robust benchmarking tool, with Benchmarks Game, which has game right in the name.
As a tool, BDN should handle real-world developer needs. My guess is real-world NAOT users will fall into two main camps:
1) Users concerned primarily with startup time and predictable perf, such as those building serverless apps. These users will likely have a specific minimum hardware target and will compile with that target instruction set list.
2) Users concerned primarily with ease of deployment, such as authors of single file utils meant for distribution across a variety of target hardware. These users will likely compile with the NAOT default instruction set list.
As Tanner pointed out, the second group are likely to encounter certain performance pitfalls related to the fact that NAOT emits runtime checks for post-SSE2 ISAs. It will be important for those users to be able to benchmark NAOT against JITted code, see that there's a difference, and do something to mitigate it, whether that's changing the code or changing the ILC instruction set list. BDN should be the right tool for that kind of testing.
To me, that means sensible defaults so that real-world users can find real-world problems before they ship their apps, as well as a way to explore the impact of raising the minimum hardware requirements if that's an option for them. BDN really ought to cover both groups, though.
If the concern is over having NAOT possibly (probably?) show up as being slower than CoreCLR in typical use, perhaps a message from BDN about running with ILC defaults would be appropriate. Something along the lines of the error you get when trying to benchmark a debug build but without it being a hard error.
BenchmarkDotNet does not allow for pre-building the benchmarks and running them on a different machine. The results are always specific to current config, this is why we print so much info in the summary table.
BDN is often used to compare the performance of various .NET Runtimes, and we obviously want to make it an apples-to-apples comparison if we can. That is why the default settings take advantage of the available instructions the same way the JIT does. This Hardware Intrinsic information is now being printed in the Summary (#2051), so users should be aware of it:
BenchmarkDotNet=v0.13.1.20220803-develop, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT AVX2
Job-KGTSXC : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-XNJYAK : .NET 7.0.0-rc.1.22402.8, X64 NativeAOT AVX2
For those who don't like the defaults, we offer the possibility to customize the settings via the NativeAotToolchainBuilder.IlcInstructionSet($value) method:
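For example, something along these lines (a sketch; builder methods other than IlcInstructionSet are from memory and may differ between BDN versions):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Toolchains.NativeAot;

// Pin the ILC instruction set explicitly instead of relying on the
// current-machine defaults.
var toolchain = NativeAotToolchain.CreateBuilder()
    .UseNuGet()                    // resolve the ILC compiler from NuGet
    .IlcInstructionSet("avx2")     // e.g. require AVX2 rather than the SSE2 baseline + dynamic checks
    .ToToolchain();

var config = DefaultConfig.Instance
    .AddJob(Job.Default.WithToolchain(toolchain));
```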
I am OK with adding a warning, but not with changing the defaults.
That is why the default settings take advantage of the available instructions the same way JIT does.
That is the point of this discussion.
For BDN targeting JIT it is doing the same thing that users will see "by default" running an app on their machine.
For BDN targeting NAOT, it is not. It is doing something that is not what users will see "by default" running an app on their machine and thus leads them into a pit of failure.
Having the JIT and NAOT each target the same "machine" they target by default continues to provide consistent results across multiple runs. It likewise gives a more real-world representation of what users will see for JIT vs NAOT deployments when running their code.
I expect we may actually have several non-trivial perf bugs hiding in the runtime specifically because BDN is doing something different from the default deployment model (that is, what users will see for dotnet publish for NAOT by default). One example would be https://source.dot.net/#System.Private.CoreLib/ASCIIUtility.cs,4d0262c58263b9e8, which is a known "hot path" used as part of getting a count of ASCII bytes. We have similar paths, with similar potential issues, for Latin1 and UTF-8.
This is likely a hidden perf bug because it won't reproduce when targeting "current machine", because all the hardware we run against (and most hardware users run against) will be AVX or AVX2 compatible. However, the actual default NAOT deployment will target SSE2 (x86-x64-v1 more specifically), and so every Sse41.IsSupported check will be a dynamic branch in the hot loop.
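A simplified sketch of the shape of such a hot path (not the actual ASCIIUtility implementation), just to show where the guard sits relative to the loop:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class AsciiSketch
{
    // Counts leading ASCII bytes. With the JIT, Sse41.IsSupported folds to a
    // constant and the loop condition is branch-free with respect to the ISA
    // check; with the default NativeAOT instruction set it is the kind of
    // dynamic branch in the hot loop described above.
    public static unsafe int CountLeadingAsciiBytes(byte* pBuffer, int length)
    {
        int i = 0;
        Vector128<byte> highBits = Vector128.Create((byte)0x80);

        while (Sse41.IsSupported && i + 16 <= length)
        {
            Vector128<byte> chunk = Sse2.LoadVector128(pBuffer + i);

            // TestZ returns true when no byte has its high bit set,
            // i.e. the whole 16-byte chunk is ASCII.
            if (!Sse41.TestZ(chunk, highBits))
            {
                break;
            }

            i += 16;
        }

        // Scalar tail (and full fallback when the vector path is unavailable).
        while (i < length && pBuffer[i] <= 0x7F)
        {
            i++;
        }

        return i;
    }
}
```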
-- As an aside, regardless of what the BDN default is, we likely have some perf issues.
We likely should be running perf tests, at least as an outerloop, for dotnet/runtime for x86-x64-v1 (the dotnet publish default), x86-x64-v2 (most hardware in the last 15 years), and x86-x64-v3 (effectively march=native, all enterprise-level cloud hardware).
This would both help identify the actual likely hidden perf bugs called out above and would help cover the scenarios that our customers will realistically encounter when running on their hardware.
Defaulting to the current machine's features, introduced in #1994, doesn't seem reasonable to me, since a developer would likely never distribute such a build to customers, ignoring servers with known hardware.
As @tannergooding said a few times on the C# discord, comparing a JIT with an AOT using march=native isn't meaningful, and here we're doing the same. I'd introduce a separate configuration option, useNativeCpuFeatures, and default it to false to make benchmarking a bit more reasonable.
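To illustrate the proposal, a hypothetical sketch of how that opt-in could look; WithNativeCpuFeatures does not exist in BenchmarkDotNet today and is only the proposed shape:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Proposed default: NativeAOT jobs use the ILC instruction-set defaults.
var config = DefaultConfig.Instance
    .AddJob(Job.Default);

// Hypothetical explicit opt-in to host-native CPU features (off by default):
// config = config.AddJob(Job.Default.WithNativeCpuFeatures(true));
```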