kamronbatman opened 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
| Author | kamronbatman |
|---|---|
| Assignees | - |
| Labels | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`, `needs-area-label` |
| Milestone | - |
The result on .NET 7 looks like constant elimination. You can use `[DisassemblyDiagnoser]` to check the codegen difference and see whether 8.0 solves this issue.
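For reference, here is a minimal sketch of wiring up the diagnoser. The class and benchmark bodies are hypothetical stand-ins for illustration, not the code from the post:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// [DisassemblyDiagnoser] makes BDN capture the JIT-generated assembly for
// each benchmark, so the .NET 7 vs .NET 8 codegen can be diffed directly.
[DisassemblyDiagnoser(maxDepth: 1)]
public class NullableComparisonBench
{
    private int? _value = 42;
    private int _threshold = 10;

    [Benchmark]
    public bool NullableCompare() => _value > _threshold; // lifted nullable comparison

    [Benchmark]
    public bool ManualCompare() =>
        _value.HasValue && _value.Value > _threshold; // explicit null check

    public static void Main() => BenchmarkRunner.Run<NullableComparisonBench>();
}
```

Running this with `-c Release` emits a disassembly listing per benchmark alongside the timing table, which is enough to compare the two code shapes across runtimes.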
It's also worth checking the C# compiler version; newer versions of the compiler can also optimize the lowered result.
I tried to avoid constant elimination, so I would love some ideas on a better way to benchmark this. The .NET 8 lowered code didn't look any different from what I saw. Meanwhile, I'll run the disassembly diagnoser and get back to you.
EDIT: I added the disassembly. Note that I modified the test to avoid the constant elimination (I hope). I'll update the post with the benchmark code as soon as I get back to the computer.
cc @AndyAyersMS, looks like a jump threading candidate
For the less efficient codegen, see https://github.com/dotnet/roslyn/issues/69411
Is the version up top the updated one?
@AndyAyersMS - I updated the benchmark code to match. The post is pretty long, so here are the results at a glance after running it again.
| Method | Job | Runtime | Mean | Error | StdDev | Code Size |
|---------------------- |--------- |--------- |----------:|----------:|----------:|----------:|
| BenchmarkNullableCode | .NET 7.0 | .NET 7.0 | 0.7544 ns | 0.0124 ns | 0.0116 ns | 101 B |
| BenchmarkOldCode | .NET 7.0 | .NET 7.0 | 0.1169 ns | 0.0056 ns | 0.0050 ns | 58 B |
| BenchmarkNullableCode | .NET 8.0 | .NET 8.0 | 0.1297 ns | 0.0059 ns | 0.0055 ns | 99 B |
| BenchmarkOldCode | .NET 8.0 | .NET 8.0 | 0.1181 ns | 0.0054 ns | 0.0045 ns | 58 B |
@AndyAyersMS - any planned movement on this?
The .NET 8 numbers look reasonably good, so are you asking if (a) we'd improve that further, or (b) retroactively fix .NET 7?
The disassembly in .NET 8 doesn't look much better, so are we relying on JIT optimizations to further improve the assembly? I am simply curious whether it's possible to lower this specific, widely used nullable code pattern so it isn't so gnarly, rather than relying on JIT optimization techniques. Barring the extra optimization techniques in .NET 8 (probably PGO?), the code runs considerably slower than a developer would expect. 🤔
Hmm... 9.0 (p5 at least) seems a bit slower than 8.0.
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-preview.5.24307.3
  [Host]   : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.30607), X64 RyuJIT AVX2
Method | Job | Runtime | Mean | Error | StdDev | Code Size |
---|---|---|---|---|---|---|
BenchmarkNullableCode | .NET 7.0 | .NET 7.0 | 1.2402 ns | 0.0511 ns | 0.0453 ns | 101 B |
BenchmarkOldCode | .NET 7.0 | .NET 7.0 | 0.3828 ns | 0.0284 ns | 0.0266 ns | 58 B |
BenchmarkNullableCode | .NET 8.0 | .NET 8.0 | 0.3623 ns | 0.0082 ns | 0.0069 ns | 102 B |
BenchmarkOldCode | .NET 8.0 | .NET 8.0 | 0.0984 ns | 0.0042 ns | 0.0037 ns | 58 B |
BenchmarkNullableCode | .NET 9.0 | .NET 9.0 | 0.8771 ns | 0.0172 ns | 0.0144 ns | 99 B |
BenchmarkOldCode | .NET 9.0 | .NET 9.0 | 0.1029 ns | 0.0104 ns | 0.0092 ns | 59 B |
Latest main is no better:
Method | Mean | Error | StdDev |
---|---|---|---|
BenchmarkNullableCode | 0.9329 ns | 0.0401 ns | 0.0376 ns |
BenchmarkOldCode | 0.2190 ns | 0.0178 ns | 0.0166 ns |
Running locally, I get wildly different results from run to run, e.g.:
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-TJQMUP | net8 | 0.8824 ns | 0.0446 ns | 0.0458 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-YHTPVL | net9 | 0.8284 ns | 0.0434 ns | 0.0406 ns | 0.94 | 0.07 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-NZWAEZ | net8 | 0.9280 ns | 0.0471 ns | 0.0644 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-OWHYHW | net9 | 1.1097 ns | 0.0452 ns | 0.0423 ns | 1.21 | 0.09 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-WNNBGS | net8 | 0.9035 ns | 0.0197 ns | 0.0184 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-QGZPGM | net9 | 0.8542 ns | 0.0315 ns | 0.0294 ns | 0.95 | 0.04 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-XSHHGG | net8 | 0.8623 ns | 0.0309 ns | 0.0317 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-YHVYFK | net9 | 0.9135 ns | 0.0209 ns | 0.0232 ns | 1.06 | 0.05 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-RFNRHQ | net8 | 0.3588 ns | 0.0147 ns | 0.0169 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-ARRJWF | net9 | 0.8860 ns | 0.0138 ns | 0.0153 ns | 2.47 | 0.12 |
It looks like when this tiers up it is right around the threshold for switching to probabilistic counting, and the inconsistent counts somehow can trigger better block layouts. So there is some subtle microarchitectural issue here.
E.g., this is from a "slow" net8 run:
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 7391. 7391 [000..007)-> BB09 ( cond ) IBC
BB02 [0001] 1 BB01 7391. 7391 [007..015)-> BB04 ( cond ) IBC
BB03 [0002] 1 BB02 0 0 [015..022)-> BB05 (always) rare IBC
BB04 [0003] 1 BB02 7391. 7391 [022..02C) IBC
BB05 [0004] 2 BB03,BB04 7391. 7391 [02C..036)-> BB07 ( cond ) IBC
BB06 [0005] 1 BB05 0 0 [036..042)-> BB08 (always) rare IBC
BB07 [0006] 1 BB05 7391. 7391 [042..050) IBC
BB08 [0007] 2 BB06,BB07 7391. 7391 [050..068)-> BB10 ( cond ) IBC
BB09 [0008] 2 BB01,BB08 0 0 [068..06C)-> BB11 (always) rare IBC
BB10 [0009] 1 BB08 7391. 7391 [06C..06E) IBC
BB11 [0010] 2 BB09,BB10 7391. 7391 [06E..070) (return) IBC
-----------------------------------------------------------------------------------------------------------------------------------------
and this from a fast net9 run (the sample-based counting threshold is 8192, so we've gone past it):
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 8356. 8356 [000..007)-> BB09(0.000992),BB02(0.999) ( cond ) IBC
BB02 [0001] 1 BB01 8348. 8348 [007..015)-> BB04(0.997),BB03(0.00261) ( cond ) IBC
BB03 [0002] 1 BB02 21.83 22 [015..022)-> BB05(1) (always) IBC
BB04 [0003] 1 BB02 8326. 8326 [022..02C)-> BB05(1) (always) IBC
BB05 [0004] 2 BB03,BB04 8348. 8348 [02C..036)-> BB07(0.999),BB06(0.000713) ( cond ) IBC
BB06 [0005] 1 BB05 5.95 6 [036..042)-> BB08(1) (always) IBC
BB07 [0006] 1 BB05 8342. 8342 [042..050)-> BB08(1) (always) IBC
BB08 [0007] 2 BB06,BB07 8348. 8348 [050..068)-> BB10(1),BB09(0) ( cond ) IBC
BB09 [0008] 2 BB01,BB08 8.29 8 [068..06C)-> BB11(1) (always) IBC
BB10 [0009] 1 BB08 8348. 8348 [06C..06E)-> BB11(1) (always) IBC
BB11 [0010] 2 BB09,BB10 8356. 8356 [06E..070) (return) IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
As a hypothesis, if I boost the threshold to, say, 2^16 (via `DOTNET_TieredPGO_ScalableCountThreshold=10`), I get numbers that seem more stable. Though perhaps I'm fooling myself, and/or it's not legitimate to run BDN multiple times and infer something from the distribution of results.
Method | Job | Toolchain | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-XWZQSJ | net8 | 1.0229 ns | 0.0067 ns | 0.0059 ns | 1.00 |
BenchmarkNullableCode | Job-JOITVI | net9 | 0.8731 ns | 0.0111 ns | 0.0099 ns | 0.85 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-LMKDNX | net8 | 1.0392 ns | 0.0248 ns | 0.0232 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-RHJLUK | net9 | 0.8662 ns | 0.0067 ns | 0.0060 ns | 0.83 | 0.02 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-GXNZLX | net8 | 0.9298 ns | 0.0073 ns | 0.0064 ns | 1.00 |
BenchmarkNullableCode | Job-TRRRNZ | net9 | 0.8706 ns | 0.0117 ns | 0.0110 ns | 0.94 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-DSCIBP | net8 | 1.0343 ns | 0.0212 ns | 0.0188 ns | 1.00 |
BenchmarkNullableCode | Job-ZSYRRW | net9 | 0.8689 ns | 0.0104 ns | 0.0097 ns | 0.84 |
Method | Job | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
BenchmarkNullableCode | Job-DALTXU | net8 | 0.9321 ns | 0.0217 ns | 0.0193 ns | 1.00 | 0.00 |
BenchmarkNullableCode | Job-MNQFHJ | net9 | 0.8569 ns | 0.0050 ns | 0.0041 ns | 0.92 | 0.02 |
At any rate, I don't see us resolving this during .NET 9, and overall .NET 9 seems to do pretty well, so moving this to Future.
Description
I was cleaning up some code, and made a simple change in my repo, something like this:
Before:
After:
I happened to have the lowered C# viewer open and noticed the new code was not that great. So I benchmarked it to see if it mattered for performance, and it looks like the optional chaining makes a significant difference in performance. I am wondering if there is a way to add logic that identifies the null check and pulls it out, as a developer would in older C# code.
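The actual before/after code from the repo is elided above; as a hypothetical reconstruction of the shape of the change (names invented for illustration), the cleanup replaces an explicit null check with a lifted nullable comparison:

```csharp
// Before: explicit null check; the comparison only runs on the unwrapped value.
static bool OldCode(int? bonus, int threshold) =>
    bonus.HasValue && bonus.Value > threshold;

// After: the "cleaned up" lifted comparison. The compiler lowers this to a
// GetValueOrDefault()/HasValue ternary, the verbose shape shown in Analysis.
static bool NullableCode(int? bonus, int threshold) =>
    bonus > threshold;
```

Both forms return the same result for every input (a null `bonus` compares false), so the rewrite is purely cosmetic at the source level; the difference is only in what the compiler and JIT generate.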
Configuration
Here is the BenchmarkDotNet example code I came up with. I would love to refine it to help track down whether this is worth fixing:
Regression?
I am not sure if this is a regression per se, or a naive unfurling of the syntactic sugar that warrants investigation. At the least, the community should be made aware not to "clean" their code in this way, I suppose.
Data
Benchmarks for .NET 7
Disassembly Diagnoser for .NET 7
.NET 7.0.12 (7.0.1223.47720), X64 RyuJIT AVX2
.NET 7.0.12 (7.0.1223.47720), X64 RyuJIT AVX2
Benchmarks for .NET 8
I ran the same benchmark with .NET 8, and the results look good. So maybe this is a moot point?
Disassembly Diagnoser for .NET 8
.NET 8.0.0 (8.0.23.47906), X64 RyuJIT AVX2
.NET 8.0.0 (8.0.23.47906), X64 RyuJIT AVX2
Analysis
The lowered code for the null conditional looks like this:
This is actually better than the lowered code I saw from my repo:
Note the ternary:
num1 = nullable4.GetValueOrDefault() > num3 & nullable4.HasValue ? 1 : 0;
Not sure why that was necessary. From what I can tell, it is simply very verbose and not efficient.
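For context, that ternary is the compiler's branch-free encoding of the lifted comparison `nullable4 > num3`, and it is observably equivalent to the hand-written check. A sketch (the method names here are invented; `Lifted` mirrors the lowered form above):

```csharp
// Mirrors the compiler's lowering: the non-short-circuiting & evaluates both
// operands, and GetValueOrDefault() is safe to call even when HasValue is false.
// Relational > binds tighter than &, so this parses as (a.GetValueOrDefault() > b) & a.HasValue.
static int Lifted(int? a, int b) =>
    a.GetValueOrDefault() > b & a.HasValue ? 1 : 0;

// The hand-written "old code" shape with an explicit short-circuiting check.
static int Manual(int? a, int b) =>
    a.HasValue && a.Value > b ? 1 : 0;
```

The two agree for all inputs, including null; the performance gap in the tables above comes from the instruction sequence the JIT ends up generating for each shape, not from any semantic difference.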