Closed kaalus closed 2 years ago
I'd recommend to rewrite the benchmark to smaller sub-benchmarks and with Benchmark.NET and let it to care about iterations and input, your code also measures Random.NextValue speed which is not desired I guess.
It looks like you got trapped by Tiered Compilation, see also Run-time configuration options for compilation on how to disable it.
For benchmarking it is advisable to use https://benchmarkdotnet.org/.
Thanks for reporting it @kaalus.
It looks like you got trapped by Tiered Compilation, see also Run-time configuration options for compilation on how to disable it.
TestLongMultiply(); // Make sure all's jitted
that should be enough to warm TestLongMultiply(double v1, double v2)
and recompile it with full jit and you can't be caught in a loop because we have TC_QuickJitForLoops=0
by default, but it would be interesting to see results with complus_TieredCompilation=0
.
The code does tier up as the key methods are called in a loop, and the driver methods with loops will bypass Tier0. It takes about 200ms on my machine for Tier1 code to get installed.
However, I also see apparent regressions with tiering disabled.
I converted this over to BDN (keeping the code structure as is) and it also reports a regression from desktop. The 3.1 runs are quite a bit more varied and somewhat bimodal, and their "fast" time is closer to 4.8's.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.778 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha.1.20073.10
[Host] : .NET Core 2.1.16 (CoreCLR 4.6.28516.03, CoreFX 4.6.28516.10), X64 RyuJIT
.NET 4.7.2 : .NET Framework 4.8 (4.8.4121.0), X64 RyuJIT
.NET Core 2.1 : .NET Core 2.1.16 (CoreCLR 4.6.28516.03, CoreFX 4.6.28516.10), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
Method | Job | Runtime | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
TestLongMultiply | .NET 4.7.2 | .NET 4.7.2 | 3.270 s | 0.0297 s | 0.0264 s | 1.00 | 0.00 |
TestLongMultiply | .NET Core 2.1 | .NET Core 2.1 | 3.467 s | 0.0552 s | 0.0516 s | 1.06 | 0.02 |
TestLongMultiply | .NET Core 3.1 | .NET Core 3.1 | 3.799 s | 0.0144 s | 0.0113 s | 1.16 | 0.01 |
So this merits a closer look. Note runtimes seem pretty volatile -- if I reduce the value of count1
in the test code I can sometimes get quite different results.
Probably not going to get to this for 5.0, so marking as future.
Running a further modified BDN version (no looping in the test method, no random values) still shows 5.0 regressions:
(edit -- disregard this, see below)
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
Job-VGGTDI : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
Job-NMVNIB : .NET Core 2.1.24 (CoreCLR 4.6.29518.01, CoreFX 4.6.29518.01), X64 RyuJIT
Job-QZTYYY : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
Job-TENKRT : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
| Method | Job | Runtime | Toolchain | Mean | Error | StdDev | Median | Ratio | RatioSD |
|----------------- |----------- |-------------- |-------------- |-----------:|---------:|---------:|-----------:|------:|--------:|
| TestLongMultiply | Job-VGGTDI | .NET 4.7.2 | net472 | 1,040.9 ns | 32.36 ns | 95.42 ns | 1,018.7 ns | 1.00 | 0.00 |
| TestLongMultiply | Job-NMVNIB | .NET Core 2.1 | netcoreapp2.1 | 830.9 ns | 28.65 ns | 82.21 ns | 802.8 ns | 0.81 | 0.09 |
| TestLongMultiply | Job-QZTYYY | .NET Core 3.1 | netcoreapp3.1 | 829.6 ns | 16.62 ns | 21.02 ns | 831.3 ns | 0.80 | 0.07 |
| TestLongMultiply | Job-TENKRT | .NET Core 5.0 | netcoreapp5.0 | 1,105.2 ns | 21.97 ns | 45.86 ns | 1,093.5 ns | 1.05 | 0.10 |
In the above all the time is spent constructing the Random
object, so needed to remove that call too. Now we have:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
Job-OYHRDB : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
Job-NJBFZP : .NET Core 2.1.24 (CoreCLR 4.6.29518.01, CoreFX 4.6.29518.01), X64 RyuJIT
Job-GOZUDV : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
Job-GFNDSP : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
| Method | Job | Runtime | Toolchain | Mean | Error | StdDev | Median | Ratio | RatioSD |
|----------------- |----------- |-------------- |-------------- |---------:|---------:|----------:|---------:|------:|--------:|
| TestLongMultiply | Job-OYHRDB | .NET 4.7.2 | net472 | 305.7 ns | 5.93 ns | 7.71 ns | 304.8 ns | 1.00 | 0.00 |
| TestLongMultiply | Job-NJBFZP | .NET Core 2.1 | netcoreapp2.1 | 284.7 ns | 3.59 ns | 3.00 ns | 284.9 ns | 0.93 | 0.03 |
| TestLongMultiply | Job-GOZUDV | .NET Core 3.1 | netcoreapp3.1 | 372.5 ns | 8.88 ns | 25.77 ns | 363.0 ns | 1.24 | 0.08 |
| TestLongMultiply | Job-GFNDSP | .NET Core 5.0 | netcoreapp5.0 | 622.5 ns | 38.72 ns | 111.71 ns | 588.8 ns | 2.19 | 0.30 |
and still what looks like a clear regression.
Results still fluctuate quite a bit from run to run and sometimes only show a very small regression. ETL analysis of just benchmark intervals shows:
;; 4.7.2
02.20% 3.07E+06 ? Unknown
31.62% 4.405E+07 jit ??? [bdn]Program.TestLongMultiply(float64,float64)
30.94% 4.311E+07 jit ??? [bdn]Fixed.op_Explicit(value class Fixed)
20.43% 2.846E+07 jit ??? [bdn]Fixed.op_Implicit(float64)
08.96% 1.249E+07 jit ??? [bdn]Fixed.LongMultiply(value class Fixed,value class Fixed)
04.21% 5.86E+06 jit ??? [bdn]Program.TestLongMultiply()
;; 5.0.2
06.11% 9.48E+06 ? Unknown
28.56% 4.435E+07 Tier-1 [bdn]Fixed.op_Explicit(value class Fixed)
26.82% 4.164E+07 Tier-1 [bdn]Program.TestLongMultiply(float64,float64)
22.32% 3.465E+07 Tier-1 [bdn]Fixed.op_Implicit(float64)
09.96% 1.546E+07 Tier-1 [bdn]Fixed.LongMultiply(value class Fixed,value class Fixed)
05.79% 8.99E+06 FullOpt [bdn]Program.TestLongMultiply()
so suggests several methods may have worse CQ; however codegen looks similar.
4.7.2 | 5.0.2 |
```assembly ; Fixed.op_Implicit(Double) vmovsd qword ptr [rsp+8],xmm0 mov rax,[rsp+8] mov rcx,0FFFFFFFFFFFF and rcx,rax mov rdx,0 or rdx,rcx mov rcx,rax shr rcx,34 and ecx,7FF add ecx,0FFFFFC01 mov r8,0 test rax,r8 je short M03_L00 neg rdx M03_L00: mov eax,ecx neg eax lea ecx,[rax+24] test ecx,ecx jle short M03_L02 cmp ecx,40 jl short M03_L01 xor eax,eax ret M03_L01: mov rax,rdx sar rax,cl ret M03_L02: cmp ecx,0FFFFFFC0 jg short M03_L03 xor eax,eax ret M03_L03: neg ecx mov rax,rdx shl rax,cl ret ; Total bytes of code 118 ``` | ```assembly ; Fixed.op_Implicit(Double) vmovsd qword ptr [rsp+8],xmm0 mov rax,[rsp+8] mov rdx,0FFFFFFFFFFFF and rdx,rax mov r8,0 or rdx,r8 mov rcx,rax shr rcx,34 and ecx,7FF add ecx,0FFFFFC01 mov r8,0 test rax,r8 je short M03_L00 neg rdx M03_L00: neg ecx add ecx,24 test ecx,ecx jle short M03_L02 cmp ecx,40 jl short M03_L01 xor eax,eax ret M03_L01: mov rax,rdx sar rax,cl ret M03_L02: cmp ecx,0FFFFFFC0 jg short M03_L03 xor eax,eax ret M03_L03: neg ecx mov rax,rdx shl rax,cl ret ; Total bytes of code 115 ``` |
```assembly ; Fixed.LongMultiply(Fixed, Fixed) mov eax,0FFFFFFFF and rax,rcx mov r8,rdx sar r8,20 mov r9d,0FFFFFFFF and rdx,r9 sar rcx,20 mov r9,rcx imul r9,r8 shl r9,30 imul rcx,rdx imul r8,rax add rcx,r8 shl rcx,10 imul rax,rdx shr rax,10 add r9,rcx add rax,r9 ret ; Total bytes of code 69 ``` | ```assembly ; Fixed.LongMultiply(Fixed, Fixed) mov eax,0FFFFFFFF and rax,rcx mov r8,rdx sar r8,20 mov r9d,0FFFFFFFF and rdx,r9 sar rcx,20 mov r9,rcx imul r9,r8 shl r9,30 imul rcx,rdx imul r8,rax add rcx,r8 shl rcx,10 imul rax,rdx shr rax,10 add rcx,r9 add rax,rcx ret ; Total bytes of code 69 ``` |
@kaalus do you have a more fully realized use case for this code that shows regressions?
@jakobbotsch is going to take a fresh look at this.
The code generated for .NET 6.0 seems to be almost identical for all but the main driving loop: https://www.diffchecker.com/lztPvQsp (.NET4.8 on the left, .NET 6 on the right) For the main driving loop the code looks much better in .NET 6, yet I can still reproduce the regression. VTune seems to indicate that it's front-end bound and that the regression might be related to the µop cache. Some hardware counters (here .NET 6 is on the left, .NET 4.8 on the right):
Hardware Event Type | Hardware Event Count |
---|---|
ARITH.DIVIDER_ACTIVE | 3,456,003,456 - 100,683,100,683 = -97,227,097,227 |
BACLEARS.ANY | 37,800,756 - 24,300,486 = 13,500,270 |
BR_MISP_RETIRED.ALL_BRANCHES_PS | Not changed, 334,806,696 |
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE | 1,836,036,720 - 1,629,482,589 = 206,554,131 |
CPU_CLK_UNHALTED.REF_TSC | 166,061,900,000 - 147,117,950,000 = 18,943,950,000 |
CPU_CLK_UNHALTED.REF_XCLK | 1,875,187,503 - 1,696,983,939 = 178,203,564 |
CPU_CLK_UNHALTED.THREAD | 308,324,400,000 - 270,939,050,000 = 37,385,350,000 |
CPU_CLK_UNHALTED.THREAD_P | 306,612,306,612 - 274,455,274,455 = 32,157,032,157 |
CYCLE_ACTIVITY.STALLS_L1D_MISS | 189,000,189 - 162,000,162 = 27,000,027 |
CYCLE_ACTIVITY.STALLS_L2_MISS | 162,000,162 - 54,000,054 = 108,000,108 |
CYCLE_ACTIVITY.STALLS_L3_MISS | 162,000,162 - 0 = 162,000,162 |
CYCLE_ACTIVITY.STALLS_MEM_ANY | 10,584,010,584 - 7,425,007,425 = 3,159,003,159 |
DSB2MITE_SWITCHES.PENALTY_CYCLES | 14,283,014,283 - 8,883,008,883 = 5,400,005,400 |
DTLB_LOAD_MISSES.STLB_HIT | 108,000,108 - 81,000,081 = 27,000,027 |
DTLB_LOAD_MISSES.WALK_ACTIVE | 20,250,405 - 27,000,540 = -6,750,135 |
DTLB_STORE_MISSES.STLB_HIT | 35,100,702 - 24,300,486 = 10,800,216 |
DTLB_STORE_MISSES.WALK_ACTIVE | 1,350,027 - 12,150,243 = -10,800,216 |
EXE_ACTIVITY.1_PORTS_UTIL | 37,017,037,017 - 28,728,028,728 = 8,289,008,289 |
EXE_ACTIVITY.2_PORTS_UTIL | 74,493,074,493 - 55,917,055,917 = 18,576,018,576 |
EXE_ACTIVITY.BOUND_ON_STORES | Not changed, 0 |
EXE_ACTIVITY.EXE_BOUND_0_PORTS | 891,000,891 - 1,080,001,080 = -189,000,189 |
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE | Not changed, 0 |
FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE | Not changed, 0 |
FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE | Not changed, 0 |
FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE | Not changed, 0 |
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE | 46,278,046,278 - 46,413,046,413 = -135,000,135 |
FP_ARITH_INST_RETIRED.SCALAR_SINGLE | Not changed, 0 |
FP_ASSIST.ANY | Not changed, 0 |
FRONTEND_RETIRED.DSB_MISS_PS | 3,290,147,397 - 668,290,095 = 2,621,857,302 |
FRONTEND_RETIRED.L2_MISS_PS | 1,350,081 - 0 = 1,350,081 |
FRONTEND_RETIRED.LATENCY_GE_16_PS | 8,100,486 - 16,200,972 = -8,100,486 |
FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS | 29,639,678,274 - 12,870,322,173 = 16,769,356,101 |
FRONTEND_RETIRED.LATENCY_GE_4_PS | 344,270,655 - 101,256,075 = 243,014,580 |
FRONTEND_RETIRED.STLB_MISS_PS | 0 - 1,350,081 = -1,350,081 |
ICACHE_16B.IFDATA_STALL | 270,000,270 - 351,000,351 = -81,000,081 |
ICACHE_16B.IFDATA_STALL:cmask=1:e=yes | 54,000,054 - 27,000,027 = 27,000,027 |
ICACHE_64B.IFTAG_STALL | 1,147,511,475 - 1,071,910,719 = 75,600,756 |
IDQ.ALL_DSB_CYCLES_4_UOPS | 95,445,095,445 - 130,572,130,572 = -35,127,035,127 |
IDQ.ALL_DSB_CYCLES_ANY_UOPS | 157,626,157,626 - 202,878,202,878 = -45,252,045,252 |
IDQ.ALL_MITE_CYCLES_4_UOPS | 63,504,063,504 - 5,832,005,832 = 57,672,057,672 |
IDQ.ALL_MITE_CYCLES_ANY_UOPS | 100,629,100,629 - 33,939,033,939 = 66,690,066,690 |
IDQ.DSB_UOPS | 534,654,534,654 - 830,439,830,439 = -295,785,295,785 |
IDQ.MITE_UOPS | 353,322,353,322 - 74,439,074,439 = 278,883,278,883 |
IDQ.MS_SWITCHES | 405,000,405 - 216,000,216 = 189,000,189 |
IDQ.MS_UOPS | Not changed, 3,132,003,132 |
IDQ_UOPS_NOT_DELIVERED.CORE | 272,160,272,160 - 124,173,124,173 = 147,987,147,987 |
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE | 16,497,016,497 - 4,293,004,293 = 12,204,012,204 |
ILD_STALL.LCP | Not changed, 0 |
INST_RETIRED.ANY | 744,658,450,000 - 748,778,600,000 = -4,120,150,000 |
INST_RETIRED.PREC_DIST | 740,286,740,286 - 743,931,743,931 = -3,645,003,645 |
INT_MISC.CLEAR_RESTEER_CYCLES | 1,323,001,323 - 1,404,001,404 = -81,000,081 |
INT_MISC.RECOVERY_CYCLES | 2,160,002,160 - 2,376,002,376 = -216,000,216 |
L1D_PEND_MISS.FB_FULL:cmask=1 | Not changed, 0 |
L1D_PEND_MISS.PENDING | Not changed, 567,000,567 |
L2_RQSTS.RFO_HIT | Not changed, 2,700,027 |
LD_BLOCKS.NO_SR | Not changed, 0 |
LD_BLOCKS.STORE_FORWARD | 1,350,027 - 4,050,081 = -2,700,054 |
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS | 114,752,295 - 14,850,297 = 99,901,998 |
MACHINE_CLEARS.COUNT | Not changed, 0 |
MEM_INST_RETIRED.ALL_STORES_PS | 87,210,087,210 - 84,645,084,645 = 2,565,002,565 |
MEM_INST_RETIRED.LOCK_LOADS_PS | 4,050,243 - 0 = 4,050,243 |
MEM_INST_RETIRED.SPLIT_LOADS_PS | Not changed, 0 |
MEM_INST_RETIRED.SPLIT_STORES_PS | Not changed, 0 |
MEM_INST_RETIRED.STLB_MISS_LOADS_PS | Not changed, 0 |
MEM_INST_RETIRED.STLB_MISS_STORES_PS | Not changed, 0 |
MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS | 540,270 - 0 = 540,270 |
MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS | 270,135 - 0 = 270,135 |
MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS | Not changed, 0 |
MEM_LOAD_RETIRED.FB_HIT_PS | Not changed, 18,900,378 |
MEM_LOAD_RETIRED.L1_HIT_PS | 114,804,114,804 - 110,322,110,322 = 4,482,004,482 |
MEM_LOAD_RETIRED.L1_MISS_PS | 16,200,324 - 21,600,432 = -5,400,108 |
MEM_LOAD_RETIRED.L2_HIT_PS | Not changed, 22,950,459 |
MEM_LOAD_RETIRED.L3_HIT_PS | 2,701,080 - 2,025,810 = 675,270 |
MEM_LOAD_RETIRED.L3_MISS_PS | 0 - 1,350,081 = -1,350,081 |
OFFCORE_REQUESTS_BUFFER.SQ_FULL | Not changed, 0 |
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4 | 162,000,162 - 81,000,081 = 81,000,081 |
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD | 972,000,972 - 567,000,567 = 405,000,405 |
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO | 513,000,513 - 324,000,324 = 189,000,189 |
OTHER_ASSISTS.ANY | Not changed, 0 |
PARTIAL_RAT_STALLS.SCOREBOARD | 1,053,001,053 - 918,000,918 = 135,000,135 |
UOPS_DISPATCHED_PORT.PORT_0 | 138,834,138,834 - 132,273,132,273 = 6,561,006,561 |
UOPS_DISPATCHED_PORT.PORT_1 | 135,216,135,216 - 128,547,128,547 = 6,669,006,669 |
UOPS_DISPATCHED_PORT.PORT_2 | 75,168,075,168 - 72,279,072,279 = 2,889,002,889 |
UOPS_DISPATCHED_PORT.PORT_3 | 80,487,080,487 - 76,491,076,491 = 3,996,003,996 |
UOPS_DISPATCHED_PORT.PORT_4 | 89,289,089,289 - 90,018,090,018 = -729,000,729 |
UOPS_DISPATCHED_PORT.PORT_5 | 148,716,148,716 - 145,341,145,341 = 3,375,003,375 |
UOPS_DISPATCHED_PORT.PORT_6 | 200,934,200,934 - 181,062,181,062 = 19,872,019,872 |
UOPS_DISPATCHED_PORT.PORT_7 | 51,948,051,948 - 55,512,055,512 = -3,564,003,564 |
UOPS_EXECUTED.CORE_CYCLES_GE_1 | 293,625,293,625 - 263,574,263,574 = 30,051,030,051 |
UOPS_EXECUTED.CORE_CYCLES_GE_2 | 256,716,256,716 - 235,008,235,008 = 21,708,021,708 |
UOPS_EXECUTED.CORE_CYCLES_GE_3 | 181,710,181,710 - 178,443,178,443 = 3,267,003,267 |
UOPS_EXECUTED.CORE_CYCLES_NONE | 15,606,015,606 - 12,528,012,528 = 3,078,003,078 |
UOPS_EXECUTED.THREAD | 862,677,862,677 - 829,521,829,521 = 33,156,033,156 |
UOPS_EXECUTED.X87 | Not changed, 0 |
UOPS_ISSUED.ANY | 925,965,925,965 - 924,372,924,372 = 1,593,001,593 |
UOPS_ISSUED.VECTOR_WIDTH_MISMATCH | Not changed, 0 |
UOPS_RETIRED.MACRO_FUSED | 29,700,029,700 - 30,078,030,078 = -378,000,378 |
UOPS_RETIRED.RETIRE_SLOTS | 908,982,908,982 - 897,723,897,723 = 11,259,011,259 |
IDQ.DSB_UOPS
seems to indicate that much fewer µops are delivered from the µop cache in the .NET 6.0 code. So this looks to be some microarchitectural detail. In particular, the following shows the functions where it seems we have µop cache problems in the .NET 6 version:
(I have been unable to get the same data for .NET 4.8 as VTune does not seem to be able to resolve the functions)
Looking at op_Explicit
in the diff above it looks like it is 64-byte aligned in .NET 4.8 but only 16-byte aligned in .NET 6.0. Not sure if a fluke in that particular run, but certainly something to investigate further.
This seems to be related to method ordering and the addresses that the benchmark code ultimately ends up at. The order that .NET 4.8 ends up with is particularly good for the µop cache, it seems. I can regress the .NET 4.8 case by 15% simply by adding more code that is never called during the benchmark, causing the benchmark code to end up being allocated at different addresses.
In the good case the BDN code looks like this and we get the following results:
Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|
TestLongMultiply | 2.964 s | 0.0235 s | 0.0196 s | 1,295 B |
In those cases, the method orderings and addresses look like this: | Address | Delta | Name |
---|---|---|---|
7FFA37C59FE0 | 0 | Fixed.op_Implicit(Double) | |
7FFA37C5A070 | 144 | Program+BenchmarkClass.TestLongMultiply() | |
7FFA37C5A440 | 976 | Fixed.op_Explicit(Fixed) | |
7FFA37C5A490 | 80 | Program+BenchmarkClass.TestLongMultiply(Double, Double) | |
7FFA37C5A570 | 224 | Fixed.LongMultiply(Fixed, Fixed) |
These addresses seem consistent between runs. Now let's add some random code to the constructor of the benchmark class. Here the function Churn
allows us to generate a controllable amount of code to be jitted:
diff --git a/good.cs b/bad.cs
index c5c281a..20b96be 100644
--- a/good.cs
+++ b/bad.cs
@@ -44,16 +44,30 @@ public static class Program
public BenchmarkClass()
{
double[] testValues = new double[iters];
Random rand = new Random(3);
for (int i = 0; i < testValues.Length; i++)
testValues[i] = rand.NextDouble();
_testValues = testValues;
+ Churn(18, new GenT<int> { Value = 3 });
+ }
+
+ struct GenT<T>
+ {
+ public T Value;
+ }
+
+ private int Churn<T>(int amount, GenT<T> value) where T : struct
+ {
+ if (amount == 0)
+ return 0;
+
+ return Churn<GenT<T>>(amount - 1, new GenT<GenT<T>> { Value = value }) + 1;
}
Now the results look like this: | Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|---|
TestLongMultiply | 3.426 s | 0.0668 s | 0.0915 s | 1,295 B |
The benchmark regressed by 15%. The ordering and the deltas are the same, but the addresses are different: | Address | Delta | Name |
---|---|---|---|
7FFA37C6A6B0 | 0 | Fixed.op_Implicit(Double) | |
7FFA37C6A740 | 144 | Program+BenchmarkClass.TestLongMultiply() | |
7FFA37C6AB10 | 976 | Fixed.op_Explicit(Fixed) | |
7FFA37C6AB60 | 80 | Program+BenchmarkClass.TestLongMultiply(Double, Double) | |
7FFA37C6AC40 | 224 | Fixed.LongMultiply(Fixed, Fixed) |
We can add even more code to get fast benchmarks again, for example at n=30: | Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|---|
TestLongMultiply | 3.057 s | 0.0598 s | 0.0839 s | 1,295 B |
Address | Delta | Name |
---|---|---|
7FFA37C8AB20 | 0 | Fixed.op_Implicit(Double) |
7FFA37C8ABB0 | 144 | Program+BenchmarkClass.TestLongMultiply() |
7FFA37C8AF80 | 976 | Fixed.op_Explicit(Fixed) |
7FFA37C8AFD0 | 80 | Program+BenchmarkClass.TestLongMultiply(Double, Double) |
7FFA37C8B0B0 | 224 | Fixed.LongMultiply(Fixed, Fixed) |
Now the question is if we can get the fast version in .NET 6. Without any Churn call, the results for .NET 6.0 on my machine are similar to the bad version for .NET 4.8: |
Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|---|
TestLongMultiply | 3.455 s | 0.0167 s | 0.0156 s | 1,167 B |
Address | Delta | Name |
---|---|---|
7FF9FFABAAE0 | 0 | Program+BenchmarkClass.TestLongMultiply() |
7FF9FFABAFC0 | 1248 | Fixed.op_Implicit(Double) |
7FF9FFABB050 | 144 | Fixed.op_Explicit(Fixed) |
7FF9FFABB0A0 | 80 | Program+BenchmarkClass.TestLongMultiply(Double, Double) |
7FF9FFABB180 | 224 | Fixed.LongMultiply(Fixed, Fixed) |
The first difference is that .NET 6 places the functions in a different order. The second is that the delta is larger for TestLongMultiply()
; it seems like .NET 6 is placing precode stubs in the same code heap as hot functions, and some of these end up between Fixed.op_Implicit(Double)
and Program+BenchmarkClass.TestLongMultiply()
. Those precode stubs come from the type loader loading the Math
class when TestLongMultiply
is jitted. We can avoid it by ensuring that type gets loaded before, for example by using some Math
function. Then we get:
Address | Delta | Name |
---|---|---|
7FF9FC1079D0 | 0 | Fixed.op_Implicit(Double) |
7FF9FC107A60 | 144 | Program+BenchmarkClass.TestLongMultiply() |
7FF9FC107E10 | 944 | Fixed.op_Explicit(Fixed) |
7FF9FC107E60 | 80 | Program+BenchmarkClass.TestLongMultiply(Double, Double) |
7FF9FC107F40 | 224 | Fixed.LongMultiply(Fixed, Fixed) |
From a code locality perspective this seems to be as good as we can do. Yet, even with varying the Churn
parameter I am not able to get as good results as the best .NET 4.8 results.
I think there are a couple of actionable items here:
I don't think there's much we can do here for .NET 6 since CQ seems to be on par or better, so I will move this to .NET 7 for now. However, as a workaround you can mark the explicit double conversion with [MethodImpl(MethodImplOptions.AggressiveInlining)] . With this, I get significantly better timings and .NET 6 beats .NET 4.8: |
Method | Job | Runtime | Toolchain | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|---|---|
TestLongMultiply | Job-GIYABY | .NET 6.0 | net6.0 | 2.218 s | 0.0141 s | 0.0125 s | 1.00 | 0.00 | 1,211 B | |
TestLongMultiply | Job-BOXOPL | .NET Framework 4.8 | net48 | 2.461 s | 0.0353 s | 0.0407 s | 1.11 | 0.02 | 1,349 B |
I will close this given the inlining workaround above, and given that this was mainly micro-architectural. We also have the possibility of doing method ordering in R2R now which addresses some of the concerns around that.
I was benchmarking the following code (64 bit fixed point type multiplication) under .NET 4.8 and .NET Core 3.1. Due to the jitter improvements made in .NET Core I expected the performance to be better. Unfortunately this code consistently runs about 12% slower than in .NET 4.8:
.NET 4.8: 2.66 sec .NET Core 3.1: 2.99 sec
(i7 7700k, both projects in Release and "Prefer 32 bit" option disabled)
Please find attached VS solution with both projects: FixedBenchmark.zip
category:cq theme:needs-triage skill-level:expert cost:large