Open johnnygiter opened 1 week ago
@amanasifkhalid seems quite likely layout related, can you take a look?
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
From the .NET 9 JitDump of PerformanceTest::Run
, here's the initial RPO layout:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091] 1 1 100 [???..???)-> BB02(1) (always) i IBC internal
BB02 [0000] 1 BB01 1 100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond ) i IBC hascall gcsafe newobj
BB04 [0020] 1 BB02 0.61 61 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB05 [0021] 1 BB02 0.39 39 [03C..03D)-> BB07(0.544),BB06(0.456) ( cond ) i IBC
BB07 [0027] 1 BB05 0.21 21 [03C..03D)-> BB08(1) (always) i IBC idxlen nullcheck
BB06 [0026] 1 BB05 0.18 18 [03C..03D)-> BB08(1) (always) i IBC
BB08 [0028] 2 BB06,BB07 0.39 39 [03C..03D)-> BB10(0.544),BB09(0.456) ( cond ) i IBC
BB10 [0032] 1 BB08 0.21 21 [03C..03D)-> BB11(1) (always) i IBC idxlen nullcheck
BB09 [0031] 1 BB08 0.18 18 [03C..03D)-> BB11(1) (always) i IBC
BB11 [0033] 2 BB09,BB10 0.39 39 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB12 [0022] 2 BB04,BB11 1 100 [03C..071)-> BB34(0.1),BB13(0.9) ( cond ) i IBC hascall gcsafe newobj
BB13 [0094] 2 BB12,BB33 9.00 900 [071..07D)-> BB16(0.48),BB15(0.52) ( cond ) i IBC loophead bwd
BB15 [0002] 1 BB13 4.68 468 [07D..080)-> BB16(1) (always) i IBC bwd
BB16 [0003] 2 BB13,BB15 9.00 900 [080..086)-> BB22(0.1),BB17(0.9) ( cond ) i IBC bwd
BB17 [0095] 1 BB16 8.10 810 [086..???)-> BB18(1) (always) IBC internal
BB18 [0004] 2 BB17,BB21 81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52) ( cond ) i IBC loophead bwd bwd-target
BB19 [0005] 1 BB18 42.12 4212 [08B..0A3)-> BB21(1) (always) i IBC idxlen bwd
BB20 [0006] 1 BB18 38.88 3888 [0A3..0B2)-> BB21(1) (always) i IBC bwd
BB21 [0007] 2 BB19,BB20 81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1) ( cond ) i IBC bwd
BB22 [0096] 2 BB16,BB21 9.00 900 [000..0E4)-> BB26(0),BB24(1) ( cond ) i IBC newobj bwd
BB24 [0050] 1 BB22 9.00 900 [000..000)-> BB27(0.00533),BB25(0.995) ( cond ) i IBC internal nullcheck bwd
BB25 [0054] 1 BB24 8.95 895 [000..000)-> BB27(1) (always) i IBC internal hascall gcsafe idxlen nullcheck bwd
BB26 [0051] 1 BB22 0 0 [000..000)-> BB27(1) (always) i IBC rare internal hascall gcsafe bwd
BB27 [0052] 3 BB24,BB25,BB26 9.00 900 [0E4..0E5)-> BB29(0.0319),BB28(0.968) ( cond ) i IBC idxlen newobj nullcheck bwd
BB28 [0066] 1 BB27 1.86 186 [0E4..0E5)-> BB30(1) (always) i IBC nullcheck bwd
BB29 [0071] 1 BB27 0.28 28 [0E4..0E5)-> BB30(1) (always) i IBC hascall gcsafe bwd
BB30 [0073] 2 BB28,BB29 9.00 900 [0E4..0FA)-> BB32(0.0319),BB31(0.968) ( cond ) i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB31 [0083] 1 BB30 1.86 186 [0F9..0FA)-> BB33(1) (always) i IBC nullcheck bwd
BB32 [0088] 1 BB30 0.28 28 [0F9..0FA)-> BB33(1) (always) i IBC hascall gcsafe bwd
BB33 [0090] 2 BB31,BB32 9.00 900 [0F9..110)-> BB13(0.9),BB34(0.1) ( cond ) i IBC newobj nullcheck bwd
BB34 [0097] 2 BB12,BB33 1 100 [110..117) (return) i IBC gcsafe
BB36 [0098] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
There's plenty of opportunity to improve fallthrough on the hot paths, but this looks reasonable. The problem is fgMoveHotJumps
is too aggressive in creating fallthrough for BB19 -> BB21
, such that once we've moved BB20
out of the way, we give up on keeping it in the loop body, so we keep pushing BB20
down the method to create fallthrough on less important paths. Here's the final layout:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091] 1 1 100 [???..???)-> BB02(1) (always) i IBC internal
BB02 [0000] 1 BB01 1 100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond ) i IBC hascall gcsafe newobj
BB04 [0020] 1 BB02 0.61 61 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB12 [0022] 2 BB04,BB11 1 100 [03C..071)-> BB34(0.1),BB13(0.9) ( cond ) i IBC hascall gcsafe newobj
BB13 [0094] 2 BB12,BB33 9.00 900 [071..07D)-> BB16(0.48),BB15(0.52) ( cond ) i IBC loophead bwd
BB15 [0002] 1 BB13 4.68 468 [07D..080)-> BB16(1) (always) i IBC bwd
BB16 [0003] 2 BB13,BB15 9.00 900 [080..086)-> BB22(0.1),BB17(0.9) ( cond ) i IBC bwd
BB17 [0095] 1 BB16 8.10 810 [086..???)-> BB18(1) (always) IBC internal
BB18 [0004] 2 BB17,BB21 81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52) ( cond ) i IBC loophead bwd bwd-target
BB19 [0005] 1 BB18 42.12 4212 [08B..0A3)-> BB21(1) (always) i IBC idxlen bwd
BB21 [0007] 2 BB19,BB20 81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1) ( cond ) i IBC bwd
BB22 [0096] 2 BB16,BB21 9.00 900 [000..0E4)-> BB26(0),BB24(1) ( cond ) i IBC newobj bwd
BB24 [0050] 1 BB22 9.00 900 [000..000)-> BB27(0.00533),BB25(0.995) ( cond ) i IBC internal nullcheck bwd
BB25 [0054] 1 BB24 8.95 895 [000..000)-> BB27(1) (always) i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0052] 3 BB24,BB25,BB26 9.00 900 [0E4..0E5)-> BB29(0.0319),BB28(0.968) ( cond ) i IBC idxlen newobj nullcheck bwd
BB28 [0066] 1 BB27 1.86 186 [0E4..0E5)-> BB30(1) (always) i IBC nullcheck bwd
BB30 [0073] 2 BB28,BB29 9.00 900 [0E4..0FA)-> BB32(0.0319),BB31(0.968) ( cond ) i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB31 [0083] 1 BB30 1.86 186 [0F9..0FA)-> BB33(1) (always) i IBC nullcheck bwd
BB33 [0090] 2 BB31,BB32 9.00 900 [0F9..110)-> BB13(0.9),BB34(0.1) ( cond ) i IBC newobj nullcheck bwd
BB34 [0097] 2 BB12,BB33 1 100 [110..117) (return) i IBC gcsafe
BB05 [0021] 1 BB02 0.39 39 [03C..03D)-> BB07(0.544),BB06(0.456) ( cond ) i IBC
BB07 [0027] 1 BB05 0.21 21 [03C..03D)-> BB08(1) (always) i IBC idxlen nullcheck
BB08 [0028] 2 BB06,BB07 0.39 39 [03C..03D)-> BB10(0.544),BB09(0.456) ( cond ) i IBC
BB10 [0032] 1 BB08 0.21 21 [03C..03D)-> BB11(1) (always) i IBC idxlen nullcheck
BB11 [0033] 2 BB09,BB10 0.39 39 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB06 [0026] 1 BB05 0.18 18 [03C..03D)-> BB08(1) (always) i IBC
BB09 [0031] 1 BB08 0.18 18 [03C..03D)-> BB11(1) (always) i IBC
BB20 [0006] 1 BB18 38.88 3888 [0A3..0B2)-> BB21(1) (always) i IBC bwd
BB29 [0071] 1 BB27 0.28 28 [0E4..0E5)-> BB30(1) (always) i IBC hascall gcsafe bwd
BB32 [0088] 1 BB30 0.28 28 [0F9..0FA)-> BB33(1) (always) i IBC hascall gcsafe bwd
BB26 [0051] 1 BB22 0 0 [000..000)-> BB27(1) (always) i IBC rare internal hascall gcsafe bwd
BB36 [0098] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Disabling the new layout with DOTNET_JitDoReversePostOrderLayout=0
doesn't do much better:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091] 1 1 100 [???..???)-> BB02(1) (always) i IBC internal
BB02 [0000] 1 BB01 1 100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond ) i IBC hascall gcsafe newobj
BB04 [0020] 1 BB02 0.61 61 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB12 [0022] 2 BB04,BB11 1 100 [03C..071)-> BB39(0.1),BB13(0.9) ( cond ) i IBC hascall gcsafe newobj
BB13 [0094] 2 BB12,BB33 9.00 900 [071..07D)-> BB16(0.48),BB15(0.52) ( cond ) i IBC loophead bwd
BB15 [0002] 1 BB13 4.68 468 [07D..080)-> BB16(1) (always) i IBC bwd
BB16 [0003] 2 BB13,BB15 9.00 900 [080..086)-> BB22(0.1),BB17(0.9) ( cond ) i IBC bwd
BB17 [0095] 1 BB16 8.10 810 [086..???)-> BB18(1) (always) IBC internal
BB18 [0004] 2 BB17,BB21 81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52) ( cond ) i IBC loophead bwd bwd-target
BB19 [0005] 1 BB18 42.12 4212 [08B..0A3)-> BB21(1) (always) i IBC idxlen bwd
BB21 [0007] 2 BB19,BB20 81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1) ( cond ) i IBC bwd
BB22 [0096] 2 BB16,BB21 9.00 900 [000..0E4)-> BB26(0),BB24(1) ( cond ) i IBC newobj bwd
BB24 [0050] 1 BB22 9.00 900 [000..000)-> BB27(0.00533),BB25(0.995) ( cond ) i IBC internal nullcheck bwd
BB25 [0054] 1 BB24 8.95 895 [000..000)-> BB27(1) (always) i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0052] 3 BB24,BB25,BB26 9.00 900 [0E4..0E5)-> BB29(0.0319),BB37(0.968) ( cond ) i IBC idxlen newobj nullcheck bwd
BB37 [0099] 1 BB27 1.86 186 [0E4..0E5)-> BB30(1) (always) i IBC nullcheck bwd
BB30 [0073] 2 BB29,BB37 9.00 900 [0E4..0FA)-> BB32(0.0319),BB38(0.968) ( cond ) i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB38 [0100] 1 BB30 1.86 186 [0F9..0FA)-> BB33(1) (always) i IBC nullcheck bwd
BB33 [0090] 2 BB32,BB38 9.00 900 [0F9..110)-> BB13(0.9),BB39(0.1) ( cond ) i IBC newobj nullcheck bwd
BB39 [0101] 2 BB12,BB33 1 100 [110..117) (return) i IBC gcsafe
BB20 [0006] 1 BB18 38.88 3888 [0A3..0B2)-> BB21(1) (always) i IBC bwd
BB05 [0021] 1 BB02 0.39 39 [03C..03D)-> BB06(0.456),BB07(0.544) ( cond ) i IBC
BB07 [0027] 1 BB05 0.21 21 [03C..03D)-> BB08(1) (always) i IBC idxlen nullcheck
BB08 [0028] 2 BB06,BB07 0.39 39 [03C..03D)-> BB09(0.456),BB10(0.544) ( cond ) i IBC
BB10 [0032] 1 BB08 0.21 21 [03C..03D)-> BB11(1) (always) i IBC idxlen nullcheck
BB11 [0033] 2 BB09,BB10 0.39 39 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB29 [0071] 1 BB27 0.28 28 [0E4..0E5)-> BB30(1) (always) i IBC hascall gcsafe bwd
BB32 [0088] 1 BB30 0.28 28 [0F9..0FA)-> BB33(1) (always) i IBC hascall gcsafe bwd
BB09 [0031] 1 BB08 0.18 18 [03C..03D)-> BB11(1) (always) i IBC
BB06 [0026] 1 BB05 0.18 18 [03C..03D)-> BB08(1) (always) i IBC
BB26 [0051] 1 BB22 0 0 [000..000)-> BB27(1) (always) i IBC rare internal hascall gcsafe bwd
BB36 [0098] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
With a .NET 10 build of my 3-opt layout prototype in #103450, the 3-opt pass makes the loop blocks contiguous again, though the ordering is weird:
Creating fallthrough for BB21 -> BB18 (cost=5832.000000, improvement=7290.000000)
Creating fallthrough for BB20 -> BB21 (cost=0.000000, improvement=3888.000000)
Creating fallthrough for BB33 -> BB13 (cost=405.000000, improvement=810.000000)
Creating fallthrough for BB31 -> BB33 (cost=0.000000, improvement=112.500000)
Creating fallthrough for BB12 -> BB34 (cost=0.000000, improvement=10.000000)
*************** Before renumbering the basic blocks
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC [IL range] [jump] [EH region] [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0106] 1 1 100 [???..???)-> BB02(1) (always) i IBC internal
BB02 [0000] 1 BB01 1 100 [000..03D)-> BB05(0.584),BB04(0.416) ( cond ) i IBC hascall gcsafe newobj
BB05 [0021] 1 BB02 0.58 58 [03C..03D)-> BB07(0.519),BB06(0.481) ( cond ) i IBC
BB07 [0027] 1 BB05 0.30 30 [03C..03D)-> BB08(1) (always) i IBC idxlen nullcheck
BB08 [0028] 2 BB06,BB07 0.58 58 [03C..03D)-> BB10(0.519),BB09(0.481) ( cond ) i IBC
BB10 [0032] 1 BB08 0.30 30 [03C..03D)-> BB11(1) (always) i IBC idxlen nullcheck
BB11 [0033] 2 BB09,BB10 0.58 58 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB12 [0022] 2 BB04,BB11 1 100 [03C..071)-> BB34(0.1),BB13(0.9) ( cond ) i IBC hascall gcsafe newobj
BB34 [0112] 2 BB12,BB33 1 100 [110..117) (return) i IBC gcsafe
BB31 [0093] 1 BB30 1.13 113 [0F9..0FA)-> BB33(1) (always) i IBC nullcheck bwd
BB33 [0100] 2 BB31,BB32 9.00 900 [0F9..110)-> BB13(0.9),BB34(0.1) ( cond ) i IBC newobj nullcheck bwd
BB13 [0109] 2 BB12,BB33 9.00 900 [071..07D)-> BB16(0.48),BB15(0.52) ( cond ) i IBC loophead bwd
BB15 [0002] 1 BB13 4.68 468 [07D..080)-> BB16(1) (always) i IBC bwd
BB16 [0003] 2 BB15,BB13 9.00 900 [080..086)-> BB22(0.1),BB17(0.9) ( cond ) i IBC bwd
BB17 [0110] 1 BB16 8.10 810 [086..???)-> BB18(1) (always) IBC internal
BB20 [0006] 1 BB18 38.88 3888 [0A3..0B2)-> BB21(1) (always) i IBC bwd
BB21 [0007] 2 BB19,BB20 81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1) ( cond ) i IBC bwd
BB18 [0004] 2 BB21,BB17 81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52) ( cond ) i IBC loophead bwd bwd-target
BB19 [0005] 1 BB18 42.12 4212 [08B..0A3)-> BB21(1) (always) i IBC idxlen bwd
BB22 [0111] 2 BB16,BB21 9.00 900 [000..0E4)-> BB26(0),BB24(1) ( cond ) i IBC newobj bwd
BB24 [0055] 1 BB22 9.00 900 [000..000)-> BB27(0.0105),BB25(0.989) ( cond ) i IBC internal nullcheck bwd
BB25 [0059] 1 BB24 8.91 891 [000..000)-> BB27(1) (always) i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0057] 3 BB24,BB26,BB25 9.00 900 [0E4..0E5)-> BB29(0.5),BB28(0.5) ( cond ) i IBC idxlen newobj nullcheck bwd
BB28 [0071] 1 BB27 1.13 113 [0E4..0E5)-> BB30(1) (always) i IBC nullcheck bwd
BB06 [0026] 1 BB05 0.28 28 [03C..03D)-> BB08(1) (always) i IBC
BB09 [0031] 1 BB08 0.28 28 [03C..03D)-> BB11(1) (always) i IBC
BB04 [0020] 1 BB02 0.42 42 [03C..03D)-> BB12(1) (always) i IBC hascall gcsafe
BB29 [0076] 1 BB27 2.25 225 [0E4..0E5)-> BB30(1) (always) i IBC hascall gcsafe bwd
BB30 [0078] 2 BB28,BB29 9.00 900 [0E4..0FA)-> BB32(0.5),BB31(0.5) ( cond ) i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB32 [0098] 1 BB30 2.25 225 [0F9..0FA)-> BB33(1) (always) i IBC hascall gcsafe bwd
BB26 [0056] 1 BB22 0 0 [000..000)-> BB27(1) (always) i IBC rare internal hascall gcsafe bwd
BB36 [0113] 0 0 [???..???) (throw ) i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Running 3-opt again, or skipping fgMoveHotJumps
beforehand, doesn't fix the branchy loop block ordering, probably because the cost model doesn't differentiate between jump types -- this example is a good motivator for developing a cost model that doesn't rely on edge weights alone.
Running 3-opt again, or skipping
fgMoveHotJumps
beforehand, doesn't fix the branchy loop block ordering, probably because the cost model doesn't differentiate between jump types -- this example is a good motivator for developing a cost model that doesn't rely on edge weights alone
By this you mean differing costs for BBJ_ALWAYS vs BBJ_COND?
Also do we know if the perf is a consequence of the new layout, or are we running into JCC errata or the like?
Also do we know if the perf is a consequence of the new layout, or are we running into JCC errata or the like?
I see multiple instances of JCC errata in both .NET 8 and .NET 9's disasms, and both incur the mitigation penalty in the method's hottest loop. .NET 8:
G_M000_IG15: ;; offset=0x0261
cmp ecx, ebx
jge G_M000_IG24
G_M000_IG16: ;; offset=0x0269
mov eax, ecx
sub eax, r15d
cmp eax, 10
jae G_M000_IG40
cmp ecx, dword ptr [rdi+0x08]
jae G_M000_IG40
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0 ; jcc erratum) 32B boundary ...............................
mov edx, ecx
movzx rdx, word ptr [rdi+2*rdx+0x0C]
mov word ptr [r14+2*rax], dx
jmp G_M000_IG25
G_M000_IG17: ;; offset=0x0291
mov rdx, gword ptr [rbp-0x68]
test rdx, rdx
jne SHORT G_M000_IG19
.NET 9 (Checked build, hence the bbWeights):
G_M61133_IG16: ;; offset=0x0240
cmp ecx, ebx
jge G_M61133_IG37
;; size=8 bbWeight=81.00 PerfScore 101.25
G_M61133_IG17: ;; offset=0x0248
mov eax, ecx
sub eax, r12d
cmp eax, 10
jae G_M61133_IG41
cmp ecx, dword ptr [r15+0x08]
jae G_M61133_IG41
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0 ; jcc erratum) 32B boundary ...............................
movzx rdx, word ptr [r15+2*rcx+0x0C]
mov word ptr [r13+2*rax], dx
;; size=36 bbWeight=42.12 PerfScore 368.55
G_M61133_IG18: ;; offset=0x026C
inc ecx
cmp ecx, edi
jl SHORT G_M61133_IG16
;; size=6 bbWeight=81.00 PerfScore 121.50
The .NET 9 layout has one more JCC erratum site overall, but it's not in a consequential place, so I'd expect the sites in the above loop to dominate the overall JCC erratum penalty. Here are the benchmark results on my AMD machine, which doesn't suffer from JCC erratum penalties:
| Method | Job | Runtime | n | Mean | Error | StdDev | Ratio | Code Size |
|------- |--------- |--------- |------ |---------:|--------:|--------:|------:|----------:|
| Run | .NET 8.0 | .NET 8.0 | 80000 | 496.9 ms | 0.46 ms | 0.41 ms | 1.00 | 10,012 B |
| Run | .NET 9.0 | .NET 9.0 | 80000 | 536.2 ms | 0.36 ms | 0.33 ms | 1.08 | 9,321 B |
@johnnygiter could you please share what CPU you ran this benchmark on? Thanks!
@amanasifkhalid, Intel Core i5-14600K, 1 CPU, 20 logical and 14 physical cores, Win 11 latest. I had similar difference on i9 14900K on other workstation,
Not only it's not impressive regarding past .net core versions (low or no improvement) but using same algo even slow Python is much much faster(over 300%), I know it's hard to compare anyway .net suffer somehow in this type of bechmark.
I wonder, where is this great improvements of BigInteger then? (works even slower than 8)