JIT: Generalized struct promotion

jakobbotsch commented 2 years ago

Description

Struct promotion (a.k.a. scalar replacement of aggregates) is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. This is a very important optimization for low-level, performance-oriented code that makes heavy use of structs, so it is important that the JIT supports it well.

Limitations

The JIT supports promotion but with the following limitations today:

Only whole structs with at most 4 fields can be promoted
Nested structs are not supported, except when the nested struct is a wrapper around a primitive type
A struct must be promoted for the full duration of the function or not at all
Structs with overlapping fields are not supported

This issue is about removing (some of) these limitations.
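
To make the list above concrete, here is a minimal Python sketch of a predicate that encodes the three per-struct limitations (field count, nesting, overlap). It is an illustrative model only, not the JIT's actual check: the `Field` and `Struct` types and the function names are hypothetical, and the "promoted for the full duration of the function" limitation is not modeled because it is a property of the function rather than of the struct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    offset: int                        # byte offset within the struct
    size: int                          # size in bytes
    nested: Optional["Struct"] = None  # None means a primitive field


@dataclass
class Struct:
    fields: List[Field]


def is_primitive_wrapper(s: Struct) -> bool:
    # A struct that wraps exactly one primitive field.
    return len(s.fields) == 1 and s.fields[0].nested is None


def has_overlap(s: Struct) -> bool:
    # Sort the half-open byte ranges and look for a range that starts
    # before the previous one ends.
    spans = sorted((f.offset, f.offset + f.size) for f in s.fields)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))


def promotable_today(s: Struct) -> bool:
    # Hypothetical encoding of the limitations listed above.
    if len(s.fields) > 4:
        return False
    if has_overlap(s):
        return False
    for f in s.fields:
        if f.nested is not None and not is_primitive_wrapper(f.nested):
            return False
    return True
```

For example, a `Nullable<int>`-shaped struct (a bool at offset 0 and an int at offset 4) passes this predicate, while a five-field struct or a union-like struct with overlapping fields does not.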

Q1 work items

[x] Initial prototype (https://github.com/dotnet/runtime/pull/83388)

Q2 work items

[x] https://github.com/dotnet/performance/pull/2991
[x] https://github.com/dotnet/runtime/pull/84645
[x] Efficient decomposition for assignments
- [x] https://github.com/dotnet/runtime/pull/85105
- [x] https://github.com/dotnet/runtime/pull/85323
[x] https://github.com/dotnet/runtime/pull/86043
[x] https://github.com/dotnet/runtime/pull/86792
[x] https://github.com/dotnet/runtime/pull/85909
[x] https://github.com/dotnet/runtime/pull/86660
[x] https://github.com/dotnet/runtime/pull/87165
[x] https://github.com/dotnet/runtime/pull/87217
[x] https://github.com/dotnet/runtime/pull/87265
[x] https://github.com/dotnet/runtime/pull/87410
[x] https://github.com/dotnet/runtime/pull/87810
[x] https://github.com/dotnet/runtime/pull/87745
[x] https://github.com/dotnet/runtime/pull/87809
[x] https://github.com/dotnet/runtime/pull/87869
[x] https://github.com/dotnet/runtime/pull/87969
[x] https://github.com/dotnet/runtime/pull/87917
[x] Investigate regressions (see https://github.com/dotnet/runtime/pull/88090#issuecomment-1612026042)
[x] https://github.com/dotnet/runtime/pull/88090

Future work items

CQ

[ ] Decomposition via arithmetic
[ ] Share backing storage when promoted locals are stack spilled
[ ] Allow small-typed mismatches
[ ] Support some bitcasts
[ ] Partial lifetime promotion
[ ] Move ABI info determination earlier so it can be utilized
[ ] Store-forwarding for call args passed in registers
[ ] Load-forwarding for calls
[ ] Store-forwarding for structs returned in registers (#86388; punted due to low impact; existing promotion takes care of it)
[ ] Custom class layouts with GC pointers (for some efficient decomposed copies)
[x] https://github.com/dotnet/runtime/issues/86711
[ ] Readback/writeback resolution
[ ] Full support for GetElement/WithElement for SIMDs
[ ] Assignment decomposition for GetElement/WithElement (https://github.com/dotnet/runtime/issues/76928#issuecomment-1582618505)

Throughput

[ ] More efficient accesses data structure
[ ] Stop tracking accesses early when we know we won't promote (e.g. due to overlapping accesses)
[x] https://github.com/dotnet/runtime/pull/87729
[x] https://github.com/dotnet/runtime/pull/87997

Related issues

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

## Description

Struct promotion is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. This is a very important optimization for low-level performance oriented code that makes heavy use of structs, so it is important that it is supported well by the JIT.

### Limitations

The JIT supports promotion but with the following limitations today:

* Only whole structs with at most 4 fields can be promoted
* Nested structs are not supported, except when the nested struct is a wrapper around a primitive type
* A struct must be promoted for the full duration of the function or not at all
* Structs with overlapping fields are not supported

This issue is about removing (some of) these limitations.

### Plan

The preliminary idea is to introduce a new pass that replaces struct fields by new local variables and the "whole struct value" by the reassembling of the promoted fields and the residual fields. The pass will need the proper heuristics to figure out which fields to promote (depending on the contexts in which they are used), and potentially in which parts of the function (e.g. due to being address exposed on some paths). It is likely that some form of struct liveness will be needed by this pass, and the hope is that the liveness pass from #76069 will be beneficial here as well.

One difficulty is in the representation of multi-reg args and returns at the ABI boundaries. Today they more or less "fall out" from the whole-promotion representation by using the parent struct local as the use/def. A new representation will likely be needed if structs no longer need to be entirely promoted. Initially I expect we can piggyback on the existing mechanism to get to a working prototype; as a long-term goal, however, it would be nice to replace the existing mechanism entirely.

Author:	jakobbotsch
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

jakobbotsch commented 2 years ago

cc @dotnet/jit-contrib

jakobbotsch commented 1 year ago

The kinds of struct locals we can promote appear only as operands of GT_ASG, GT_CALL and GT_RETURN. For GT_ASG we should be able to decompose directly as part of the generalized promotion pass. That leaves GT_CALL and GT_RETURN. My thinking for a while was to introduce a new node that represented the assembling of a local and its constituent fields, similar to GT_FIELD_LIST. However, it would need an equivalent to appear on the LHS of assignments from calls, and of course downstream passes would need to be taught to handle this. Also, these nodes would only be necessary on platforms with multi-reg args/returns.

After thinking some more I've come back to another idea, which I think I will investigate. Instead of introducing a new node in HIR that persists until lowering, we introduce the equivalent of GT_PUTARG_REG for GT_CALL results and GT_RETURN. That is, we would add a node GT_GETRES_REG that represents using one of the register results of a GT_CALL node, and a node GT_PUTRET_REG that represents placing a value into one of the return registers. LSRA would get similar handling for these nodes as it has for GT_PUTARG_REG today (essentially, these registers are busy until they are used, for calls, or until a GT_RETURN is encountered).

For the generalized struct promotion pass we would then only introduce new assignments: to pass one of these struct locals to a call, or to return one of them, it will just write all the constituent fields back into the struct local and leave the local in the call argument/GT_RETURN.

A priori this would just amount to a lot of unnecessary copying in the generated code. To get to an acceptable state, we would introduce an optimization in lowering to handle these patterns without copying. For example, after generalized struct promotion + rationalization, we would have LIR like:

STORE_LCL_FLD V00 [0..8), <some operand, usually a GT_LCL_VAR>
STORE_LCL_FLD V00 [8..16), <some operand 2>
t0 = LCL_VAR V00
CALL t0

We would do some analysis to figure out whether V00 is dead after this call (potentially even more precisely, whether V00 [0..8)/V00 [8..16) are dead). If yes, then we would transform this into

t0 = PUTARG_REG <some operand>
t1 = PUTARG_REG <some operand 2>
CALL t0, t1 // with FIELD_LIST or however the representation is today
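
The argument-side rewrite above can be modeled with a small Python sketch. This is a toy model under stated assumptions, not JIT code: the node classes (`StoreLclFld`, `CallWithStruct`, `PutArgReg`, `CallWithRegs`) and the `dead_after_call` set are hypothetical stand-ins for LIR nodes and for whatever liveness analysis lowering ends up using, and the sketch does not verify that the stores cover the whole struct.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass(frozen=True)
class StoreLclFld:
    lcl: str    # struct local being written, e.g. "V00"
    start: int  # half-open byte range [start, end)
    end: int
    value: str  # operand being stored, e.g. "V01"


@dataclass(frozen=True)
class CallWithStruct:
    lcl: str    # struct local passed by value to the call


@dataclass(frozen=True)
class PutArgReg:
    value: str


@dataclass(frozen=True)
class CallWithRegs:
    args: Tuple[PutArgReg, ...]


Node = Union[StoreLclFld, CallWithStruct, CallWithRegs]


def lower_struct_arg(block: List[Node], dead_after_call: Set[str]) -> List[Node]:
    # Forward the field stores that immediately precede a call into
    # register arguments, but only when the struct local is known to be
    # dead after the call (assumed to come from a liveness analysis).
    out: List[Node] = []
    for node in block:
        if isinstance(node, CallWithStruct) and node.lcl in dead_after_call:
            stores = []
            while out and isinstance(out[-1], StoreLclFld) and out[-1].lcl == node.lcl:
                stores.append(out.pop())
            if stores:
                stores.reverse()  # restore original store order
                out.append(CallWithRegs(tuple(PutArgReg(s.value) for s in stores)))
                continue
        out.append(node)
    return out
```

When the local is not in `dead_after_call`, the block is returned unchanged, which corresponds to keeping the copies through the struct local.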

Similarly for call results, e.g. we would get LIR like the following for generalized promotion after rationalization:

STORE_LCL_VAR V00 (CALL abc)
STORE_LCL_VAR V01 (LCL_FLD V00 [0..8))
STORE_LCL_VAR V02 (LCL_FLD V00 [8..16))

and, as an optimization, lower it into:

CALL abc
STORE_LCL_VAR V01 (GETRES_REG rax)
STORE_LCL_VAR V02 (GETRES_REG rdx)
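
The call-result rewrite can be sketched the same way in Python. Again this is only an illustrative model: the tuple encoding of the nodes, the `RETURN_REGS` table (taken from the `rax`/`rdx` example above), and the `dead_structs` set are assumptions made for the sketch, not anything the JIT defines.

```python
from typing import List, Set, Tuple

# Offset within the returned struct -> return register, as in the example above.
RETURN_REGS = {0: "rax", 8: "rdx"}


def lower_call_result(block: List[Tuple], dead_structs: Set[str]) -> List[Tuple]:
    # Input nodes (hypothetical encoding):
    #   ("STORE_CALL", struct_lcl, callee)            struct_lcl = CALL callee
    #   ("STORE_FIELD", dst_lcl, struct_lcl, offset)  dst_lcl = LCL_FLD struct_lcl [offset..)
    # Output nodes:
    #   ("CALL", callee)
    #   ("STORE_GETRES_REG", dst_lcl, register)
    out: List[Tuple] = []
    i = 0
    while i < len(block):
        node = block[i]
        if node[0] == "STORE_CALL" and node[1] in dead_structs:
            _, struct_lcl, callee = node
            j = i + 1
            reads = []
            # Collect the field reads that directly follow the call.
            while (j < len(block) and block[j][0] == "STORE_FIELD"
                   and block[j][2] == struct_lcl and block[j][3] in RETURN_REGS):
                reads.append(block[j])
                j += 1
            if reads:
                out.append(("CALL", callee))
                for _, dst, _, offset in reads:
                    out.append(("STORE_GETRES_REG", dst, RETURN_REGS[offset]))
                i = j
                continue
        out.append(node)
        i += 1
    return out
```

The `dead_structs` check stands in for the requirement that the intermediate struct local has no other uses; without it, dropping the store to the struct local would not be safe.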

Some questions to investigate:

How bad would these IR patterns be for throughput? Certainly introducing these assignments is bloating the IR somewhat (although this seems no different than the standard FIELD_LIST transformation we do for our normal promotion).
Can the middle-end optimization passes cope with these IR patterns? They would see tracked locals being stored into some arbitrary struct local, and then that struct local being passed to a call/returned. I'm not sure if this would lose us opportunities we have with whole struct promotion today.
How do we do the analysis in lowering? Presumably it would need to be some general kind of struct liveness. For prototyping we can instead create new struct locals for every call site and utilize ref counts to do this.

tannergooding commented 1 year ago

> The kinds of struct locals we can promote appear only as operands of GT_ASG, GT_CALL and GT_RETURN

What about for GT_HWINTRINSIC, particularly in the case of things like struct M2x4 { Vector4 X; Vector4 Y; } or similar?

Same for cases like GT_INTRINSIC which were originally calls and which may become calls again in some cases (e.g. Math.Pow if constant folding can't happen).

jakobbotsch commented 1 year ago

> What about for GT_HWINTRINSIC, particularly in the case of things like struct M2x4 { Vector4 X; Vector4 Y; } or similar?

What particular intrinsics take arbitrary struct arguments? Can they be decomposed early like ASG would be?

SingleAccretion commented 1 year ago

There is https://github.com/dotnet/runtime/pull/80297, where we are handling an intrinsic that essentially has a "multi-reg arg" via early decomposition into a FIELD_LIST.

tannergooding commented 1 year ago

Single beat me to it, that was the example I was going to give 😄

jakobbotsch commented 1 year ago

I don't see why the existing approach there wouldn't continue to work. The representation for call args here could also be GT_FIELD_LIST but it would require very early ABI handling for the struct args that I am not a fan of.

jakobbotsch commented 1 year ago

A similar node would be needed for parameters. Generalized promotion would create IR at the start of functions to load (parts of) parameters into the promoted field locals, and lowering would optimize these into some GT_GETPARAM_REG node when possible. Then we would likely use the same source of liveness when homing parameters to figure out whether we can avoid homing some of the struct parameter.

jakobbotsch commented 1 year ago

Some measurements over asp.net for block copies/inits and whether they involve promoted structs:

Copies physical -> physical: 3
Copies physical -> old:      283
Copies old      -> physical: 250
Copies physical ->         : 65
Copies          -> physical: 268
Inits           -> physical: 37

("old" means structs that are promoted by the normal mechanism)

It would be great to reuse block morphing to do the decomposition, but I'm not sure how simple that would be -- the decomposition for copies involving physically promoted structs is quite a bit more complicated.

Same measurements with old promotion disabled:

Copies physical -> physical: 162
Copies physical -> old:      0
Copies old      -> physical: 0
Copies physical ->         : 1332
Copies          -> physical: 6034
Inits           -> physical: 99

jakobbotsch commented 1 year ago

We frequently see promotion opportunities for standard C# code iterating lists via List<T>.Enumerator, e.g.: https://github.com/dotnet/aspnetcore/blob/8968058c9e5fdfdd1242426a03dc80609997edab/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelConnection.cs#L51-L54

The codegen for the loop ends up with the following diff:

@@ -1,35 +1,33 @@
 G_M45156_IG09:
-       mov      rax, gword ptr [rbp-28H]
-       mov      rdx, gword ptr [rbp-20H]
+       mov      rax, gword ptr [rbp-38H]
+       mov      rdx, gword ptr [rbp-30H]
        mov      rcx, gword ptr [rax+08H]
        call     [rax+18H]System.Action`1[System.__Canon]:Invoke(System.__Canon):this
                        ;; size=15 bbWeight=1 PerfScore 7.00
 G_M45156_IG10:
-       mov      rcx, gword ptr [rbp-38H]
-       mov      esi, dword ptr [rbp-2CH]
-       mov      edi, dword ptr [rcx+14H]
-       cmp      esi, edi
+       mov      rcx, rsi
+       mov      r14d, dword ptr [rcx+14H]
+       cmp      edi, r14d
        jne      SHORT G_M45156_IG14
-       mov      edx, dword ptr [rbp-30H]
-       cmp      edx, dword ptr [rcx+10H]
+       cmp      ebx, dword ptr [rsi+10H]
        jae      SHORT G_M45156_IG15
-                       ;; size=22 bbWeight=2 PerfScore 20.50
+                       ;; size=17 bbWeight=2 PerfScore 15.00
 G_M45156_IG11:
-       mov      rcx, gword ptr [rcx+08H]
-       mov      eax, edx
-       cmp      eax, dword ptr [rcx+08H]
+       mov      rcx, gword ptr [rsi+08H]
+       cmp      ebx, dword ptr [rcx+08H]
        jae      SHORT G_M45156_IG08
-       shl      rax, 4
+       mov      edx, ebx
+       shl      rdx, 4
                        ;; size=15 bbWeight=1 PerfScore 6.75
 G_M45156_IG12:
-       vmovdqu  xmm0, xmmword ptr [rcx+rax+10H]
-       vmovdqu  xmmword ptr [rbp-28H], xmm0
+       vmovdqu  xmm0, xmmword ptr [rcx+rdx+10H]
+       vmovdqu  xmmword ptr [rbp-38H], xmm0
                        ;; size=11 bbWeight=1 PerfScore 5.00
 G_M45156_IG13:
-       inc      edx
-       mov      dword ptr [rbp-30H], edx
+       inc      ebx
        jmp      SHORT G_M45156_IG09
-                       ;; size=7 bbWeight=1 PerfScore 3.25
+                       ;; size=4 bbWeight=1 PerfScore 2.25
 G_M45156_IG14:
-       cmp      esi, edi
+       cmp      edi, r14d
        jne      SHORT G_M45156_IG07

jakobbotsch commented 1 year ago

Investigating some current causes of regressions when enabling physical promotion by default.

(edit: handled by #87265)

aspnet.run.windows.x64.checked.mch:

+37 (+14.57%) : 18820.dasm - System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this

```diff
@@ -14,40 +14,44 @@
 ;* V04 loc3 [V04 ] ( 0, 0 ) int -> zero-ref
 ; V05 loc4 [V05,T04] ( 3, 6 ) int -> rcx
 ; V06 OutArgs [V06 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V07 tmp1 [V07,T06] ( 5, 5 ) struct (32) [rsp+28H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+; V07 tmp1 [V07,T06] ( 4, 4 ) struct (32) [rsp+48H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
 ;* V08 tmp2 [V08 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V09 tmp3 [V09,T10] ( 2, 2 ) ref -> rax class-hnd single-def "Inlining Arg"
+; V09 tmp3 [V09,T11] ( 2, 2 ) ref -> rax class-hnd single-def "Inlining Arg"
 ;* V10 tmp4 [V10 ] ( 0, 0 ) struct (24) zero-ref "Inlining Arg"
-;* V11 tmp5 [V11 ] ( 0, 0 ) struct (32) zero-ref do-not-enreg[S] "Inlining Arg"
-; V12 tmp6 [V12,T11] ( 2, 1 ) ref -> rdx single-def V10.k__BackingField(offs=0x00) P-INDEP "field V10.k__BackingField (fldOffset=0x0)"
-; V13 tmp7 [V13,T12] ( 2, 1 ) ref -> rcx single-def V10.k__BackingField(offs=0x08) P-INDEP "field V10.k__BackingField (fldOffset=0x8)"
-; V14 tmp8 [V14,T13] ( 2, 1 ) ref -> r8 single-def V10.k__BackingField(offs=0x10) P-INDEP "field V10.k__BackingField (fldOffset=0x10)"
-; V15 cse0 [V15,T07] ( 2, 4 ) int -> rax "CSE - aggressive"
-;* V16 rat0 [V16,T09] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
-;* V17 rat1 [V17,T14] ( 0, 0 ) long -> zero-ref "runtime lookup"
-;* V18 rat2 [V18,T08] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
-; V19 rat3 [V19,T05] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
+; V11 tmp5 [V11,T08] ( 3, 3 ) struct (32) [rsp+28H] do-not-enreg[S] must-init "Inlining Arg"
+; V12 tmp6 [V12,T12] ( 2, 1 ) ref -> rdx single-def V10.k__BackingField(offs=0x00) P-INDEP "field V10.k__BackingField (fldOffset=0x0)"
+; V13 tmp7 [V13,T13] ( 2, 1 ) ref -> rcx single-def V10.k__BackingField(offs=0x08) P-INDEP "field V10.k__BackingField (fldOffset=0x8)"
+; V14 tmp8 [V14,T14] ( 2, 1 ) ref -> r8 single-def V10.k__BackingField(offs=0x10) P-INDEP "field V10.k__BackingField (fldOffset=0x10)"
+;* V15 tmp9 [V15 ] ( 0, 0 ) ref -> zero-ref single-def "V07.[000..008)"
+; V16 cse0 [V16,T07] ( 2, 4 ) int -> rax "CSE - aggressive"
+;* V17 rat0 [V17,T10] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
+;* V18 rat1 [V18,T15] ( 0, 0 ) long -> zero-ref "runtime lookup"
+;* V19 rat2 [V19,T09] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
+; V20 rat3 [V20,T05] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
 ;
-; Lcl frame size = 72
+; Lcl frame size = 104
 G_M47209_IG01: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
        push rdi
        push rsi
        push rbp
        push rbx
-       sub rsp, 72
+       sub rsp, 104
+       vzeroupper
        xor eax, eax
        mov qword ptr [rsp+28H], rax
        vxorps xmm4, xmm4, xmm4
        vmovdqa xmmword ptr [rsp+30H], xmm4
-       mov qword ptr [rsp+40H], rax
+       vmovdqa xmmword ptr [rsp+40H], xmm4
+       vmovdqa xmmword ptr [rsp+50H], xmm4
+       mov qword ptr [rsp+60H], rax
        mov rbx, rcx
                        ; gcrRegs +[rbx]
-                       ;; size=33 bbWeight=1 PerfScore 9.08
+                       ;; size=48 bbWeight=1 PerfScore 14.08
 G_M47209_IG02: ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov edx, dword ptr [rbx+24H]
        cmp edx, 2
-       ja G_M47209_IG08
+       ja G_M47209_IG10
        lea rcx, [reloc @RWD00]
        mov ecx, dword ptr [rcx+4*rdx]
        lea rax, G_M47209_IG02
@@ -66,7 +70,7 @@ G_M47209_IG03: ; bbWeight=0.50, gcrefRegs=0008 {rbx}, byrefRegs=0000 {},
                        ; byrRegs -[rcx]
        mov dword ptr [rbx+20H], -1
                        ;; size=28 bbWeight=0.50 PerfScore 4.25
-G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
+G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
        mov rdx, gword ptr [rbx+10H]
                        ; gcrRegs +[rdx]
        mov ecx, dword ptr [rbx+20H]
@@ -74,7 +78,7 @@ G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
        mov dword ptr [rbx+20H], ecx
        mov eax, dword ptr [rdx+08H]
        cmp eax, ecx
-       jbe SHORT G_M47209_IG08
+       jbe G_M47209_IG10
        mov rdx, gword ptr [rdx+8*rcx+10H]
        lea rcx, bword ptr [rbx+18H]
                        ; byrRegs +[rcx]
@@ -82,7 +86,7 @@ G_M47209_IG04: ; bbWeight=2, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byr
                        ; gcrRegs -[rdx]
                        ; byrRegs -[rcx]
        mov dword ptr [rbx+24H], 2
-                       ;; size=40 bbWeight=2 PerfScore 26.00
+                       ;; size=44 bbWeight=2 PerfScore 26.00
 G_M47209_IG05: ; bbWeight=4, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
        mov rbp, gword ptr [rbx+18H]
                        ; gcrRegs +[rbp]
@@ -98,10 +102,16 @@ G_M47209_IG06: ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
                        ; gcrRegs +[rcx]
        mov r8, gword ptr [rbp+30H]
                        ; gcrRegs +[r8]
+       mov gword ptr [rsp+50H], rdx
+       mov gword ptr [rsp+58H], rcx
+       mov gword ptr [rsp+60H], r8
+                       ;; size=31 bbWeight=0.50 PerfScore 5.50
+G_M47209_IG07: ; bbWeight=0.50, nogc, extend
+       vmovdqu ymm0, ymmword ptr [rsp+48H]
+       vmovdqu ymmword ptr [rsp+28H], ymm0
+                       ;; size=12 bbWeight=0.50 PerfScore 2.50
+G_M47209_IG08: ; bbWeight=0.50, extend
        mov gword ptr [rsp+28H], rax
-       mov gword ptr [rsp+30H], rdx
-       mov gword ptr [rsp+38H], rcx
-       mov gword ptr [rsp+40H], r8
        lea rdi, bword ptr [rbx+28H]
                        ; byrRegs +[rdi]
        lea rsi, bword ptr [rsp+28H]
@@ -119,33 +129,35 @@ G_M47209_IG06: ; bbWeight=0.50, gcrefRegs=0028 {rbx rbp}, byrefRegs=0000
                        ; gcrRegs -[rdx rbp]
                        ; byrRegs -[rcx rsi rdi]
        mov eax, 1
-                       ;; size=83 bbWeight=0.50 PerfScore 10.38
-G_M47209_IG07: ; bbWeight=0.50, epilog, nogc, extend
-       add rsp, 72
+                       ;; size=52 bbWeight=0.50 PerfScore 4.88
+G_M47209_IG09: ; bbWeight=0.50, epilog, nogc, extend
+       vzeroupper
+       add rsp, 104
        pop rbx
        pop rbp
        pop rsi
        pop rdi
        ret
-                       ;; size=9 bbWeight=0.50 PerfScore 1.62
-G_M47209_IG08: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
+                       ;; size=12 bbWeight=0.50 PerfScore 2.12
+G_M47209_IG10: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
        mov dword ptr [rbx+24H], 3
        xor eax, eax
                        ;; size=9 bbWeight=0.50 PerfScore 0.62
-G_M47209_IG09: ; bbWeight=0.50, epilog, nogc, extend
-       add rsp, 72
+G_M47209_IG11: ; bbWeight=0.50, epilog, nogc, extend
+       vzeroupper
+       add rsp, 104
        pop rbx
        pop rbp
        pop rsi
        pop rdi
        ret
-                       ;; size=9 bbWeight=0.50 PerfScore 1.62
+                       ;; size=12 bbWeight=0.50 PerfScore 2.12
 RWD00 dd G_M47209_IG03 - G_M47209_IG02
       dd G_M47209_IG04 - G_M47209_IG02
       dd G_M47209_IG05 - G_M47209_IG02
-; Total bytes of code 254, prolog size 33, PerfScore 100.98, instruction count 71, allocated bytes for code 254 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
+; Total bytes of code 291, prolog size 48, PerfScore 113.18, instruction count 78, allocated bytes for code 291 (MethodHash=bed44796) for method System.Collections.Concurrent.ConcurrentDictionary`2+Enumerator[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]:MoveNext():bool:this
 ; ============================================================
```

Promotions:

```scala
Accesses for V07
  ref @ 000
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

  [000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
    #: (1, 100)
    # assigned from: (1, 100)
    # assigned to: (0, 0)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

  [008..032) as Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

Accesses for V11
  [000..032) as System.Collections.Generic.KeyValuePair`2[System.__Canon,Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher+ChangeTokenInfo]
    #: (2, 200)
    # assigned from: (1, 100)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)

Picking promotions for V07
  Evaluating access ref @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement

lvaGrabTemp returning 15 (V15 tmp9) (a long lifetime temp) called for V07.[000..008).
V07 promoted with 1 replacements
  [000..008) promoted as ref V15
Computing unpromoted remainder for V07
  Remainder: [008..032)
```

We end up with the following decomposition:

```scala
STMT00025 ( 0x096[--] ... ??? )
[000107] DA---------  ▌  STORE_LCL_VAR struct V11 tmp5
[000036] -----------  └──▌  LCL_VAR struct V07 tmp1 (last use)
Processing block operation [000107] that involves replacements
  dst+000 <- V15 (V07.[000..008)) (last use)
  Remainder: [008..032)
  => Remainder strategy: retain a full block op
Local V11 should not be enregistered because: was accessed as a local field
New statement:
STMT00025 ( 0x096[--] ... ??? )
[000112] -A---------  ▌  COMMA void
[000107] DA---------  ├──▌  STORE_LCL_VAR struct V11 tmp5
[000036] -----------  │  └──▌  LCL_VAR struct V07 tmp1
[000111] UA---------  └──▌  STORE_LCL_FLD ref V11 tmp5 [+0]
[000110] -----------     └──▌  LCL_VAR ref V15 tmp9 (last use)
```

However, after `STMT00025` there was a last use of `V11` which we then are no longer able to forward sub:

```diff
- [000107]: [000104] is last use of [000107] (V11) -- fwd subbing [000036]; new next stmt is
-STMT00024 ( INL02 @ 0x000[E-] ... ??? ) <- INLRT @ 0x096[--]
- [000106] nA-XG------  ▌  STORE_BLK struct (copy)
- [000105] ---X-------  ├──▌  FIELD_ADDR byref :
- [000020] -----------  │  └──▌  LCL_VAR ref V00 this
- [000036] -----------  └──▌  LCL_VAR struct V07 tmp1 (last use)
-
-removing useless STMT00025 ( 0x096[--] ... ??? )
- [000107] DA---------  ▌  STORE_LCL_VAR struct V11 tmp5
- [000036] -----------  └──▌  LCL_VAR struct V07 tmp1 (last use)
- from BB07
```

It would be possible to look ahead to try to predict this situation and then handle the store by writing back to V07 ahead of it instead. Alternatively we could also run forward sub before physical promotion.
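
As a rough illustration of the two steps visible in the promotion dump above ("Picking promotions" and "Computing unpromoted remainder"), here is a minimal Python sketch. It rests on assumptions: the decision rule is taken to be a plain comparison of the two printed costs, which is consistent with the `Cost with: 50` / `Cost without: 300` entry but is not confirmed to be the JIT's full heuristic, and the remainder computation treats the struct as a flat byte range and ignores padding. The function names are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def should_promote(cost_with: int, cost_without: int) -> bool:
    # Assumed rule: promote when the estimated cost with the replacement
    # is lower than the estimated cost without it.
    return cost_with < cost_without


def unpromoted_remainder(struct_size: int, replacements: List[Range]) -> List[Range]:
    # Walk the promoted ranges in offset order and collect the gaps
    # between them; those gaps are what is left in the struct local.
    gaps: List[Range] = []
    pos = 0
    for start, end in sorted(replacements):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < struct_size:
        gaps.append((pos, struct_size))
    return gaps
```

With the numbers from the dump, `should_promote(50, 300)` accepts the `ref @ 000` replacement for `V07`, and `unpromoted_remainder(32, [(0, 8)])` yields the single range `(8, 32)`, matching the printed `Remainder: [008..032)`.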

jakobbotsch commented 1 year ago

(edit: partially handled by #87217, rest will be handled by #87410)

+21 (+18.10%) : 90439.dasm - Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int

```diff @@ -7,79 +7,88 @@ ; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data ; Final local variable assignments ; -; V00 arg0 [V00,T00] ( 9, 27 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def -; V01 loc0 [V01,T02] ( 4, 9 ) int -> rcx -;* V02 loc1 [V02 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op +; V00 arg0 [V00,T01] ( 5, 11 ) struct ( 8) [rsp+40H] do-not-enreg[SF] ld-addr-op single-def +; V01 loc0 [V01,T06] ( 4, 9 ) int -> r8 +; V02 loc1 [V02 ] ( 4, 14 ) struct ( 8) [rsp+30H] do-not-enreg[SF] must-init ld-addr-op ;* V03 loc2 [V03 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op ; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V05 tmp1 [V05 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp" -;* V06 tmp2 [V06 ] ( 0, 0 ) struct ( 8) zero-ref +; V06 tmp2 [V06 ] ( 5, 14 ) struct ( 8) [rsp+28H] do-not-enreg[SF] must-init ;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "Inlining Arg" -; V08 tmp4 [V08,T05] ( 2, 8 ) bool -> rax V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)" -; V09 tmp5 [V09,T06] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)" +; V08 tmp4 [V08,T03] ( 3, 12 ) bool -> [rsp+30H] do-not-enreg[] V02.hasValue(offs=0x00) P-DEP "field V02.hasValue (fldOffset=0x0)" +; V09 tmp5 [V09,T08] ( 2, 6 ) int -> [rsp+34H] do-not-enreg[] V02.value(offs=0x04) P-DEP "field V02.value (fldOffset=0x4)" ;* V10 tmp6 [V10 ] ( 0, 0 ) bool -> zero-ref V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)" ;* V11 tmp7 [V11 ] ( 0, 0 ) int -> zero-ref V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)" -;* V12 tmp8 [V12,T08] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)" -; V13 tmp9 [V13,T07] ( 2, 4 ) int -> r9 V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)" -; V14 tmp10 [V14,T03] ( 3, 8 ) bool -> r8 V06.hasValue(offs=0x00) P-INDEP "field 
V06.hasValue (fldOffset=0x0)" -; V15 tmp11 [V15,T04] ( 3, 8 ) int -> r9 V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)" -; V16 rat0 [V16,T01] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable" +;* V12 tmp8 [V12,T10] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)" +; V13 tmp9 [V13,T09] ( 2, 4 ) int -> rdx V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)" +; V14 tmp10 [V14,T02] ( 4, 12 ) bool -> [rsp+28H] do-not-enreg[] V06.hasValue(offs=0x00) P-DEP "field V06.hasValue (fldOffset=0x0)" +; V15 tmp11 [V15,T07] ( 3, 8 ) int -> [rsp+2CH] do-not-enreg[] V06.value(offs=0x04) P-DEP "field V06.value (fldOffset=0x4)" +; V16 tmp12 [V16,T00] ( 5, 14 ) bool -> rcx "V00.[000..001)" +; V17 cse0 [V17,T04] ( 3, 12 ) int -> rax "CSE - aggressive" +; V18 rat0 [V18,T05] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable" ; -; Lcl frame size = 40 +; Lcl frame size = 56 G_M24602_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG - sub rsp, 40 - mov qword ptr [rsp+30H], rcx - ;; size=9 bbWeight=1 PerfScore 1.25 + sub rsp, 56 + xor eax, eax + mov qword ptr [rsp+30H], rax + mov qword ptr [rsp+28H], rax + mov qword ptr [rsp+40H], rcx + ;; size=21 bbWeight=1 PerfScore 3.50 G_M24602_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - cmp byte ptr [rsp+30H], 0 + movzx rcx, byte ptr [rsp+40H] + test ecx, ecx jne SHORT G_M24602_IG05 - ;; size=7 bbWeight=1 PerfScore 3.00 + ;; size=9 bbWeight=1 PerfScore 2.25 G_M24602_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref xor eax, eax ;; size=2 bbWeight=0.50 PerfScore 0.12 G_M24602_IG04: ; bbWeight=0.50, epilog, nogc, extend - add rsp, 40 + add rsp, 56 ret ;; size=5 bbWeight=0.50 PerfScore 0.62 G_M24602_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref - xor ecx, ecx - ;; size=2 bbWeight=0.50 PerfScore 0.12 
-G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - movzx rax, byte ptr [rsp+30H] - mov edx, dword ptr [rsp+34H] - test al, al - jne SHORT G_M24602_IG08 - ;; size=13 bbWeight=4 PerfScore 13.00 -G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz xor r8d, r8d - xor r9d, r9d - jmp SHORT G_M24602_IG09 - ;; size=8 bbWeight=2 PerfScore 5.00 -G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref - mov r8d, 0xD1FFAB1E - mov eax, r8d - imul edx:eax, edx - mov r9d, edx - shr r9d, 31 - sar edx, 2 - add r9d, edx - mov r8d, 1 - ;; size=30 bbWeight=2 PerfScore 10.50 -G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - mov byte ptr [rsp+30H], r8b - mov dword ptr [rsp+34H], r9d - inc ecx + ;; size=3 bbWeight=0.50 PerfScore 0.12 +G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz + mov byte ptr [rsp+30H], cl + mov eax, dword ptr [rsp+44H] + mov dword ptr [rsp+34H], eax cmp byte ptr [rsp+30H], 0 + jne SHORT G_M24602_IG08 + ;; size=19 bbWeight=4 PerfScore 24.00 +G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz + xor eax, eax + mov qword ptr [rsp+28H], rax + jmp SHORT G_M24602_IG09 + ;; size=9 bbWeight=2 PerfScore 6.50 +G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref + mov edx, 0xD1FFAB1E + mov eax, edx + imul edx:eax, dword ptr [rsp+34H] + mov ecx, edx + shr ecx, 31 + sar edx, 2 + add edx, ecx + mov byte ptr [rsp+28H], 1 + mov dword ptr [rsp+2CH], edx + ;; size=30 bbWeight=2 PerfScore 18.00 +G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz + movzx rcx, byte ptr [rsp+28H] + mov eax, dword ptr [rsp+2CH] + mov dword ptr [rsp+44H], eax + inc r8d + test ecx, ecx je SHORT G_M24602_IG12 - cmp dword ptr [rsp+34H], 0 + test eax, eax jg SHORT G_M24602_IG06 - ;; size=26 bbWeight=4 PerfScore 33.00 + ;; size=24 bbWeight=4 PerfScore 23.00 G_M24602_IG10: ; bbWeight=0.50, gcrefRegs=0000 {}, 
byrefRegs=0000 {}, byref - mov eax, ecx - ;; size=2 bbWeight=0.50 PerfScore 0.12 + mov eax, r8d + ;; size=3 bbWeight=0.50 PerfScore 0.12 G_M24602_IG11: ; bbWeight=0.50, epilog, nogc, extend - add rsp, 40 + add rsp, 56 ret ;; size=5 bbWeight=0.50 PerfScore 0.62 G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref @@ -88,7 +97,7 @@ G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 { int3 ;; size=7 bbWeight=0 PerfScore 0.00 -; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int +; Total bytes of code 137, prolog size 21, PerfScore 92.58, instruction count 42, allocated bytes for code 137 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int ; ============================================================ ``` Promotions: ```scala Accesses for V00 bool @ 000 #: (2, 200) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) [000..008) as System.Nullable`1[int] #: (2, 200) # assigned from: (1, 100) # assigned to: (1, 100) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) int @ 004 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) Picking promotions for V00 Evaluating access bool @ 000 Single write-back cost: 3 Write backs: 0 Read backs: 100 Cost with: 400 Cost without: 600 Promoting replacement lvaGrabTemp returning 16 (V16 tmp12) (a long lifetime temp) called for V00.[000..001). 
Evaluating access int @ 004
  Single write-back cost: 3
  Write backs: 0
  Read backs: 100
  Cost with: 350
  Cost without: 300
  Disqualifying replacement
V00 promoted with 1 replacements
  [000..001) promoted as bool V16
Computing unpromoted remainder for V00
  Remainder: [004..008)
```

Two problems and a comment here:

1. Promoting `V00.[004..008)` would be very beneficial because the assignments that `V00` is used in are to another struct with a correspondingly promoted field, e.g.:

```scala
STMT00003 ( 0x00D[E-] ... 0x00E )
[000009] DA--------- ▌ STORE_LCL_VAR struct(P) V02 loc1
                     ▌   bool V02.:hasValue (offs=0x00) -> V08 tmp4
                     ▌   int V02.:value (offs=0x04) -> V09 tmp5
[000008] ----------- └──▌ LCL_VAR struct V00 arg0 (last use)
```

The extra promotion would allow much cleaner decomposition. We could do something simple and assume that overlapping struct assignments have their cost decreased a bit by promoting fields; we could also do something smarter and track all assigned locals in a union-find data structure, which would allow us to query the sets of structs for which it would be smart to promote fields together.

2. We are missing handling in decomposition when copying between a physically promoted remainder and a field of a regularly promoted struct:

```scala
Processing block operation [000009] that involves replacements
  V08 (field V02.hasValue (fldOffset=0x0)) <- V16 (V00.[000..001)) (last use)
  Remainder: [004..008)
  => Remainder strategy: int at +004
Local V00 should not be enregistered because: was accessed as a local field
Local V02 should not be enregistered because: was accessed as a local field
New statement:
STMT00003 ( 0x00D[E-] ...
0x00E ) [000090] -A--------- ▌ COMMA void [000087] DA--------- ├──▌ STORE_LCL_VAR bool V08 tmp4 [000086] ----------- │ └──▌ LCL_VAR bool V16 tmp12 (last use) [000089] UA--------- └──▌ STORE_LCL_FLD int (P) V02 loc1 [+4] ▌ bool V02.:hasValue (offs=0x00) -> V08 tmp4 ▌ int V02.:value (offs=0x04) -> V09 tmp5 [000088] ----------- └──▌ LCL_FLD int V00 arg0 [+4] ``` The expected decomposition should be to `V09` directly. Should be an easy fix. (edit: handled by #87217) 3. We do not regularly promote `V00` because it is a parameter whose field does not fit cleanly into the register it is passed in. If we do force physical promotion to promote `V00.[004..008)` then we end up with: ```diff @@ -7,98 +7,93 @@ ; 0 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data ; Final local variable assignments ; -; V00 arg0 [V00,T00] ( 9, 27 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def -; V01 loc0 [V01,T02] ( 4, 9 ) int -> rcx +; V00 arg0 [V00,T06] ( 4, 4 ) struct ( 8) [rsp+30H] do-not-enreg[SF] ld-addr-op single-def +; V01 loc0 [V01,T03] ( 4, 9 ) int -> r8 ;* V02 loc1 [V02 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op ;* V03 loc2 [V03 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op ; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V05 tmp1 [V05 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "NewObj constructor temp" ;* V06 tmp2 [V06 ] ( 0, 0 ) struct ( 8) zero-ref ;* V07 tmp3 [V07 ] ( 0, 0 ) int -> zero-ref "Inlining Arg" -; V08 tmp4 [V08,T05] ( 2, 8 ) bool -> rax V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)" -; V09 tmp5 [V09,T06] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)" +;* V08 tmp4 [V08 ] ( 0, 0 ) bool -> zero-ref V02.hasValue(offs=0x00) P-INDEP "field V02.hasValue (fldOffset=0x0)" +; V09 tmp5 [V09,T07] ( 2, 6 ) int -> rdx V02.value(offs=0x04) P-INDEP "field V02.value (fldOffset=0x4)" ;* V10 tmp6 [V10 ] ( 0, 0 ) bool -> zero-ref 
V03.hasValue(offs=0x00) P-INDEP "field V03.hasValue (fldOffset=0x0)" ;* V11 tmp7 [V11 ] ( 0, 0 ) int -> zero-ref V03.value(offs=0x04) P-INDEP "field V03.value (fldOffset=0x4)" -;* V12 tmp8 [V12,T08] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)" -; V13 tmp9 [V13,T07] ( 2, 4 ) int -> r9 V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)" -; V14 tmp10 [V14,T03] ( 3, 8 ) bool -> r8 V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)" -; V15 tmp11 [V15,T04] ( 3, 8 ) int -> r9 V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)" -; V16 rat0 [V16,T01] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable" +;* V12 tmp8 [V12,T09] ( 0, 0 ) bool -> zero-ref V05.hasValue(offs=0x00) P-INDEP "field V05.hasValue (fldOffset=0x0)" +; V13 tmp9 [V13,T08] ( 2, 4 ) int -> rdx V05.value(offs=0x04) P-INDEP "field V05.value (fldOffset=0x4)" +; V14 tmp10 [V14,T04] ( 3, 8 ) bool -> rcx V06.hasValue(offs=0x00) P-INDEP "field V06.hasValue (fldOffset=0x0)" +; V15 tmp11 [V15,T05] ( 3, 8 ) int -> rdx V06.value(offs=0x04) P-INDEP "field V06.value (fldOffset=0x4)" +; V16 tmp12 [V16,T00] ( 5, 14 ) bool -> rcx "V00.[000..001)" +; V17 tmp13 [V17,T01] ( 4, 13 ) int -> rdx "V00.[004..008)" +; V18 rat0 [V18,T02] ( 3, 12 ) int -> rdx "ReplaceWithLclVar is creating a new local variable" ; ; Lcl frame size = 40 -G_M24602_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG +G_M24602_IG01: ;; offset=0000H sub rsp, 40 mov qword ptr [rsp+30H], rcx ;; size=9 bbWeight=1 PerfScore 1.25 -G_M24602_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - cmp byte ptr [rsp+30H], 0 +G_M24602_IG02: ;; offset=0009H + movzx rcx, byte ptr [rsp+30H] + mov edx, dword ptr [rsp+34H] + test ecx, ecx jne SHORT G_M24602_IG05 - ;; size=7 bbWeight=1 PerfScore 3.00 -G_M24602_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref + ;; size=13 bbWeight=1 PerfScore 3.25 
+G_M24602_IG03: ;; offset=0016H xor eax, eax ;; size=2 bbWeight=0.50 PerfScore 0.12 -G_M24602_IG04: ; bbWeight=0.50, epilog, nogc, extend +G_M24602_IG04: ;; offset=0018H add rsp, 40 ret ;; size=5 bbWeight=0.50 PerfScore 0.62 -G_M24602_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref - xor ecx, ecx - ;; size=2 bbWeight=0.50 PerfScore 0.12 -G_M24602_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - movzx rax, byte ptr [rsp+30H] - mov edx, dword ptr [rsp+34H] - test al, al - jne SHORT G_M24602_IG08 - ;; size=13 bbWeight=4 PerfScore 13.00 -G_M24602_IG07: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz +G_M24602_IG05: ;; offset=001DH xor r8d, r8d - xor r9d, r9d + align [0 bytes for IG06] + ;; size=3 bbWeight=0.50 PerfScore 0.12 +G_M24602_IG06: ;; offset=0020H + test ecx, ecx + jne SHORT G_M24602_IG08 + ;; size=4 bbWeight=4 PerfScore 5.00 +G_M24602_IG07: ;; offset=0024H + xor ecx, ecx + xor edx, edx jmp SHORT G_M24602_IG09 - ;; size=8 bbWeight=2 PerfScore 5.00 -G_M24602_IG08: ; bbWeight=2, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref - mov r8d, 0xD1FFAB1E - mov eax, r8d - imul edx:eax, edx - mov r9d, edx - shr r9d, 31 - sar edx, 2 - add r9d, edx - mov r8d, 1 - ;; size=30 bbWeight=2 PerfScore 10.50 -G_M24602_IG09: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - mov byte ptr [rsp+30H], r8b - mov dword ptr [rsp+34H], r9d - inc ecx - cmp byte ptr [rsp+30H], 0 - je SHORT G_M24602_IG12 - cmp dword ptr [rsp+34H], 0 - jg SHORT G_M24602_IG06 - ;; size=26 bbWeight=4 PerfScore 33.00 -G_M24602_IG10: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref + ;; size=6 bbWeight=2 PerfScore 5.00 +G_M24602_IG08: ;; offset=002AH + mov ecx, 0x66666667 mov eax, ecx - ;; size=2 bbWeight=0.50 PerfScore 0.12 -G_M24602_IG11: ; bbWeight=0.50, epilog, nogc, extend + imul edx:eax, edx + mov eax, edx + shr eax, 31 + sar edx, 2 + add edx, eax + mov ecx, 1 + ;; size=24 bbWeight=2 PerfScore 
10.50 +G_M24602_IG09: ;; offset=0042H + movzx rcx, cl + inc r8d + test ecx, ecx + je SHORT G_M24602_IG12 + test edx, edx + jg SHORT G_M24602_IG06 + ;; size=14 bbWeight=4 PerfScore 12.00 +G_M24602_IG10: ;; offset=0050H + mov eax, r8d + ;; size=3 bbWeight=0.50 PerfScore 0.12 +G_M24602_IG11: ;; offset=0053H add rsp, 40 ret ;; size=5 bbWeight=0.50 PerfScore 0.62 -G_M24602_IG12: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref +G_M24602_IG12: ;; offset=0058H call [System.ThrowHelper:ThrowInvalidOperationException_InvalidOperation_NoValue()] - ; gcr arg pop 0 int3 ;; size=7 bbWeight=0 PerfScore 0.00 -; Total bytes of code 116, prolog size 9, PerfScore 78.98, instruction count 35, allocated bytes for code 116 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int +; Total bytes of code 95, prolog size 9, PerfScore 48.13, instruction count 35, allocated bytes for code 95 (MethodHash=90769fe5) for method Microsoft.EntityFrameworkCore.Infrastructure.Uniquifier:GetLength(System.Nullable`1[int]):int ``` which is smaller code and much better perf score.
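
The union-find idea from point 1 can be sketched roughly as below. This is only an illustration in Python, not the JIT's implementation: `LocalSets`, `union`, and `groups` are hypothetical names, locals are represented by their dump names, and the second assignment in the usage lines is made up for the example.

```python
from collections import defaultdict


class LocalSets:
    """Union-find over struct locals that are assigned to or from each other."""

    def __init__(self):
        self.parent = {}

    def find(self, local):
        # Locals seen for the first time start in their own singleton set.
        self.parent.setdefault(local, local)
        root = local
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[local] != root:
            self.parent[local], local = root, self.parent[local]
        return root

    def union(self, dst, src):
        # Record a struct assignment `dst = src`.
        root_dst, root_src = self.find(dst), self.find(src)
        if root_dst != root_src:
            self.parent[root_src] = root_dst

    def groups(self):
        # Sets of locals whose promotion decisions should be made together.
        sets = defaultdict(set)
        for local in list(self.parent):
            sets[self.find(local)].add(local)
        return list(sets.values())


sets = LocalSets()
sets.union("V02", "V00")  # STORE_LCL_VAR V02 <- LCL_VAR V00, as in STMT00003
sets.union("V06", "V05")  # hypothetical second assignment, for illustration only
for group in sets.groups():
    print(sorted(group))
```

With such sets available, a heuristic could evaluate a candidate field such as `V00.[004..008)` together with the corresponding field of every other local in the same set, instead of costing each local in isolation.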

jakobbotsch commented 1 year ago

(edit: not expected to be handled)

+16 (+21.33%) : 21081.dasm - System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this

```diff @@ -8,7 +8,7 @@ ; Final local variable assignments ; ;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def -;* V01 loc0 [V01,T01] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op +;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" ;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" @@ -16,11 +16,14 @@ ;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)" ;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)" ;* V08 tmp6 [V08 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)" -;* V09 tmp7 [V09,T04] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)" -;* V10 tmp8 [V10,T05] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)" -;* V11 tmp9 [V11,T06] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)" -; V12 cse0 [V12,T02] ( 3, 3 ) simd8 -> mm0 "CSE - aggressive" -; V13 cse1 [V13,T03] ( 3, 2 ) simd8 -> mm1 "CSE - aggressive" +;* V09 tmp7 [V09,T03] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)" +;* V10 tmp8 [V10,T04] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)" +;* V11 tmp9 [V11,T05] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)" +;* V12 tmp10 [V12 ] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[000..008)" +;* V13 tmp11 [V13,T06] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[008..016)" +;* V14 tmp12 [V14,T07] ( 0, 0 ) simd8 -> zero-ref single-def "V01.[016..024)" +; V15 cse0 [V15,T01] ( 2, 2 ) simd8 -> 
mm0 "CSE - aggressive" +; V16 cse1 [V16,T02] ( 2, 1.50) simd8 -> mm1 "CSE - aggressive" ; ; Lcl frame size = 0 @@ -30,12 +33,14 @@ G_M64376_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, G_M64376_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz vmovsd xmm0, qword ptr [reloc @RWD00] vmovsd xmm1, qword ptr [reloc @RWD08] - vcmpps k1, xmm0, xmm0, 4 + vmovsd xmm2, qword ptr [reloc @RWD00] + vcmpps k1, xmm0, xmm2, 4 kortestb k1, k1 jne SHORT G_M64376_IG04 - ;; size=29 bbWeight=1 PerfScore 11.00 + ;; size=37 bbWeight=1 PerfScore 14.00 G_M64376_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz - vcmpps k1, xmm1, xmm1, 4 + vmovsd xmm0, qword ptr [reloc @RWD08] + vcmpps k1, xmm1, xmm0, 4 kortestb k1, k1 jne SHORT G_M64376_IG04 vxorps xmm0, xmm0, xmm0 @@ -45,7 +50,7 @@ G_M64376_IG03: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byr sete al movzx rax, al jmp SHORT G_M64376_IG05 - ;; size=40 bbWeight=0.50 PerfScore 6.46 + ;; size=48 bbWeight=0.50 PerfScore 7.96 G_M64376_IG04: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref xor eax, eax ;; size=2 bbWeight=0.50 PerfScore 0.12 @@ -56,7 +61,7 @@ RWD00 dq 000000003F800000h RWD08 dq 3F80000000000000h -; Total bytes of code 75, prolog size 3, PerfScore 27.68, instruction count 18, allocated bytes for code 81 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this +; Total bytes of code 91, prolog size 3, PerfScore 33.78, instruction count 20, allocated bytes for code 97 (MethodHash=42890487) for method System.Numerics.Tests.Perf_Matrix3x2:IsIdentityBenchmark():bool:this ``` Replacements: ```scala Accesses for V01 [000..024) as System.Numerics.Matrix3x2 #: (1, 100) # assigned from: (0, 0) # assigned to: (1, 100) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) simd8 @ 000 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned 
value: (0, 0) simd8 @ 008 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) simd8 @ 016 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) Picking promotions for V01 Evaluating access simd8 @ 000 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[000..008). Evaluating access simd8 @ 008 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..016). Evaluating access simd8 @ 016 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[016..024). V01 promoted with 3 replacements [000..008) promoted as simd8 V12 [008..016) promoted as simd8 V13 [016..024) promoted as simd8 V14 Computing unpromoted remainder for V01 Remainder: ``` Physical promotion means we replace a `LCL_FLD` with `LCL_VAR`. VN proves these to be a vector constant, but CSE does not kick in anymore due to the LCL_VAR, and then constant prop ends up creating some more copies of the vector constant. I think the issue is essentially #70182 as LSRA could probably realize and reuse the existing register that already contains the constant.
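
The `Cost with` / `Cost without` figures printed in the promotion dumps in this thread are consistent with a simple linear model. The sketch below is inferred only from those printed numbers, not from the JIT source: the function names and the constants `0.5` and `3.0` are assumptions, and the dumps only show write-backs of 0, so charging write-backs at the same unit cost as read-backs is a guess.

```python
def promotion_costs(access_weight, read_back_weight, write_back_weight,
                    write_back_cost=3.0, promoted_access_cost=0.5,
                    unpromoted_access_cost=3.0):
    """Return (cost_with, cost_without) for one candidate replacement.

    access_weight is the weighted access count from the dump, e.g. the 200
    in "bool @ 000 #: (2, 200)". All constants are assumptions fitted to
    the dumps, not values taken from the JIT.
    """
    cost_with = (promoted_access_cost * access_weight
                 + write_back_cost * (read_back_weight + write_back_weight))
    cost_without = unpromoted_access_cost * access_weight
    return cost_with, cost_without


def should_promote(access_weight, read_back_weight, write_back_weight):
    cost_with, cost_without = promotion_costs(
        access_weight, read_back_weight, write_back_weight)
    return cost_with < cost_without


# Inputs taken from the dumps in this thread.
print(promotion_costs(200, 100, 0))  # bool @ 000 of V00 (Nullable<int>)
print(promotion_costs(100, 100, 0))  # int @ 004 of V00, which was disqualified
print(promotion_costs(100, 0, 0))    # simd8 @ 000 of V01 (Matrix3x2)
```

Under this reading, `int @ 004` is rejected purely because of the read-back charge, which is why accounting for the struct it is assigned to matters for that case.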

jakobbotsch commented 1 year ago

(edit: tracked by #87554)

+18 (+21.95%) : 17866.dasm - System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this

```diff @@ -8,16 +8,20 @@ ; Final local variable assignments ; ;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def -; V01 loc0 [V01,T00] ( 6, 6 ) struct (24) [rsp+00H] do-not-enreg[SF] must-init ld-addr-op +;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" -;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" +; V04 tmp2 [V04 ] ( 7, 7 ) struct (24) [rsp+00H] do-not-enreg[SF] must-init ld-addr-op "Inline ldloca(s) first use temp" ;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)" ;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)" ;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)" -;* V08 tmp6 [V08,T01] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)" -;* V09 tmp7 [V09,T02] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)" -;* V10 tmp8 [V10,T03] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)" +; V08 tmp6 [V08,T00] ( 5, 5 ) simd8 -> [rsp+00H] do-not-enreg[S] single-def V04.X(offs=0x00) P-DEP "field V04.X (fldOffset=0x0)" +; V09 tmp7 [V09,T01] ( 5, 5 ) simd8 -> [rsp+08H] do-not-enreg[S] single-def V04.Y(offs=0x08) P-DEP "field V04.Y (fldOffset=0x8)" +; V10 tmp8 [V10,T02] ( 5, 5 ) simd8 -> [rsp+10H] do-not-enreg[S] single-def V04.Z(offs=0x10) P-DEP "field V04.Z (fldOffset=0x10)" +; V11 tmp9 [V11,T03] ( 2, 2 ) float -> mm0 single-def "V01.[000..004)" +; V12 tmp10 [V12,T04] ( 2, 2 ) float -> mm1 single-def "V01.[004..008)" +; V13 tmp11 [V13,T05] ( 2, 2 ) float -> mm2 single-def "V01.[008..012)" +; V14 tmp12 [V14,T06] ( 
2, 2 ) float -> mm3 single-def "V01.[012..016)" ; ; Lcl frame size = 24 @@ -34,12 +38,16 @@ G_M33935_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref vmovsd qword ptr [rsp], xmm0 vmovsd xmm0, qword ptr [reloc @RWD08] vmovsd qword ptr [rsp+08H], xmm0 + vxorps xmm0, xmm0, xmm0 + vmovsd qword ptr [rsp+10H], xmm0 vmovss xmm0, dword ptr [rsp] - vmulss xmm0, xmm0, dword ptr [rsp+0CH] - vmovss xmm1, dword ptr [rsp+08H] - vmulss xmm1, xmm1, dword ptr [rsp+04H] + vmovss xmm1, dword ptr [rsp+04H] + vmovss xmm2, dword ptr [rsp+08H] + vmovss xmm3, dword ptr [rsp+0CH] + vmulss xmm0, xmm0, xmm3 + vmulss xmm1, xmm2, xmm1 vsubss xmm0, xmm0, xmm1 - ;; size=54 bbWeight=1 PerfScore 27.00 + ;; size=72 bbWeight=1 PerfScore 30.33 G_M33935_IG03: ; bbWeight=1, epilog, nogc, extend add rsp, 24 ret @@ -48,7 +56,7 @@ RWD00 dq 000000003F800000h RWD08 dq 3F80000000000000h -; Total bytes of code 82, prolog size 23, PerfScore 41.28, instruction count 17, allocated bytes for code 82 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this +; Total bytes of code 100, prolog size 23, PerfScore 46.42, instruction count 21, allocated bytes for code 100 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this ``` Replacements: ```scala Accesses for V01 [000..024) as System.Numerics.Matrix3x2 #: (1, 100) # assigned from: (0, 0) # assigned to: (1, 100) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) float @ 000 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) float @ 004 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) float @ 008 #: (1, 100) # assigned from: (0, 0) # assigned to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) float @ 012 #: (1, 100) # assigned from: (0, 0) # assigned 
to: (0, 0) # as call arg: (0, 0) # as retbuf: (0, 0) # as returned value: (0, 0) Picking promotions for V01 Evaluating access float @ 000 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 11 (V11 tmp9) (a long lifetime temp) called for V01.[000..004). Evaluating access float @ 004 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 12 (V12 tmp10) (a long lifetime temp) called for V01.[004..008). Evaluating access float @ 008 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 13 (V13 tmp11) (a long lifetime temp) called for V01.[008..012). Evaluating access float @ 012 Single write-back cost: 3 Write backs: 0 Read backs: 0 Cost with: 50 Cost without: 300 Promoting replacement lvaGrabTemp returning 14 (V14 tmp12) (a long lifetime temp) called for V01.[012..016). V01 promoted with 4 replacements [000..004) promoted as float V11 [004..008) promoted as float V12 [008..012) promoted as float V13 [012..016) promoted as float V14 Computing unpromoted remainder for V01 Remainder: [016..024) ``` We end up creating IR that DNERs V04: ```scala STMT00001 ( 0x000[E-] ... ??? 
) [000017] DA--G------ ▌ STORE_LCL_VAR struct V01 loc0 [000031] ----------- └──▌ LCL_VAR struct(P) V04 tmp2 ▌ simd8 V04.:X (offs=0x00) -> V08 tmp6 (last use) ▌ simd8 V04.:Y (offs=0x08) -> V09 tmp7 (last use) ▌ simd8 V04.:Z (offs=0x10) -> V10 tmp8 (last use) Processing block operation [000017] that involves replacements V11 (V01.[000..004)) <- src+000 V12 (V01.[004..008)) <- src+004 V13 (V01.[008..012)) <- src+008 V14 (V01.[012..016)) <- src+012 => Remainder strategy: do nothing (remainder dying) Local V04 should not be enregistered because: was accessed as a local field Local V04 should not be enregistered because: was accessed as a local field Local V04 should not be enregistered because: was accessed as a local field Local V04 should not be enregistered because: was accessed as a local field New statement: STMT00001 ( 0x000[E-] ... ??? ) [000075] -A--------- ▌ COMMA void [000066] DA--------- ├──▌ STORE_LCL_VAR float V11 tmp9 [000065] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+0] │ ▌ simd8 V04.:X (offs=0x00) -> V08 tmp6 │ ▌ simd8 V04.:Y (offs=0x08) -> V09 tmp7 │ ▌ simd8 V04.:Z (offs=0x10) -> V10 tmp8 [000074] -A--------- └──▌ COMMA void [000068] DA--------- ├──▌ STORE_LCL_VAR float V12 tmp10 [000067] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+4] │ ▌ simd8 V04.:X (offs=0x00) -> V08 tmp6 │ ▌ simd8 V04.:Y (offs=0x08) -> V09 tmp7 │ ▌ simd8 V04.:Z (offs=0x10) -> V10 tmp8 [000073] -A--------- └──▌ COMMA void [000070] DA--------- ├──▌ STORE_LCL_VAR float V13 tmp11 [000069] ----------- │ └──▌ LCL_FLD float (P) V04 tmp2 [+8] │ ▌ simd8 V04.:X (offs=0x00) -> V08 tmp6 │ ▌ simd8 V04.:Y (offs=0x08) -> V09 tmp7 │ ▌ simd8 V04.:Z (offs=0x10) -> V10 tmp8 [000072] DA--------- └──▌ STORE_LCL_VAR float V14 tmp12 [000071] ----------- └──▌ LCL_FLD float (P) V04 tmp2 [+12] ▌ simd8 V04.:X (offs=0x00) -> V08 tmp6 ▌ simd8 V04.:Y (offs=0x08) -> V09 tmp7 ▌ simd8 V04.:Z (offs=0x10) -> V10 tmp8 ``` This is missing `GetElement/WithElement` handling that local morph has. 
With that handling I think we would end up with: ```scala [000083] -A--------- ▌ COMMA void [000068] DA--------- ├──▌ STORE_LCL_VAR float V11 tmp9 [000067] ----------- │ └──▌ HWINTRINSIC float float ToScalar [000066] ----------- │ └──▌ LCL_VAR simd8 V08 tmp6 [000082] -A--------- └──▌ COMMA void [000072] DA--------- ├──▌ STORE_LCL_VAR float V12 tmp10 [000071] ----------- │ └──▌ HWINTRINSIC float float GetElement [000070] ----------- │ ├──▌ LCL_VAR simd8 V08 tmp6 [000069] ----------- │ └──▌ CNS_INT int 1 [000081] -A--------- └──▌ COMMA void [000076] DA--------- ├──▌ STORE_LCL_VAR float V13 tmp11 [000075] ----------- │ └──▌ HWINTRINSIC float float ToScalar [000074] ----------- │ └──▌ LCL_VAR simd8 V09 tmp7 [000080] DA--------- └──▌ STORE_LCL_VAR float V14 tmp12 [000079] ----------- └──▌ HWINTRINSIC float float GetElement [000078] ----------- ├──▌ LCL_VAR simd8 V09 tmp7 [000077] ----------- └──▌ CNS_INT int 1 ``` Hacking this in we end up folding the entire benchmark to a constant: ```asm ; Assembly listing for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this ; Emitting BLENDED_CODE for X64 with AVX512 - Windows ; optimized code ; rsp based frame ; partially interruptible ; No matching PGO data ; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def ;* V01 loc0 [V01 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" ;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" ;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)" ;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)" ;* 
V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)" ;* V08 tmp6 [V08,T00] ( 0, 0 ) simd8 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)" ;* V09 tmp7 [V09,T01] ( 0, 0 ) simd8 -> zero-ref single-def V04.Y(offs=0x08) P-INDEP "field V04.Y (fldOffset=0x8)" ;* V10 tmp8 [V10 ] ( 0, 0 ) simd8 -> zero-ref single-def V04.Z(offs=0x10) P-INDEP "field V04.Z (fldOffset=0x10)" ;* V11 tmp9 [V11,T02] ( 0, 0 ) float -> zero-ref single-def "V01.[000..004)" ;* V12 tmp10 [V12,T03] ( 0, 0 ) float -> zero-ref single-def "V01.[004..008)" ;* V13 tmp11 [V13,T04] ( 0, 0 ) float -> zero-ref single-def "V01.[008..012)" ;* V14 tmp12 [V14,T05] ( 0, 0 ) float -> zero-ref single-def "V01.[012..016)" ; ; Lcl frame size = 0 G_M33935_IG01: ;; offset=0000H vzeroupper ;; size=3 bbWeight=1 PerfScore 1.00 G_M33935_IG02: ;; offset=0003H vmovss xmm0, dword ptr [reloc @RWD00] ;; size=8 bbWeight=1 PerfScore 3.00 G_M33935_IG03: ;; offset=000BH ret ;; size=1 bbWeight=1 PerfScore 1.00 RWD00 dd 3F800000h ; 1 ; Total bytes of code 12, prolog size 3, PerfScore 6.20, instruction count 3, allocated bytes for code 12 (MethodHash=96117b70) for method System.Numerics.Tests.Perf_Matrix3x2:GetDeterminantBenchmark():float:this ```
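
The mapping from a scalar `LCL_FLD` read to the `ToScalar`/`GetElement` form shown in the expected IR above can be sketched as follows. This is a hypothetical helper in Python for illustration only: `extract_scalar` and its tuple-based result are assumptions, not the JIT's API, and it only handles element-aligned reads that fit inside a single promoted field.

```python
def extract_scalar(fields, offset, size=4):
    """Map a scalar read at `offset` onto a promoted SIMD field.

    fields: list of (field_offset, field_size, local_name) for the
    regularly promoted source struct. Returns ("ToScalar", local),
    ("GetElement", local, index), or None if no single field covers
    the read or the read is not element-aligned.
    """
    for field_offset, field_size, local in fields:
        covers = (field_offset <= offset
                  and offset + size <= field_offset + field_size)
        if not covers:
            continue
        index, misaligned = divmod(offset - field_offset, size)
        if misaligned:
            return None
        if index == 0:
            return ("ToScalar", local)
        return ("GetElement", local, index)
    return None


# Promoted fields of V04 (Matrix3x2) as listed in the dump.
v04_fields = [(0, 8, "V08"), (8, 8, "V09"), (16, 8, "V10")]
for offset in (0, 4, 8, 12):
    print(offset, extract_scalar(v04_fields, offset))
```

For offsets 0, 4, 8 and 12 this yields the same four operations as the expected IR: `ToScalar` on `V08`, `GetElement` 1 on `V08`, `ToScalar` on `V09`, and `GetElement` 1 on `V09`, so `V04` no longer needs to be accessed as a local field.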

jakobbotsch commented 1 year ago

(edit: not expected to be handled)

+44 (+27.67%) : 1550.dasm - System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this

```diff @@ -8,55 +8,63 @@ ; Final local variable assignments ; ; V00 this [V00,T00] ( 12, 12 ) byref -> rcx this single-def -; V01 RetBuf [V01,T02] ( 4, 4 ) byref -> rbx single-def -; V02 loc0 [V02,T01] ( 12, 12 ) struct (56) [rsp+08H] do-not-enreg[SF] ld-addr-op +; V01 RetBuf [V01,T01] ( 12, 12 ) byref -> rbx single-def +; V02 loc0 [V02,T02] ( 4, 4 ) struct (56) [rsp+10H] do-not-enreg[SF] ld-addr-op ;# V03 OutArgs [V03 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" +; V04 tmp1 [V04,T03] ( 2, 2 ) long -> rbp "V02.[000..008)" +; V05 tmp2 [V05,T04] ( 2, 2 ) long -> r14 "V02.[008..016)" +; V06 tmp3 [V06,T05] ( 2, 2 ) bool -> r15 "V02.[016..017)" +; V07 tmp4 [V07,T06] ( 2, 2 ) bool -> r12 "V02.[017..018)" +; V08 tmp5 [V08,T07] ( 2, 2 ) bool -> r13 "V02.[018..019)" +; V09 tmp6 [V09,T08] ( 2, 2 ) bool -> [rsp+0CH] spill-single-def "V02.[019..020)" +; V10 tmp7 [V10,T09] ( 2, 2 ) ubyte -> [rsp+08H] spill-single-def "V02.[020..021)" +; V11 tmp8 [V11,T10] ( 2, 2 ) ubyte -> [rsp+04H] spill-single-def "V02.[021..022)" ; -; Lcl frame size = 64 +; Lcl frame size = 72 G_M2776_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG + push r15 + push r14 + push r13 + push r12 push rdi push rsi + push rbp push rbx - sub rsp, 64 + sub rsp, 72 vzeroupper mov rbx, rdx ; byrRegs +[rbx] - ;; size=13 bbWeight=1 PerfScore 4.50 + ;; size=22 bbWeight=1 PerfScore 9.50 G_M2776_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=000A {rcx rbx}, byref ; byrRegs +[rcx] vxorps ymm0, ymm0, ymm0 - vmovdqu ymmword ptr [rsp+08H], ymm0 - vmovdqu ymmword ptr [rsp+20H], ymm0 - mov rax, qword ptr [rcx] - mov qword ptr [rsp+08H], rax - mov rax, qword ptr [rcx+08H] - mov qword ptr [rsp+10H], rax - movzx rax, byte ptr [rcx+26H] - mov byte ptr [rsp+18H], al - movzx rax, byte ptr [rcx+27H] - mov byte ptr [rsp+19H], al - movzx rax, byte ptr [rcx+2EH] - mov byte ptr [rsp+1AH], al + vmovdqu ymmword ptr [rsp+10H], ymm0 + vmovdqu ymmword ptr [rsp+28H], 
ymm0 + mov rbp, qword ptr [rcx] + mov r14, qword ptr [rcx+08H] + movzx r15, byte ptr [rcx+26H] + movzx r12, byte ptr [rcx+27H] + movzx r13, byte ptr [rcx+2EH] movzx rax, byte ptr [rcx+2CH] - mov byte ptr [rsp+1BH], al - movzx rax, byte ptr [rcx+28H] - mov byte ptr [rsp+1CH], al - movzx rax, byte ptr [rcx+29H] - mov byte ptr [rsp+1DH], al - mov rax, qword ptr [rcx+40H] - mov qword ptr [rsp+20H], rax - ;; size=90 bbWeight=1 PerfScore 29.33 + mov dword ptr [rsp+0CH], eax + movzx rdx, byte ptr [rcx+28H] + mov dword ptr [rsp+08H], edx + movzx r8, byte ptr [rcx+29H] + mov dword ptr [rsp+04H], r8d + mov r9, qword ptr [rcx+40H] + mov qword ptr [rsp+28H], r9 + ;; size=73 bbWeight=1 PerfScore 24.33 G_M2776_IG03: ; bbWeight=1, nogc, extend vmovdqu xmm0, xmmword ptr [rcx+48H] - vmovdqu xmmword ptr [rsp+28H], xmm0 - mov rax, qword ptr [rcx+58H] - mov qword ptr [rsp+38H], rax + vmovdqu xmmword ptr [rsp+30H], xmm0 + mov r9, qword ptr [rcx+58H] + mov qword ptr [rsp+40H], r9 ;; size=20 bbWeight=1 PerfScore 8.00 G_M2776_IG04: ; bbWeight=1, extend mov rdi, rbx ; byrRegs +[rdi] - lea rsi, bword ptr [rsp+08H] + lea rsi, bword ptr [rsp+10H] ; byrRegs +[rsi] mov ecx, 4 ; byrRegs -[rcx] @@ -64,18 +72,34 @@ G_M2776_IG04: ; bbWeight=1, extend call CORINFO_HELP_ASSIGN_BYREF movsq movsq + mov qword ptr [rbx], rbp + mov qword ptr [rbx+08H], r14 + mov byte ptr [rbx+10H], r15b + mov byte ptr [rbx+11H], r12b + mov byte ptr [rbx+12H], r13b + mov ebp, dword ptr [rsp+0CH] + mov byte ptr [rbx+13H], bpl + mov ebp, dword ptr [rsp+08H] + mov byte ptr [rbx+14H], bpl + mov ebp, dword ptr [rsp+04H] + mov byte ptr [rbx+15H], bpl mov rax, rbx ; byrRegs +[rax] - ;; size=28 bbWeight=1 PerfScore 29.25 + ;; size=71 bbWeight=1 PerfScore 40.25 G_M2776_IG05: ; bbWeight=1, epilog, nogc, extend - add rsp, 64 + add rsp, 72 pop rbx + pop rbp pop rsi pop rdi + pop r12 + pop r13 + pop r14 + pop r15 ret - ;; size=8 bbWeight=1 PerfScore 2.75 + ;; size=17 bbWeight=1 PerfScore 5.25 -; Total bytes of code 159, prolog size 10, 
PerfScore 89.73, instruction count 44, allocated bytes for code 159 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this
+; Total bytes of code 203, prolog size 19, PerfScore 107.63, instruction count 60, allocated bytes for code 203 (MethodHash=d49af527) for method System.Text.Json.Utf8JsonReader:get_CurrentState():System.Text.Json.JsonReaderState:this
```

Replacements:

```scala
Accesses for V02
  [000..056) as System.Text.Json.JsonReaderState
    #: (2, 200)
    # assigned from: (1, 100)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  long @ 000
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  long @ 008
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  bool @ 016
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  bool @ 017
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  bool @ 018
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  bool @ 019
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  ubyte @ 020
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  ubyte @ 021
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  [024..032) as System.Text.Json.JsonReaderOptions
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
  [032..056) as System.Text.Json.BitStack
    #: (1, 100)
    # assigned from: (0, 0)
    # assigned to: (1, 100)
    # as call arg: (0, 0)
    # as retbuf: (0, 0)
    # as returned value: (0, 0)
Picking promotions for V02
  Evaluating access long @ 000
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 4 (V04 tmp1) (a long lifetime temp) called for V02.[000..008).
  Evaluating access long @ 008
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 5 (V05 tmp2) (a long lifetime temp) called for V02.[008..016).
  Evaluating access bool @ 016
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 6 (V06 tmp3) (a long lifetime temp) called for V02.[016..017).
  Evaluating access bool @ 017
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 7 (V07 tmp4) (a long lifetime temp) called for V02.[017..018).
  Evaluating access bool @ 018
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 8 (V08 tmp5) (a long lifetime temp) called for V02.[018..019).
  Evaluating access bool @ 019
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 9 (V09 tmp6) (a long lifetime temp) called for V02.[019..020).
  Evaluating access ubyte @ 020
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 10 (V10 tmp7) (a long lifetime temp) called for V02.[020..021).
  Evaluating access ubyte @ 021
    Single write-back cost: 3
    Write backs: 0
    Read backs: 0
    Cost with: 50
    Cost without: 300
  Promoting replacement
lvaGrabTemp returning 11 (V11 tmp8) (a long lifetime temp) called for V02.[021..022).
V02 promoted with 8 replacements
  [000..008) promoted as long V04
  [008..016) promoted as long V05
  [016..017) promoted as bool V06
  [017..018) promoted as bool V07
  [018..019) promoted as bool V08
  [019..020) promoted as bool V09
  [020..021) promoted as ubyte V10
  [021..022) promoted as ubyte V11
Computing unpromoted remainder for V02
  Remainder: [024..056)
```

This is one of the cases where the heuristic does not take into account that decomposed assignments can be more expensive with many fields, especially considering we end up spilling some of the fields. We end up with

```scala
Processing block operation [000065] that involves replacements
  dst+000 <- V04 (V02.[000..008)) (last use)
  dst+008 <- V05 (V02.[008..016)) (last use)
  dst+016 <- V06 (V02.[016..017)) (last use)
  dst+017 <- V07 (V02.[017..018)) (last use)
  dst+018 <- V08 (V02.[018..019)) (last use)
  dst+019 <- V09 (V02.[019..020)) (last use)
  dst+020 <- V10 (V02.[020..021)) (last use)
  dst+021 <- V11 (V02.[021..022)) (last use)
  Remainder: [024..056)
  => Remainder strategy: retain a full block op
New statement:
STMT00012 ( 0x08A[E-] ... 0x08B )
[000124] -A-XG------ ▌  COMMA void
[000065] -A-XG------ ├──▌  STORE_BLK struct (copy)
[000079] ----------- │  ├──▌  LCL_VAR byref V01 RetBuf
[000063] ----------- │  └──▌  LCL_VAR struct V02 loc0
[000123] -A-XG------ └──▌  COMMA void
[000082] -A-XG------    ├──▌  STOREIND long
[000081] -----------    │  ├──▌  LCL_VAR byref V01 RetBuf
[000080] -----------    │  └──▌  LCL_VAR long V04 tmp1 (last use)
[000122] -A-XG------    └──▌  COMMA void
[000087] -A-XG------       ├──▌  STOREIND long
[000086] -----------       │  ├──▌  ADD byref
[000084] -----------       │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000085] -----------       │  │  └──▌  CNS_INT long 8
[000083] -----------       │  └──▌  LCL_VAR long V05 tmp2 (last use)
[000121] -A-XG------       └──▌  COMMA void
[000092] -A-XG------          ├──▌  STOREIND bool
[000091] -----------          │  ├──▌  ADD byref
[000089] -----------          │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000090] -----------          │  │  └──▌  CNS_INT long 16
[000088] -----------          │  └──▌  LCL_VAR bool V06 tmp3 (last use)
[000120] -A-XG------          └──▌  COMMA void
[000097] -A-XG------             ├──▌  STOREIND bool
[000096] -----------             │  ├──▌  ADD byref
[000094] -----------             │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000095] -----------             │  │  └──▌  CNS_INT long 17
[000093] -----------             │  └──▌  LCL_VAR bool V07 tmp4 (last use)
[000119] -A-XG------             └──▌  COMMA void
[000102] -A-XG------                ├──▌  STOREIND bool
[000101] -----------                │  ├──▌  ADD byref
[000099] -----------                │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000100] -----------                │  │  └──▌  CNS_INT long 18
[000098] -----------                │  └──▌  LCL_VAR bool V08 tmp5 (last use)
[000118] -A-XG------                └──▌  COMMA void
[000107] -A-XG------                   ├──▌  STOREIND bool
[000106] -----------                   │  ├──▌  ADD byref
[000104] -----------                   │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000105] -----------                   │  │  └──▌  CNS_INT long 19
[000103] -----------                   │  └──▌  LCL_VAR bool V09 tmp6 (last use)
[000117] -A-XG------                   └──▌  COMMA void
[000112] -A-XG------                      ├──▌  STOREIND ubyte
[000111] -----------                      │  ├──▌  ADD byref
[000109] -----------                      │  │  ├──▌  LCL_VAR byref V01 RetBuf
[000110] -----------                      │  │  └──▌  CNS_INT long 20
[000108] -----------                      │  └──▌  LCL_VAR ubyte V10 tmp7 (last use)
[000116] -A-XG------                      └──▌  STOREIND ubyte
[000115] -----------                         ├──▌  ADD byref
[000064] -----------                         │  ├──▌  LCL_VAR byref V01 RetBuf
[000114] -----------                         │  └──▌  CNS_INT long 21
[000113] -----------                         └──▌  LCL_VAR ubyte V11 tmp8 (last use)
```

to handle the assignment into the ret buffer. We do see some signs of why the promotion could be beneficial, as we are able to keep a bunch of the fields in registers instead of on the stack, but we just don't have enough registers on x64 to do that for all of them.
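
The `Remainder: [024..056)` line in the dump above can be reproduced with a simple interval subtraction. The following is only a minimal Python sketch of that idea, not the JIT's implementation: the function name `unpromoted_remainder` is hypothetical, and the list of significant (non-padding) byte ranges, `[0..22)` and `[24..56)`, is an assumption inferred from the access list rather than something the dump states directly.

```python
def unpromoted_remainder(significant, promoted):
    """Return the parts of `significant` that no promoted range covers.

    Both arguments are lists of half-open (start, end) byte ranges.
    This is an illustrative helper, not a JIT API.
    """
    remainder = []
    promoted = sorted(promoted)
    for seg_start, seg_end in sorted(significant):
        cursor = seg_start
        for p_start, p_end in promoted:
            if p_end <= cursor or p_start >= seg_end:
                continue  # no overlap with what is left of this segment
            if p_start > cursor:
                # Gap between the covered prefix and this promoted range.
                remainder.append((cursor, p_start))
            cursor = max(cursor, p_end)
        if cursor < seg_end:
            remainder.append((cursor, seg_end))
    return remainder


# Assumed layout: bytes [22..24) are padding and therefore not significant.
significant = [(0, 22), (24, 56)]
# The eight replacements V04..V11 from the dump.
promoted = [(0, 8), (8, 16), (16, 17), (17, 18),
            (18, 19), (19, 20), (20, 21), (21, 22)]

print(unpromoted_remainder(significant, promoted))  # [(24, 56)]
```

Under these assumptions, eight scalar stores still leave a 32-byte remainder that needs a full block copy, which is the extra cost the heuristic does not currently model.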

jakobbotsch commented 1 year ago

With the perflab runs @cincuranet set up and a query from @AndyAyersMS I can start looking at micro benchmark regressions. The following lists all benchmarks with a ratio below 0.95, indicating that they regress by more than 5%. There are 56 entries in this list (for comparison, the query for benchmarks that improve by more than 5% returns 267 results, but take it with a grain of salt as many of these are noisy). The quality columns are computed as median divided by standard deviation, so larger numbers indicate more stable benchmarks.

| Notes | Benchmark | Ratio | Promotion median | Default median | Promotion quality | Default quality |
| --- | --- | --- | --- | --- | --- | --- |
| Bimodal | PerfLabTests.CastingPerf.CheckObjIsInterfaceNo | 0.50201266706593484 | 62373.178950863221 | 31312.125918503676 | 4.1721551887971193 | 2.0068501317441818 |
| Bimodal | PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes | 0.5020356039475935 | 62373.964854866252 | 31313.95111651875 | 4.2768060071352947 | 2.0060349494210894 |
| Bimodal | PerfLabTests.CastingPerf.CheckObjIsInterfaceYes | 0.50204105771866514 | 62373.932840068293 | 31314.275217100869 | 4.2853027310745686 | 2.0074065296253232 |
| Bimodal | PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo | 0.50216288733931325 | 62373.730079681263 | 31321.772390935719 | 4.0927570892698251 | 2.0097825475492144 |
| Bimodal | PerfLabTests.LowLevelPerf.EmptyStaticFunction | 0.7374686274234068 | 2604944.2708333335 | 1921064.6759259258 | 7.1867237699720317 | 3.0824059321076045 |
| Noisy | `MicroBenchmarks.Serializers.Xml_ToStream<MyEventsListerViewModel>.XmlSerializer_` | 0.74302975696458551 | 660707.35294117662 | 490925.2238805971 | 4.9061924677629705 | 3.7168701617547995 |
| Maybe? Need more data | `System.Memory.Constructors<Byte>.ArrayAsMemory` | 0.79356832092520146 | 2.3856637796570128 | 1.8931871999144854 | 6.3931791008772434 | 4.7934943297225 |
| Bimodal | `System.Memory.Constructors<String>.MemoryMarshalCreateSpan` | 0.80252756077275078 | 1.5784818512235037 | 1.2667751897864545 | 5.5628662599349337 | 5.0711645321790915 |
| Bimodal | `System.Memory.Constructors<String>.MemoryMarshalCreateReadOnlySpan` | 0.80496260867919744 | 1.5785205328904912 | 1.2706500060092067 | 5.7538606845191165 | 4.9616066679124549 |
| Bimodal | PerfLabTests.CastingPerf2.CastingPerf.IntObj | 0.8285766871531669 | 226939.75845410625 | 188036.99324324325 | 12.390298163607353 | 9.8413406239725 |
| Regression (vec cns reuse) | System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark | 0.83234334609059291 | 1.2211270243560317 | 1.0163969534541484 | 10.778735229862832 | 7.8101018258309587 |
| Multimodal | `System.Collections.ContainsFalse<Int32>.Span(Size: 512)` | 0.845401219708253 | 29126.453232893906 | 24623.539088863898 | 11.021424747595237 | 13.417315107098991 |
| Maybe? Need more data | `System.Collections.TryGetValueTrue<Int32, Int32>.Dictionary(Size: 512)` | 0.84571433947519314 | 3795.9668312075878 | 3210.3035813244669 | 7.9431221125054137 | 6.7635448167816792 |
| Bimodal | System.Tests.Perf_Enum.IsDefined_Generic_NonFlags | 0.85163797170392563 | 3.1193090478744949 | 2.6565220306495383 | 8.8673428674969017 | 8.5158225518187 |
| Maybe? Need more data | Devirtualization.EqualityComparer.ValueTupleCompareNoOpt | 0.85240213125413522 | 5.8012179366666068 | 4.9449705330843328 | 9.3287988994501845 | 7.4879186079626407 |
| Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark | 0.85729075729610349 | 1.62796924888388 | 1.3956429902304301 | 44.749404427588338 | 12.091485745364375 |
| Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.AddOperatorBenchmark | 0.85737159681747388 | 1.6278395951425713 | 1.3956634330500965 | 63.074207379712796 | 10.42100793946469 |
| Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.SubtractOperatorBenchmark | 0.85739598139557893 | 1.627907761982518 | 1.3957615732064814 | 38.580582580347361 | 13.237060244441501 |
| Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.AddBenchmark | 0.85795657072580145 | 1.6266881427332425 | 1.3956277805797357 | 26.052276152907275 | 9.560646773106166 |
| Bimodal | System.MathBenchmarks.MathTests.DivRemInt32 | 0.85968833841252423 | 1.5068739322565785 | 1.2954419470188046 | 12.849580314598676 | 7.9300084772432688 |
| Modal, but check again (only spiked with phys prom) | PerfLabTests.CastingPerf.ObjObjrefValueType | 0.8646586924544396 | 361425.11420265783 | 312509.36666666664 | 14.971126929257249 | 15.218101244318584 |
| Like above | PerfLabTests.CastingPerf.FooObjIsNull | 0.86607371151458068 | 361202.81007751933 | 312828.25833333336 | 14.968123248057182 | 15.452934201966883 |
| Bimodal | PerfLabTests.LowLevelPerf.GenericGenericMethod | 0.86612762496753892 | 187101.49305555556 | 162053.77180808882 | 15.107030942049 | 11.493496309257749 |
| Modal, but check again (only spiked with phys prom) | PerfLabTests.CastingPerf.ObjInt | 0.86704330919561978 | 360542.60349025979 | 312606.05203619908 | 14.650520670993343 | 14.998381025621656 |
| Like above | PerfLabTests.CastingPerf.ObjFooIsObj | 0.86731406352131957 | 360339.93371212116 | 312527.89215686271 | 14.967143545401138 | 15.27578920376382 |
| Like above | PerfLabTests.CastingPerf.ObjScalarValueType | 0.86780419724880409 | 360123.47537878784 | 312516.66346153844 | 14.864352722727245 | 15.405796220386524 |
| Maybe? Need more data | Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.ForEach_Array | 0.86930747689206089 | 5.2903566301283576 | 4.5989465739960682 | 30.026761512990653 | 10.35044745664271 |
| Bimodal regression and improvement | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock\|Holmes\|Watson\|Irene\|Adler\|John\|Baker", Options: NonBacktracking) | 0.870338288956494 | 1567827.0673076923 | 1364539.9271402548 | 11.60696233951157 | 14.090632090500138 |
| Regression (pipelining, #87554) | System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark | 0.87078330292883266 | 4.5609575925251837 | 3.9716057169374164 | 100.14896735833135 | 10.364535617674489 |
| Regression (pipelining, #87554) | System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark | 0.87095941903031426 | 4.5610356119008992 | 3.9724769267177811 | 129.83253482155473 | 9.5704310251011382 |
| Noisy | System.Tests.Perf_String.Trim(s: "Test") | 0.87443817495914056 | 2.5474217375187354 | 2.2275628150071256 | 8.6368738268479444 | 10.77430293074069 |
| Bimodal | Benchstone.BenchI.BubbleSort.Test | 0.88167317428377268 | 13708.670704845816 | 12086.567215552373 | 17.710410860982513 | 14.8836516447514 |
| Bimodal | PerfLabTests.CastingPerf.FooObjIsFoo2 | 0.888850848188488 | 434724.572368421 | 386405.30487804877 | 16.696742581172664 | 14.025034933825713 |
| Multimodal | `System.Collections.ContainsKeyTrue<Int32, Int32>.Dictionary(Size: 512)` | 0.90009475716937359 | 3514.2079858763432 | 3163.1201836900404 | 12.024653108378379 | 8.0678159153909181 |
| Maybe? Need more data | PerfLabTests.CastingPerf.CheckArrayIsArrayByVariance | 0.90067954767068859 | 2.7450413012046524 | 2.4724025575063648 | 2.7009895041042808 | 1.7932475048783776 |
| Maybe? Need more data | System.Memory.ReadOnlySequence.Slice_Start(Segment: Multiple) | 0.90964038072783471 | 3.4493846916202595 | 3.1376996041622176 | 8.9030611892736591 | 7.7368085557620461 |
| Maybe? Need more data | System.Buffers.Tests.ReadOnlySequenceTests.FirstTenSegments | 0.91450624664500813 | 5.0942189368202619 | 4.6586950394994213 | 13.250302455948326 | 5.93807832331802 |
| Maybe? Need more data | System.Numerics.Tests.Perf_Matrix3x2.EqualityOperatorBenchmark | 0.91505574135403611 | 1.7038860172829366 | 1.5591506827276136 | 18.225174899407335 | 16.6845459834044 |
| Maybe? Need more data | System.Tests.Perf_Boolean.TryParse(value: "0") | 0.91676396690683648 | 3.2874644520224661 | 3.0138289521013255 | 12.304077459929706 | 9.4382205739902538 |
| Noisy | System.Text.Perf_Utf8Encoding.GetBytes(Input: Chinese) | 0.9205757115569656 | 161762.16635338342 | 148914.32139376219 | 15.828556551677019 | 15.141829429651327 |
| Noisy | System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 40000, unique: 7, cacheSize: 0) | 0.9227115889387062 | 59276610 | 54695215 | 10.351467756318263 | 8.71913930818749 |
| Noisy | BenchmarksGame.BinaryTrees_5.RunBench | 0.92322071325312827 | 179044422.5 | 165297519.44444445 | 22.266548353098084 | 20.890177523956297 |
| Noisy | `System.Memory.Span<Byte>.Clear(Size: 512)` | 0.92331343747602457 | 6.4115322856679935 | 5.9198539141686277 | 2.4348688680033095 | 2.7869962552546239 |
| Noisy | System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch_Multithreading(total: 40000, unique: 7, cacheSize: 0) | 0.9236300501927136 | 18645578.75 | 17221616.836734693 | 13.580364545045281 | 12.069896150198117 |
| Maybe? Need more data | Benchstone.BenchI.XposMatrix.Test | 0.93003313539222943 | 18078.169075144509 | 16813.296267107489 | 69.731706322545591 | 25.949469557869556 |
| Noisy | Benchstone.BenchI.AddArray.Test | 0.93120132537564571 | 20210.202752976191 | 18819.767589681953 | 14.050293215847208 | 15.155225360820852 |
| Noisy | System.Buffers.Tests.ReadOnlySequenceTests.FirstTenSegments | 0.93132417841360293 | 4.6180075165850623 | 4.3008620562914261 | 8.1717473277435051 | 6.1868419993236987 |
| Bimodal regression and improvement | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sherlock\|Holmes\|Watson", Options: NonBacktracking) | 0.93138653274623273 | 2780731.38576779 | 2589935.763888889 | 14.820176511579046 | 20.266737041566497 |
| Noisy | System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000) | 0.93154070667252664 | 43590.89447463768 | 40606.692643391521 | 17.251682799323238 | 14.914235278894841 |
| Noisy | `System.Collections.CtorDefaultSize<String>.Stack` | 0.93671812574135571 | 17.861669480703526 | 16.731349558576181 | 16.985916186371369 | 14.140773457045354 |
| Bimodal | Benchstone.BenchI.Fib.Test | 0.93955184884835363 | 159694.67229199369 | 150041.42460317462 | 20.749516161852206 | 22.520675763619309 |
| Maybe? Need more data | Benchstone.BenchI.IniArray.Test | 0.940509770186889 | 67189541.25 | 63192420 | 5.0790460095421039 | 5.8211886020145069 |
| Bimodal | System.Memory.Span.IndexOfAnyFourValues(Size: 33) | 0.94573448370538127 | 57.653966580812074 | 54.525344317871614 | 23.807900467166885 | 21.941204410312682 |
| Maybe? Need more data | System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GetHashCodeBenchmark | 0.94589448437699652 | 12.750723643616887 | 12.060839166312574 | 18.816588477582815 | 17.400192778480125 |
| Maybe? Need more data | System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>.HeapSort(Size: 1000) | 0.94921192013119116 | 75208.203125 | 71388.5228978979 | 38.440681648109951 | 23.375784409521643 |
| Bimodal | System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000) | 0.94928848058446713 | 4249593.2471264368 | 4034089.916666667 | 22.034137694330397 | 18.666116210466772 |
| Noisy (no asm diffs) | `System.Buffers.Tests.ReadOnlySequenceTests<Byte>.FirstSingleSegment` | | | | | |
| Noisy (no asm diffs) | `System.Memory.ReadOnlySequence.Slice_Repeat_StartPosition_And_EndPosition(Segment: Multiple)` | | | | | |
| Bimodal (no asm diffs) | System.Tests.Perf_Type.op_Equality | | | | | |
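
For reference, the ratio and quality columns can be recomputed from raw measurements roughly as follows. This is a minimal Python sketch under two assumptions: the ratio is the default median divided by the promotion median (which matches the rows above, e.g. 31312.13 / 62373.18 ≈ 0.502), and the standard deviation is the sample standard deviation (the actual query may use a different estimator). The function names and the sample values are hypothetical.

```python
import statistics


def quality(samples):
    """Median divided by standard deviation; larger means more stable."""
    return statistics.median(samples) / statistics.stdev(samples)


def ratio(default_samples, promotion_samples):
    """Default median over promotion median; below 1.0 means promotion is slower."""
    return statistics.median(default_samples) / statistics.median(promotion_samples)


def is_regression(r, threshold=0.95):
    """A ratio below 0.95 corresponds to a regression of more than 5%."""
    return r < threshold


# Made-up timings (not from the perf lab), chosen so the arithmetic is easy to check.
default_run = [10.0, 11.0, 12.0]
promotion_run = [20.0, 22.0, 24.0]

r = ratio(default_run, promotion_run)
print(r, is_regression(r))   # 0.5 True
print(quality(default_run))  # 11.0
```

A low quality value on either side (as in the `CastingPerf.Check*` rows) is a hint that the ratio may reflect noise or bimodality rather than a real codegen difference.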

jakobbotsch commented 1 year ago

System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark

Same as https://github.com/dotnet/runtime/issues/76928#issuecomment-1582560100:

```diff
@@ -92,19 +92,23 @@ G_M3814_IG03: ;; offset=003BH
 ; Final local variable assignments
 ;
 ;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
-;* V01 loc0 [V01,T00] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] ld-addr-op
+;* V01 loc0 [V01 ] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] ld-addr-op
 ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (64) zero-ref do-not-enreg[S] ld-addr-op "Inline stloc first use temp"
 ;* V04 tmp2 [V04 ] ( 0, 0 ) struct (64) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V05 tmp3 [V05,T01] ( 3, 2 ) bool -> rax "Inline return value spill temp"
-;* V06 tmp4 [V06,T06] ( 0, 0 ) simd16 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
-;* V07 tmp5 [V07,T07] ( 0, 0 ) simd16 -> zero-ref single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
-;* V08 tmp6 [V08,T08] ( 0, 0 ) simd16 -> zero-ref single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
-;* V09 tmp7 [V09,T09] ( 0, 0 ) simd16 -> zero-ref single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
-; V10 cse0 [V10,T02] ( 3, 3 ) simd16 -> mm0 "CSE - aggressive"
-; V11 cse1 [V11,T03] ( 3, 2 ) simd16 -> mm1 "CSE - aggressive"
-; V12 cse2 [V12,T04] ( 3, 2 ) simd16 -> mm2 "CSE - aggressive"
-; V13 cse3 [V13,T05] ( 3, 2 ) simd16 -> mm3 "CSE - aggressive"
+; V05 tmp3 [V05,T00] ( 3, 2 ) bool -> rax "Inline return value spill temp"
+;* V06 tmp4 [V06,T05] ( 0, 0 ) simd16 -> zero-ref single-def V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
+;* V07 tmp5 [V07,T06] ( 0, 0 ) simd16 -> zero-ref single-def V04.Y(offs=0x10) P-INDEP "field V04.Y (fldOffset=0x10)"
+;* V08 tmp6 [V08,T07] ( 0, 0 ) simd16 -> zero-ref single-def V04.Z(offs=0x20) P-INDEP "field V04.Z (fldOffset=0x20)"
+;* V09 tmp7 [V09,T08] ( 0, 0 ) simd16 -> zero-ref single-def V04.W(offs=0x30) P-INDEP "field V04.W (fldOffset=0x30)"
+;* V10 tmp8 [V10 ] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[000..016)"
+;* V11 tmp9 [V11,T09] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[016..032)"
+;* V12 tmp10 [V12,T10] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[032..048)"
+;* V13 tmp11 [V13,T11] ( 0, 0 ) simd16 -> zero-ref single-def "V01.[048..064)"
+; V14 cse0 [V14,T01] ( 2, 2 ) simd16 -> mm0 "CSE - aggressive"
+; V15 cse1 [V15,T02] ( 2, 1.50) simd16 -> mm1 "CSE - aggressive"
+; V16 cse2 [V16,T03] ( 2, 1.50) simd16 -> mm2 "CSE - aggressive"
+; V17 cse3 [V17,T04] ( 2, 1.50) simd16 -> mm3 "CSE - aggressive"
 ;
 ; Lcl frame size = 0
@@ -116,31 +120,31 @@ G_M3814_IG02: ;; offset=0003H
        vmovups xmm1, xmmword ptr [reloc @RWD16]
        vmovups xmm2, xmmword ptr [reloc @RWD32]
        vmovups xmm3, xmmword ptr [reloc @RWD48]
-       vcmpps xmm0, xmm0, xmm0, 0
+       vcmpps xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
        vmovmskps rax, xmm0
        cmp eax, 15
        jne SHORT G_M3814_IG04
-       ;; size=46 bbWeight=1 PerfScore 18.25
-G_M3814_IG03: ;; offset=0031H
-       vcmpps xmm0, xmm1, xmm1, 0
+       ;; size=50 bbWeight=1 PerfScore 18.25
+G_M3814_IG03: ;; offset=0035H
+       vcmpps xmm0, xmm1, xmmword ptr [reloc @RWD16], 0
        vmovmskps rax, xmm0
        cmp eax, 15
        jne SHORT G_M3814_IG04
-       vcmpps xmm0, xmm2, xmm2, 0
+       vcmpps xmm0, xmm2, xmmword ptr [reloc @RWD32], 0
        vmovmskps rax, xmm0
        cmp eax, 15
        jne SHORT G_M3814_IG04
-       vcmpps xmm0, xmm3, xmm3, 0
+       vcmpps xmm0, xmm3, xmmword ptr [reloc @RWD48], 0
        vmovmskps rax, xmm0
        cmp eax, 15
        sete al
        movzx rax, al
        jmp SHORT G_M3814_IG05
-       ;; size=48 bbWeight=0.50 PerfScore 10.50
-G_M3814_IG04: ;; offset=0061H
+       ;; size=60 bbWeight=0.50 PerfScore 10.50
+G_M3814_IG04: ;; offset=0071H
        xor eax, eax
        ;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M3814_IG05: ;; offset=0063H
+G_M3814_IG05: ;; offset=0073H
        ret
        ;; size=1 bbWeight=1 PerfScore 1.00
 RWD00 dq 000000003F800000h, 0000000000000000h
@@ -149,7 +153,7 @@ RWD32 dq 0000000000000000h, 000000003F800000h
 RWD48 dq 0000000000000000h, 3F80000000000000h
-; Total bytes of code 100, prolog size 3, PerfScore 40.88, instruction count 25, allocated bytes for code 100 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
+; Total bytes of code 116, prolog size 3, PerfScore 42.48, instruction count 25, allocated bytes for code 116 (MethodHash=8a71f119) for method Program:IsIdentityBenchmark():bool:this
 ; ============================================================
-225.8 ms
+267.6 ms
```

jakobbotsch commented 1 year ago

System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark

```diff
@@ -131,13 +131,13 @@ G_M5743_IG03: ;; offset=0077H
 ; V01 RetBuf [V01,T00] ( 6, 6 ) byref -> rdx single-def
 ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "impAppendStmt"
-;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument"
+;* V04 tmp2 [V04,T02] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument"
 ;* V05 tmp3 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
 ;* V06 tmp4 [V06 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
 ;* V07 tmp5 [V07 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
 ;* V08 tmp6 [V08 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
 ; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+00H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg"
-;* V10 tmp8 [V10,T02] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
+;* V10 tmp8 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg"
 ; V11 tmp9 [V11,T01] ( 4, 8 ) byref -> rax single-def "impAppendStmt"
 ;* V12 tmp10 [V12 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp"
 ;* V13 tmp11 [V13 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
@@ -159,8 +159,11 @@ G_M5743_IG03: ;; offset=0077H
 ; V29 tmp27 [V29,T05] ( 2, 2 ) simd8 -> mm0 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)"
 ; V30 tmp28 [V30,T06] ( 2, 2 ) simd8 -> mm1 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)"
 ; V31 tmp29 [V31,T07] ( 2, 2 ) simd8 -> mm2 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)"
-; V32 cse0 [V32,T03] ( 2, 2 ) simd8 -> mm0 "CSE - aggressive"
-; V33 cse1 [V33,T04] ( 2, 2 ) simd8 -> mm1 "CSE - aggressive"
+;* V32 tmp30 [V32,T14] ( 0, 0 ) simd8 -> zero-ref "V10.[000..008)"
+;* V33 tmp31 [V33,T15] ( 0, 0 ) simd8 -> zero-ref "V10.[008..016)"
+;* V34 tmp32 [V34,T16] ( 0, 0 ) simd8 -> zero-ref "V10.[016..024)"
+;* V35 cse0 [V35,T03] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
+;* V36 cse1 [V36,T04] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive"
 ;
 ; Lcl frame size = 24
@@ -170,18 +173,18 @@ G_M5743_IG01: ;; offset=0000H
        ;; size=7 bbWeight=1 PerfScore 1.25
 G_M5743_IG02: ;; offset=0007H
        vmovsd xmm0, qword ptr [reloc @RWD00]
-       vmovsd xmm1, qword ptr [reloc @RWD08]
-       vmovsd xmm2, qword ptr [reloc @RWD00]
-       vmovsd qword ptr [rsp], xmm2
-       vmovsd xmm2, qword ptr [reloc @RWD08]
-       vmovsd qword ptr [rsp+08H], xmm2
-       vxorps xmm2, xmm2, xmm2
-       vmovsd qword ptr [rsp+10H], xmm2
+       vmovsd qword ptr [rsp], xmm0
+       vmovsd xmm0, qword ptr [reloc @RWD08]
+       vmovsd qword ptr [rsp+08H], xmm0
+       vxorps xmm0, xmm0, xmm0
+       vmovsd qword ptr [rsp+10H], xmm0
        lea rax, bword ptr [rsp]
-       vmovsd xmm2, qword ptr [rax]
-       vsubps xmm0, xmm2, xmm0
-       vmovsd xmm2, qword ptr [rax+08H]
-       vsubps xmm1, xmm2, xmm1
+       vmovsd xmm0, qword ptr [rax]
+       vmovsd xmm1, qword ptr [reloc @RWD00]
+       vsubps xmm0, xmm0, xmm1
+       vmovsd xmm1, qword ptr [rax+08H]
+       vmovsd xmm2, qword ptr [reloc @RWD08]
+       vsubps xmm1, xmm1, xmm2
        vmovsd xmm2, qword ptr [rax+10H]
        vxorps xmm3, xmm3, xmm3
        vsubps xmm2, xmm2, xmm3
@@ -201,4 +204,4 @@ RWD08 dq 3F80000000000000h
 ; Total bytes of code 116, prolog size 7, PerfScore 57.52, instruction count 24, allocated bytes for code 116 (MethodHash=9699e990) for method Program:SubtractBenchmark():System.Numerics.Matrix3x2:this
```

Looks like physical promotion ends up with slightly different pipelining, which seems worse in the lab (however, on my laptop's Intel CPU it sometimes seems to be faster than the original).

The codegen for this benchmark is terrible with and without physical promotion. The problem is that we end up address exposing `V09`: the JIT is not able to see through the `AsImpl()` calls with full fidelity. If we change `AsImpl` to return by value instead of by ref then the problem is solved and the benchmark reduces to a vector constant. At the same time we can switch to `Unsafe.BitCast`. Does that seem reasonable @tannergooding?

`System.Numerics.Tests.Perf_Matrix3x2.AddOperatorBenchmark`, `System.Numerics.Tests.Perf_Matrix3x2.SubtractOperatorBenchmark`, `System.Numerics.Tests.Perf_Matrix3x2.AddBenchmark` and `System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark` are all affected similarly.

jakobbotsch commented 1 year ago

System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark

```diff @@ -128,126 +128,140 @@ G_M38613_IG03: ;; offset=0077H ; Final local variable assignments ; ;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def -; V01 RetBuf [V01,T02] ( 6, 6 ) byref -> rdx single-def +; V01 RetBuf [V01,T01] ( 6, 6 ) byref -> rdx single-def ;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace" ;* V03 tmp1 [V03 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "impAppendStmt" ;* V04 tmp2 [V04 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] "spilled call-like call argument" ;* V05 tmp3 [V05 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" ;* V06 tmp4 [V06 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" ;* V07 tmp5 [V07 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" -;* V08 tmp6 [V08 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" -; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+18H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg" -; V10 tmp8 [V10,T00] ( 9, 18 ) struct (24) [rsp+00H] do-not-enreg[SF] ld-addr-op "Inlining Arg" -; V11 tmp9 [V11,T01] ( 7, 14 ) byref -> rax single-def "impAppendStmt" +; V08 tmp6 [V08 ] ( 9, 9 ) struct (24) [rsp+18H] do-not-enreg[SF] ld-addr-op "Inline ldloca(s) first use temp" +; V09 tmp7 [V09 ] ( 4, 8 ) struct (24) [rsp+00H] do-not-enreg[XS] addr-exposed ld-addr-op "Inlining Arg" +;* V10 tmp8 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[SF] ld-addr-op "Inlining Arg" +; V11 tmp9 [V11,T00] ( 7, 14 ) byref -> rax single-def "impAppendStmt" ;* V12 tmp10 [V12 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline stloc first use temp" ;* V13 tmp11 [V13 ] ( 0, 0 ) struct (24) zero-ref ld-addr-op "Inline ldloca(s) first use temp" -; V14 tmp12 [V14,T07] ( 2, 4 ) simd8 -> mm0 ld-addr-op "NewObj constructor temp" -; V15 tmp13 [V15,T08] ( 2, 4 ) simd8 -> mm2 ld-addr-op "NewObj constructor temp" -; V16 tmp14 [V16,T09] ( 2, 4 ) simd8 -> mm1 ld-addr-op 
"NewObj constructor temp" +; V14 tmp12 [V14,T09] ( 2, 4 ) simd8 -> mm6 ld-addr-op "NewObj constructor temp" +; V15 tmp13 [V15,T10] ( 2, 4 ) simd8 -> mm7 ld-addr-op "NewObj constructor temp" +; V16 tmp14 [V16,T11] ( 2, 4 ) simd8 -> mm0 ld-addr-op "NewObj constructor temp" ;* V17 tmp15 [V17 ] ( 0, 0 ) simd8 -> zero-ref V05.X(offs=0x00) P-INDEP "field V05.X (fldOffset=0x0)" ;* V18 tmp16 [V18 ] ( 0, 0 ) simd8 -> zero-ref V05.Y(offs=0x08) P-INDEP "field V05.Y (fldOffset=0x8)" ;* V19 tmp17 [V19 ] ( 0, 0 ) simd8 -> zero-ref V05.Z(offs=0x10) P-INDEP "field V05.Z (fldOffset=0x10)" -;* V20 tmp18 [V20,T21] ( 0, 0 ) simd8 -> zero-ref V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)" -;* V21 tmp19 [V21,T22] ( 0, 0 ) simd8 -> zero-ref V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)" -;* V22 tmp20 [V22,T23] ( 0, 0 ) simd8 -> zero-ref V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)" +;* V20 tmp18 [V20,T25] ( 0, 0 ) simd8 -> zero-ref V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)" +;* V21 tmp19 [V21,T26] ( 0, 0 ) simd8 -> zero-ref V06.Y(offs=0x08) P-INDEP "field V06.Y (fldOffset=0x8)" +;* V22 tmp20 [V22,T27] ( 0, 0 ) simd8 -> zero-ref V06.Z(offs=0x10) P-INDEP "field V06.Z (fldOffset=0x10)" ;* V23 tmp21 [V23 ] ( 0, 0 ) simd8 -> zero-ref V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)" ;* V24 tmp22 [V24 ] ( 0, 0 ) simd8 -> zero-ref V07.Y(offs=0x08) P-INDEP "field V07.Y (fldOffset=0x8)" ;* V25 tmp23 [V25 ] ( 0, 0 ) simd8 -> zero-ref V07.Z(offs=0x10) P-INDEP "field V07.Z (fldOffset=0x10)" -;* V26 tmp24 [V26,T24] ( 0, 0 ) simd8 -> zero-ref V08.X(offs=0x00) P-INDEP "field V08.X (fldOffset=0x0)" -;* V27 tmp25 [V27,T25] ( 0, 0 ) simd8 -> zero-ref V08.Y(offs=0x08) P-INDEP "field V08.Y (fldOffset=0x8)" -;* V28 tmp26 [V28,T26] ( 0, 0 ) simd8 -> zero-ref V08.Z(offs=0x10) P-INDEP "field V08.Z (fldOffset=0x10)" +; V26 tmp24 [V26,T02] ( 7, 7 ) simd8 -> [rsp+18H] do-not-enreg[S] V08.X(offs=0x00) P-DEP "field V08.X (fldOffset=0x0)" +; V27 tmp25 [V27,T03] ( 7, 7 ) 
simd8 -> [rsp+20H] do-not-enreg[S] V08.Y(offs=0x08) P-DEP "field V08.Y (fldOffset=0x8)" +; V28 tmp26 [V28,T04] ( 7, 7 ) simd8 -> [rsp+28H] do-not-enreg[S] V08.Z(offs=0x10) P-DEP "field V08.Z (fldOffset=0x10)" ;* V29 tmp27 [V29 ] ( 0, 0 ) simd8 -> zero-ref V12.X(offs=0x00) P-INDEP "field V12.X (fldOffset=0x0)" ;* V30 tmp28 [V30 ] ( 0, 0 ) simd8 -> zero-ref V12.Y(offs=0x08) P-INDEP "field V12.Y (fldOffset=0x8)" ;* V31 tmp29 [V31 ] ( 0, 0 ) simd8 -> zero-ref V12.Z(offs=0x10) P-INDEP "field V12.Z (fldOffset=0x10)" -; V32 tmp30 [V32,T18] ( 2, 2 ) simd8 -> mm0 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)" -; V33 tmp31 [V33,T19] ( 2, 2 ) simd8 -> mm2 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)" -; V34 tmp32 [V34,T20] ( 2, 2 ) simd8 -> mm1 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)" -; V35 cse0 [V35,T10] ( 3, 3 ) float -> mm3 "CSE - aggressive" -; V36 cse1 [V36,T11] ( 3, 3 ) float -> mm2 "CSE - aggressive" -; V37 cse2 [V37,T12] ( 3, 3 ) float -> mm7 "CSE - aggressive" -; V38 cse3 [V38,T13] ( 3, 3 ) float -> mm3 "CSE - aggressive" -; V39 cse4 [V39,T14] ( 3, 3 ) float -> mm7 "CSE - aggressive" -; V40 cse5 [V40,T03] ( 4, 4 ) float -> mm1 "CSE - aggressive" -; V41 cse6 [V41,T04] ( 4, 4 ) float -> mm4 "CSE - aggressive" -; V42 cse7 [V42,T05] ( 4, 4 ) float -> mm5 "CSE - aggressive" -; V43 cse8 [V43,T06] ( 4, 4 ) float -> mm6 "CSE - aggressive" -;* V44 cse9 [V44,T15] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive" -;* V45 cse10 [V45,T16] ( 0, 0 ) simd8 -> zero-ref "CSE - aggressive" -; V46 cse11 [V46,T17] ( 3, 3 ) float -> mm0 "CSE - aggressive" +; V32 tmp30 [V32,T20] ( 2, 2 ) simd8 -> mm6 V13.X(offs=0x00) P-INDEP "field V13.X (fldOffset=0x0)" +; V33 tmp31 [V33,T21] ( 2, 2 ) simd8 -> mm7 V13.Y(offs=0x08) P-INDEP "field V13.Y (fldOffset=0x8)" +; V34 tmp32 [V34,T22] ( 2, 2 ) simd8 -> mm0 V13.Z(offs=0x10) P-INDEP "field V13.Z (fldOffset=0x10)" +; V35 tmp33 [V35,T05] ( 4, 4 ) float -> mm0 "V04.[000..004)" +; V36 tmp34 [V36,T06] ( 4, 4 ) float -> mm1 
"V04.[004..008)" +; V37 tmp35 [V37,T07] ( 4, 4 ) float -> mm2 "V04.[008..012)" +; V38 tmp36 [V38,T08] ( 4, 4 ) float -> mm3 "V04.[012..016)" +; V39 tmp37 [V39,T23] ( 2, 2 ) float -> mm4 "V04.[016..020)" +; V40 tmp38 [V40,T24] ( 2, 2 ) float -> mm5 "V04.[020..024)" +;* V41 tmp39 [V41 ] ( 0, 0 ) float -> zero-ref "V10.[000..004)" +;* V42 tmp40 [V42 ] ( 0, 0 ) float -> zero-ref "V10.[004..008)" +;* V43 tmp41 [V43 ] ( 0, 0 ) float -> zero-ref "V10.[008..012)" +;* V44 tmp42 [V44 ] ( 0, 0 ) float -> zero-ref "V10.[012..016)" +;* V45 tmp43 [V45 ] ( 0, 0 ) float -> zero-ref "V10.[016..020)" +;* V46 tmp44 [V46 ] ( 0, 0 ) float -> zero-ref "V10.[020..024)" +; V47 cse0 [V47,T12] ( 3, 3 ) float -> mm8 "CSE - aggressive" +; V48 cse1 [V48,T13] ( 3, 3 ) float -> mm7 "CSE - aggressive" +; V49 cse2 [V49,T14] ( 3, 3 ) float -> mm9 "CSE - aggressive" +; V50 cse3 [V50,T15] ( 3, 3 ) float -> mm8 "CSE - aggressive" +; V51 cse4 [V51,T16] ( 3, 3 ) float -> mm9 "CSE - aggressive" +; V52 cse5 [V52,T17] ( 2, 2 ) simd8 -> mm0 "CSE - aggressive" +; V53 cse6 [V53,T18] ( 2, 2 ) simd8 -> mm1 "CSE - aggressive" +; V54 cse7 [V54,T19] ( 3, 3 ) float -> mm6 "CSE - aggressive" ; -; Lcl frame size = 104 +; Lcl frame size = 136 G_M38613_IG01: ;; offset=0000H - sub rsp, 104 + sub rsp, 136 vzeroupper - vmovaps xmmword ptr [rsp+50H], xmm6 - vmovaps xmmword ptr [rsp+40H], xmm7 - vmovaps xmmword ptr [rsp+30H], xmm8 - ;; size=25 bbWeight=1 PerfScore 7.25 -G_M38613_IG02: ;; offset=0019H + vmovaps xmmword ptr [rsp+70H], xmm6 + vmovaps xmmword ptr [rsp+60H], xmm7 + vmovaps xmmword ptr [rsp+50H], xmm8 + vmovaps xmmword ptr [rsp+40H], xmm9 + vmovaps xmmword ptr [rsp+30H], xmm10 + ;; size=40 bbWeight=1 PerfScore 11.25 +G_M38613_IG02: ;; offset=0028H vmovsd xmm0, qword ptr [reloc @RWD00] + vmovsd xmm1, qword ptr [reloc @RWD08] vmovsd qword ptr [rsp+18H], xmm0 - vmovsd xmm0, qword ptr [reloc @RWD08] - vmovsd qword ptr [rsp+20H], xmm0 + vmovsd qword ptr [rsp+20H], xmm1 vxorps xmm0, xmm0, xmm0 vmovsd qword ptr 
[rsp+28H], xmm0
-       vmovsd xmm0, qword ptr [reloc @RWD00]
-       vmovsd qword ptr [rsp], xmm0
-       vmovsd xmm0, qword ptr [reloc @RWD08]
-       vmovsd qword ptr [rsp+08H], xmm0
-       vxorps xmm0, xmm0, xmm0
-       vmovsd qword ptr [rsp+10H], xmm0
-       lea rax, bword ptr [rsp+18H]
-       vmovss xmm0, dword ptr [rax]
-       vmovss xmm1, dword ptr [rsp]
-       vmulss xmm2, xmm0, xmm1
-       vmovss xmm3, dword ptr [rax+04H]
-       vmovss xmm4, dword ptr [rsp+08H]
-       vmulss xmm5, xmm3, xmm4
-       vaddss xmm2, xmm2, xmm5
-       vmovss xmm5, dword ptr [rsp+04H]
-       vmulss xmm0, xmm0, xmm5
-       vmovss xmm6, dword ptr [rsp+0CH]
-       vmulss xmm3, xmm3, xmm6
-       vaddss xmm0, xmm0, xmm3
-       vinsertps xmm0, xmm2, xmm0, 28
-       vmovss xmm2, dword ptr [rax+08H]
-       vmulss xmm3, xmm2, xmm1
-       vmovss xmm7, dword ptr [rax+0CH]
-       vmulss xmm8, xmm7, xmm4
-       vaddss xmm3, xmm3, xmm8
-       vmulss xmm2, xmm2, xmm5
-       vmulss xmm7, xmm7, xmm6
-       vaddss xmm2, xmm2, xmm7
-       vinsertps xmm2, xmm3, xmm2, 28
-       vmovss xmm3, dword ptr [rax+10H]
-       vmulss xmm1, xmm3, xmm1
-       vmovss xmm7, dword ptr [rax+14H]
-       vmulss xmm4, xmm7, xmm4
-       vaddss xmm1, xmm1, xmm4
-       vaddss xmm1, xmm1, dword ptr [rsp+10H]
-       vmulss xmm3, xmm3, xmm5
-       vmulss xmm4, xmm7, xmm6
-       vaddss xmm3, xmm3, xmm4
-       vaddss xmm3, xmm3, dword ptr [rsp+14H]
-       vinsertps xmm1, xmm1, xmm3, 28
-       vmovsd qword ptr [rdx], xmm0
-       vmovsd qword ptr [rdx+08H], xmm2
-       vmovsd qword ptr [rdx+10H], xmm1
+       vmovss xmm0, dword ptr [rsp+18H]
+       vmovss xmm1, dword ptr [rsp+1CH]
+       vmovss xmm2, dword ptr [rsp+20H]
+       vmovss xmm3, dword ptr [rsp+24H]
+       vmovss xmm4, dword ptr [rsp+28H]
+       vmovss xmm5, dword ptr [rsp+2CH]
+       vmovsd xmm6, qword ptr [reloc @RWD00]
+       vmovsd qword ptr [rsp], xmm6
+       vmovsd xmm6, qword ptr [reloc @RWD08]
+       vmovsd qword ptr [rsp+08H], xmm6
+       vxorps xmm6, xmm6, xmm6
+       vmovsd qword ptr [rsp+10H], xmm6
+       lea rax, bword ptr [rsp]
+       vmovss xmm6, dword ptr [rax]
+       vmulss xmm7, xmm6, xmm0
+       vmovss xmm8, dword ptr [rax+04H]
+       vmulss xmm9, xmm8, xmm2
+       vaddss xmm7, xmm7, xmm9
+       vmulss xmm6, xmm6, xmm1
+       vmulss xmm8, xmm8, xmm3
+       vaddss xmm6, xmm6, xmm8
+       vinsertps xmm6, xmm7, xmm6, 28
+       vmovss xmm7, dword ptr [rax+08H]
+       vmulss xmm8, xmm7, xmm0
+       vmovss xmm9, dword ptr [rax+0CH]
+       vmulss xmm10, xmm9, xmm2
+       vaddss xmm8, xmm8, xmm10
+       vmulss xmm7, xmm7, xmm1
+       vmulss xmm9, xmm9, xmm3
+       vaddss xmm7, xmm7, xmm9
+       vinsertps xmm7, xmm8, xmm7, 28
+       vmovss xmm8, dword ptr [rax+10H]
+       vmulss xmm0, xmm8, xmm0
+       vmovss xmm9, dword ptr [rax+14H]
+       vmulss xmm2, xmm9, xmm2
+       vaddss xmm0, xmm0, xmm2
+       vaddss xmm0, xmm0, xmm4
+       vmulss xmm1, xmm8, xmm1
+       vmulss xmm2, xmm9, xmm3
+       vaddss xmm1, xmm1, xmm2
+       vaddss xmm1, xmm1, xmm5
+       vinsertps xmm0, xmm0, xmm1, 28
+       vmovsd qword ptr [rdx], xmm6
+       vmovsd qword ptr [rdx+08H], xmm7
+       vmovsd qword ptr [rdx+10H], xmm0
        mov rax, rdx
-       ;; size=252 bbWeight=1 PerfScore 128.42
-G_M38613_IG03: ;; offset=0115H
-       vmovaps xmm6, xmmword ptr [rsp+50H]
-       vmovaps xmm7, xmmword ptr [rsp+40H]
-       vmovaps xmm8, xmmword ptr [rsp+30H]
-       add rsp, 104
+       ;; size=263 bbWeight=1 PerfScore 130.42
+G_M38613_IG03: ;; offset=012FH
+       vmovaps xmm6, xmmword ptr [rsp+70H]
+       vmovaps xmm7, xmmword ptr [rsp+60H]
+       vmovaps xmm8, xmmword ptr [rsp+50H]
+       vmovaps xmm9, xmmword ptr [rsp+40H]
+       vmovaps xmm10, xmmword ptr [rsp+30H]
+       add rsp, 136
        ret
-       ;; size=23 bbWeight=1 PerfScore 13.25
+       ;; size=38 bbWeight=1 PerfScore 21.25
 RWD00 dq 000000003F800000h
 RWD08 dq 3F80000000000000h
 
-; Total bytes of code 300, prolog size 25, PerfScore 178.92, instruction count 60, allocated bytes for code 300 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
+; Total bytes of code 341, prolog size 40, PerfScore 197.02, instruction count 66, allocated bytes for code 341 (MethodHash=f176692a) for method Program:MultiplyByMatrixOperatorBenchmark():Program+Matrix3x2:this
 
-499.7 ms
+555.9 ms
```

We need some more registers and also see the same kind of pipelining change as in the previous comment, but in addition we also DNER `V08` due to #87554.
`System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark` is similarly affected.

jakobbotsch commented 1 year ago

Promoting TYP_SIMD32 and TYP_SIMD64 fields can be very expensive if we end up creating long lifetimes that span across calls where the upper halves need to be saved/restored. For example: https://gist.github.com/jakobbotsch/e09b0e75ecfac6934ae51c8902748491

The pass does not currently have the necessary information to take this into account, so we need to think about what to do here.
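
As a purely illustrative sketch of the trade-off described above (not something the pass does today), a profitability check could weigh the accesses that promotion turns into register uses against the upper-half save/restore work at each call the field's lifetime spans. Everything here is hypothetical: `FieldCandidate`, `should_promote`, and both cost constants are made up for the example and do not correspond to real JIT data structures or tuned values.

```python
from dataclasses import dataclass

# Made-up unit costs for illustration only; a real JIT would derive these differently.
MEMORY_ACCESS_COST = 1.0
UPPER_HALF_SAVE_RESTORE_COST = 2.0  # one save plus one restore around a call


@dataclass
class FieldCandidate:
    """Hypothetical summary of a promotion candidate (not a real JIT type)."""
    size_bytes: int                # 32 for TYP_SIMD32, 64 for TYP_SIMD64
    weighted_accesses: float       # profile-weighted accesses that would become register uses
    weighted_calls_crossed: float  # profile-weighted calls inside the field's live range


def should_promote(field: FieldCandidate) -> bool:
    """Promote only if the estimated savings exceed the estimated call penalty."""
    savings = field.weighted_accesses * MEMORY_ACCESS_COST
    if field.size_bytes < 32:
        # This sketch charges no upper-half penalty for narrower fields.
        return savings > 0
    penalty = field.weighted_calls_crossed * UPPER_HALF_SAVE_RESTORE_COST
    return savings > penalty
```

The hard part, as noted above, is that the pass would first need a way to estimate `weighted_calls_crossed`, which depends on lifetime information it does not currently have.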

jakobbotsch commented 1 year ago

Looking at this perfscore regression:

306966.04 ( 0.29% of base) : 53160.dasm - Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this

```diff
@@ -19,7 +19,7 @@
 ;* V07 loc4 [V07 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op
 ; V08 OutArgs [V08 ] ( 1, 1 ) struct (40) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V09 tmp1 [V09 ] ( 0, 0 ) struct (16) zero-ref "impAppendStmt"
-; V10 tmp2 [V10,T03] ( 4,1668919.51) struct (24) [rsp+90H] do-not-enreg[S] ld-addr-op "NewObj constructor temp"
+;* V10 tmp2 [V10 ] ( 0, 0 ) struct (24) zero-ref do-not-enreg[S] ld-addr-op "NewObj constructor temp"
 ;* V11 tmp3 [V11 ] ( 0, 0 ) struct (16) zero-ref "spilled call-like call argument"
 ;* V12 tmp4 [V12 ] ( 0, 0 ) int -> zero-ref "Strict ordering of exceptions for Array store"
 ;* V13 tmp5 [V13 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
@@ -31,37 +31,37 @@
 ;* V19 tmp11 [V19 ] ( 0, 0 ) struct (16) zero-ref "spilled call-like call argument"
 ;* V20 tmp12 [V20 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
 ;* V21 tmp13 [V21 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V22 tmp14 [V22,T36] ( 2, 834459.76) double -> mm1 "Inlining Arg"
+; V22 tmp14 [V22,T38] ( 2, 834459.76) double -> mm1 "Inlining Arg"
 ;* V23 tmp15 [V23 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
 ;* V24 tmp16 [V24 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
 ;* V25 tmp17 [V25 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V26 tmp18 [V26,T37] ( 2, 834459.76) double -> mm3 "Inlining Arg"
+; V26 tmp18 [V26,T39] ( 2, 834459.76) double -> mm2 "Inlining Arg"
 ;* V27 tmp19 [V27 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
 ;* V28 tmp20 [V28 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
 ;* V29 tmp21 [V29 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
-; V30 tmp22 [V30,T46] ( 3, 625844.82) float -> mm1 "Inline stloc first use temp"
-; V31 tmp23 [V31,T50] ( 3, 417229.88) float -> mm9
+; V30 tmp22 [V30,T44] ( 3, 625844.82) float -> mm1 "Inline stloc first use temp"
+; V31 tmp23 [V31,T48] ( 3, 417229.88) float -> mm9
 ;* V32 tmp24 [V32 ] ( 0, 0 ) float -> zero-ref "Inline stloc first use temp"
 ;* V33 tmp25 [V33 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
 ;* V34 tmp26 [V34 ] ( 0, 0 ) double -> zero-ref "Inlining Arg"
-; V35 tmp27 [V35 ] ( 3, 417229.88) struct (16) [rsp+80H] do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
-; V36 tmp28 [V36,T04] ( 3,1661092.31) struct (24) [rsp+68H] do-not-enreg[S] "Inlining Arg"
+; V35 tmp27 [V35 ] ( 3, 417229.88) struct (16) [rsp+88H] do-not-enreg[XS] must-init addr-exposed "Inline return value spill temp"
+; V36 tmp28 [V36,T07] ( 3,1418003.17) struct (24) [rsp+70H] do-not-enreg[S] "Inlining Arg"
 ; V37 tmp29 [V37,T21] ( 3, 546589.36) ref -> r8 class-hnd "Inline stloc first use temp"
 ;* V38 tmp30 [V38 ] ( 0, 0 ) ref -> zero-ref class-hnd "Inline return value spill temp"
 ; V39 tmp31 [V39,T20] ( 5, 577728.02) ref -> r13 class-hnd "Inline stloc first use temp"
-; V40 tmp32 [V40,T10] ( 3,1039161.09) ref -> [rsp+38H] class-hnd spill-single-def "Inline stloc first use temp"
-; V41 tmp33 [V41,T01] ( 5,2696339.80) int -> [rsp+64H] "Inline stloc first use temp"
-; V42 tmp34 [V42,T02] ( 4,1865793.64) ref -> r8 class-hnd "Inline stloc first use temp"
+; V40 tmp32 [V40,T10] ( 3,1039161.09) ref -> [rsp+40H] class-hnd spill-single-def "Inline stloc first use temp"
+; V41 tmp33 [V41,T01] ( 5,2696339.80) int -> [rsp+6CH] "Inline stloc first use temp"
+; V42 tmp34 [V42,T03] ( 4,1865793.64) ref -> r8 class-hnd "Inline stloc first use temp"
 ; V43 tmp35 [V43,T08] ( 4,1324112.02) ref -> r8 class-hnd "Inline stloc first use temp"
-; V44 tmp36 [V44,T05] ( 4,1531185.21) ref -> r8 "guarded devirt return temp"
-; V45 tmp37 [V45,T06] ( 4,1484307.13) ref -> [rsp+30H] class-hnd exact spill-single-def "guarded devirt this exact temp"
+; V44 tmp36 [V44,T04] ( 4,1531185.21) ref -> r8 "guarded devirt return temp"
+; V45 tmp37 [V45,T05] ( 4,1484307.13) ref -> [rsp+38H] class-hnd exact spill-single-def "guarded devirt this exact temp"
 ;* V46 tmp38 [V46 ] ( 0, 0 ) struct (16) zero-ref "Inline stloc first use temp"
-; V47 tmp39 [V47,T32] ( 4,1208697.08) float -> mm11 "Inline stloc first use temp"
+; V47 tmp39 [V47,T34] ( 4,1208697.08) float -> mm11 "Inline stloc first use temp"
 ;* V48 tmp40 [V48 ] ( 0, 0 ) double -> zero-ref "impAppendStmt"
-; V49 tmp41 [V49,T44] ( 3, 765017.45) double -> mm0 "Inline stloc first use temp"
-; V50 tmp42 [V50,T45] ( 3, 713112.43) float -> mm10
-; V51 tmp43 [V51,T31] ( 3,1156792.06) float -> mm10 "Inline stloc first use temp"
-; V52 tmp44 [V52,T00] ( 5,3855973.53) ref -> [rsp+28H] class-hnd exact spill-single-def "NewObj constructor temp"
+; V49 tmp41 [V49,T42] ( 3, 765017.45) double -> mm0 "Inline stloc first use temp"
+; V50 tmp42 [V50,T43] ( 3, 713112.43) float -> mm10
+; V51 tmp43 [V51,T33] ( 3,1156792.06) float -> mm10 "Inline stloc first use temp"
+; V52 tmp44 [V52,T00] ( 5,3855973.53) ref -> [rsp+30H] class-hnd exact spill-single-def "NewObj constructor temp"
 ;* V53 tmp45 [V53 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inline ldloca(s) first use temp"
 ;* V54 tmp46 [V54 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
 ;* V55 tmp47 [V55 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
@@ -78,44 +78,47 @@
 ; V66 tmp58 [V66,T24] ( 3, 417229.88) int -> rax "Inline return value spill temp"
 ;* V67 tmp59 [V67 ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
 ; V68 tmp60 [V68,T18] ( 3, 625844.82) int -> rax "Inline stloc first use temp"
-; V69 tmp61 [V69,T33] ( 4, 834459.76) simd12 -> mm0 V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
-; V70 tmp62 [V70,T51] ( 2, 417229.88) simd12 -> mm2 V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
+; V69 tmp61 [V69,T35] ( 4, 834459.76) simd12 -> mm0 V07._simdVector(offs=0x00) P-INDEP "field V07._simdVector (fldOffset=0x0)"
+; V70 tmp62 [V70,T49] ( 2, 417229.88) simd12 -> mm7 V09._simdVector(offs=0x00) P-INDEP "field V09._simdVector (fldOffset=0x0)"
 ;* V71 tmp63 [V71 ] ( 0, 0 ) simd12 -> zero-ref V11._simdVector(offs=0x00) P-INDEP "field V11._simdVector (fldOffset=0x0)"
-; V72 tmp64 [V72,T52] ( 2, 417229.88) simd12 -> mm0 V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
+; V72 tmp64 [V72,T50] ( 2, 417229.88) simd12 -> mm0 V14._simdVector(offs=0x00) P-INDEP "field V14._simdVector (fldOffset=0x0)"
 ;* V73 tmp65 [V73 ] ( 0, 0 ) simd12 -> zero-ref V16._simdVector(offs=0x00) P-INDEP "field V16._simdVector (fldOffset=0x0)"
 ;* V74 tmp66 [V74 ] ( 0, 0 ) simd12 -> zero-ref V17._simdVector(offs=0x00) P-INDEP "field V17._simdVector (fldOffset=0x0)"
 ;* V75 tmp67 [V75 ] ( 0, 0 ) simd12 -> zero-ref V18._simdVector(offs=0x00) P-INDEP "field V18._simdVector (fldOffset=0x0)"
-; V76 tmp68 [V76,T34] ( 4, 834459.76) simd12 -> mm0 V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
-; V77 tmp69 [V77,T53] ( 2, 417229.88) simd12 -> mm1 V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
-; V78 tmp70 [V78,T54] ( 2, 417229.88) simd12 -> mm4 V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
-; V79 tmp71 [V79,T55] ( 2, 417229.88) simd12 -> mm3 V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
-; V80 tmp72 [V80,T56] ( 2, 417229.88) simd12 -> mm4 V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
-; V81 tmp73 [V81,T57] ( 2, 417229.88) simd12 -> mm1 V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
-; V82 tmp74 [V82,T58] ( 2, 417229.88) simd12 -> mm0 V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
-; V83 tmp75 [V83,T59] ( 2, 417229.88) simd12 -> mm0 V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
-; V84 tmp76 [V84 ] ( 3, 417229.88) simd12 -> [rsp+80H] do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
+; V76 tmp68 [V76,T36] ( 4, 834459.76) simd12 -> mm0 V19._simdVector(offs=0x00) P-INDEP "field V19._simdVector (fldOffset=0x0)"
+; V77 tmp69 [V77,T51] ( 2, 417229.88) simd12 -> mm1 V21._simdVector(offs=0x00) P-INDEP "field V21._simdVector (fldOffset=0x0)"
+; V78 tmp70 [V78,T52] ( 2, 417229.88) simd12 -> mm3 V23._simdVector(offs=0x00) P-INDEP "field V23._simdVector (fldOffset=0x0)"
+; V79 tmp71 [V79,T53] ( 2, 417229.88) simd12 -> mm2 V25._simdVector(offs=0x00) P-INDEP "field V25._simdVector (fldOffset=0x0)"
+; V80 tmp72 [V80,T54] ( 2, 417229.88) simd12 -> mm3 V27._simdVector(offs=0x00) P-INDEP "field V27._simdVector (fldOffset=0x0)"
+; V81 tmp73 [V81,T55] ( 2, 417229.88) simd12 -> mm1 V28._simdVector(offs=0x00) P-INDEP "field V28._simdVector (fldOffset=0x0)"
+; V82 tmp74 [V82,T56] ( 2, 417229.88) simd12 -> mm0 V29._simdVector(offs=0x00) P-INDEP "field V29._simdVector (fldOffset=0x0)"
+; V83 tmp75 [V83,T57] ( 2, 417229.88) simd12 -> mm9 V33._simdVector(offs=0x00) P-INDEP "field V33._simdVector (fldOffset=0x0)"
+; V84 tmp76 [V84 ] ( 3, 417229.88) simd12 -> [rsp+88H] do-not-enreg[XS] addr-exposed V35._simdVector(offs=0x00) P-DEP "field V35._simdVector (fldOffset=0x0)"
 ; V85 tmp77 [V85,T29] ( 4,1426224.85) simd12 -> mm10 V46._simdVector(offs=0x00) P-INDEP "field V46._simdVector (fldOffset=0x0)"
 ; V86 tmp78 [V86,T40] ( 2, 771194.71) simd12 -> mm10 V53._simdVector(offs=0x00) P-INDEP "field V53._simdVector (fldOffset=0x0)"
 ; V87 tmp79 [V87,T41] ( 2, 771194.71) simd12 -> mm0 V54._simdVector(offs=0x00) P-INDEP "field V54._simdVector (fldOffset=0x0)"
-; V88 tmp80 [V88,T42] ( 2, 771194.71) simd12 -> mm1 V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
-; V89 tmp81 [V89,T43] ( 2, 771194.71) simd12 -> mm0 V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
-;* V90 tmp82 [V90,T61] ( 0, 0 ) simd12 -> zero-ref V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
-; V91 tmp83 [V91 ] ( 2, 945335.45) struct (24) [rsp+48H] do-not-enreg[XS] addr-exposed "by-value struct argument"
-; V92 tmp84 [V92,T07] ( 3,1418003.17) ref -> r8 "argument with side effect"
-; V93 cse0 [V93,T38] ( 3, 802827.23) simd12 -> mm9 "CSE - aggressive"
-; V94 cse1 [V94,T47] ( 3, 625844.82) double -> mm1 "CSE - moderate"
-; V95 cse2 [V95,T48] ( 3, 625844.82) double -> mm4 "CSE - moderate"
-; V96 cse3 [V96,T60] ( 2, 209609.18) double -> mm6 "CSE - conservative"
-; V97 cse4 [V97,T39] ( 3, 802827.23) simd12 -> mm7 "CSE - aggressive"
-; V98 cse5 [V98,T28] ( 3, 2997.00) int -> rax "CSE - conservative"
-; V99 cse6 [V99,T30] ( 5,1280874.96) double -> mm8 "CSE - aggressive"
-; V100 cse7 [V100,T11] ( 3,1039161.09) int -> [rsp+44H] spill-single-def "CSE - aggressive"
-; V101 cse8 [V101,T35] ( 4, 834459.76) float -> mm2 "CSE - aggressive"
-; V102 cse9 [V102,T49] ( 3, 625844.82) double -> mm3 "CSE - moderate"
-; V103 cse10 [V103,T19] ( 3, 625844.82) int -> rcx "CSE - moderate"
-; TEMP_01 double -> [rsp+0xA8]
+;* V88 tmp80 [V88 ] ( 0, 0 ) simd12 -> zero-ref V55._simdVector(offs=0x00) P-INDEP "field V55._simdVector (fldOffset=0x0)"
+;* V89 tmp81 [V89 ] ( 0, 0 ) simd12 -> zero-ref V56._simdVector(offs=0x00) P-INDEP "field V56._simdVector (fldOffset=0x0)"
+;* V90 tmp82 [V90,T59] ( 0, 0 ) simd12 -> zero-ref V58._simdVector(offs=0x00) P-INDEP "field V58._simdVector (fldOffset=0x0)"
+;* V91 tmp83 [V91 ] ( 0, 0 ) simd12 -> zero-ref "V10.[000..012)"
+;* V92 tmp84 [V92 ] ( 0, 0 ) simd12 -> zero-ref "V10.[012..024)"
+; V93 tmp85 [V93,T31] ( 4,1216143.51) simd12 -> mm7 "V36.[000..012)"
+; V94 tmp86 [V94,T32] ( 4,1216143.51) simd12 -> mm9 "V36.[012..024)"
+; V95 tmp87 [V95,T02] ( 3,2313584.12) byref -> rcx "Spilling address for field-by-field copy"
+; V96 tmp88 [V96 ] ( 2, 945335.45) struct (24) [rsp+50H] do-not-enreg[XS] addr-exposed "by-value struct argument"
+; V97 tmp89 [V97,T06] ( 3,1418003.17) ref -> r8 "argument with side effect"
+; V98 cse0 [V98,T45] ( 3, 625844.82) double -> mm1 "CSE - moderate"
+; V99 cse1 [V99,T46] ( 3, 625844.82) double -> mm3 "CSE - moderate"
+; V100 cse2 [V100,T58] ( 2, 209609.18) double -> mm6 "CSE - conservative"
+; V101 cse3 [V101,T28] ( 3, 2997.00) int -> rax "CSE - conservative"
+; V102 cse4 [V102,T30] ( 5,1280874.96) double -> mm8 "CSE - aggressive"
+; V103 cse5 [V103,T11] ( 3,1039161.09) int -> [rsp+4CH] spill-single-def "CSE - aggressive"
+; V104 cse6 [V104,T37] ( 4, 834459.76) float -> mm2 "CSE - moderate"
+; V105 cse7 [V105,T47] ( 3, 625844.82) double -> mm2 "CSE - moderate"
+; V106 cse8 [V106,T19] ( 3, 625844.82) int -> rcx "CSE - moderate"
+; TEMP_01 double -> [rsp+0x98]
 ;
-; Lcl frame size = 280
+; Lcl frame size = 264
 
 G_M31648_IG01: ;; offset=0000H
        push r15
@@ -126,17 +129,17 @@ G_M31648_IG01: ;; offset=0000H
        push rsi
        push rbp
        push rbx
-       sub rsp, 280
+       sub rsp, 264
        vzeroupper
-       vmovaps xmmword ptr [rsp+100H], xmm6
-       vmovaps xmmword ptr [rsp+F0H], xmm7
-       vmovaps xmmword ptr [rsp+E0H], xmm8
-       vmovaps xmmword ptr [rsp+D0H], xmm9
-       vmovaps xmmword ptr [rsp+C0H], xmm10
-       vmovaps xmmword ptr [rsp+B0H], xmm11
+       vmovaps xmmword ptr [rsp+F0H], xmm6
+       vmovaps xmmword ptr [rsp+E0H], xmm7
+       vmovaps xmmword ptr [rsp+D0H], xmm8
+       vmovaps xmmword ptr [rsp+C0H], xmm9
+       vmovaps xmmword ptr [rsp+B0H], xmm10
+       vmovaps xmmword ptr [rsp+A0H], xmm11
        xor eax, eax
-       mov qword ptr [rsp+80H], rax
        mov qword ptr [rsp+88H], rax
+       mov qword ptr [rsp+90H], rax
        mov rsi, rcx
        mov rbx, rdx
        mov rdi, r8
@@ -162,213 +165,203 @@ G_M31648_IG04: ;; offset=008CH
 G_M31648_IG05: ;; offset=0094H
        vmovsd xmm7, qword ptr [r15+08H]
        vinsertps xmm7, xmm7, dword ptr [r15+10H], 40
-       vmovaps xmm2, xmm7
        vmovsd xmm0, qword ptr [r15+14H]
        vinsertps xmm0, xmm0, dword ptr [r15+1CH], 40
        vxorps xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, dword ptr [rsi+20H]
-       vmovsd xmm3, qword ptr [reloc @RWD00]
-       vmulsd xmm4, xmm1, xmm3
-       vxorps xmm5, xmm5, xmm5
-       vcvtsi2sd xmm5, xmm5, r12d
-       vsubsd xmm4, xmm5, xmm4
+       vmovsd xmm2, qword ptr [reloc @RWD00]
+       vmulsd xmm3, xmm1, xmm2
+       vxorps xmm4, xmm4, xmm4
+       vcvtsi2sd xmm4, xmm4, r12d
+       vsubsd xmm3, xmm4, xmm3
        vmovsd xmm8, qword ptr [reloc @RWD08]
        vmulsd xmm1, xmm1, xmm8
-       vdivsd xmm1, xmm4, xmm1
-       vmovsd xmm4, qword ptr [r15+2CH]
-       vinsertps xmm4, xmm4, dword ptr [r15+34H], 40
+       vdivsd xmm1, xmm3, xmm1
+       vmovsd xmm3, qword ptr [r15+2CH]
+       vinsertps xmm3, xmm3, dword ptr [r15+34H], 40
        vcvtsd2ss xmm1, xmm1, xmm1
        vbroadcastss xmm1, xmm1
-       vmulps xmm1, xmm1, xmm4
-       vxorps xmm4, xmm4, xmm4
-       vcvtsi2sd xmm4, xmm4, dword ptr [rsi+24H]
-       vmulsd xmm3, xmm4, xmm3
-       vsubsd xmm3, xmm6, xmm3
-       vxorps xmm3, xmm3, xmmword ptr [reloc @RWD16]
-       vmulsd xmm4, xmm4, xmm8
-       vdivsd xmm3, xmm3, xmm4
-       vmovsd xmm4, qword ptr [r15+20H]
-       vinsertps xmm4, xmm4, dword ptr [r15+28H], 40
-       vcvtsd2ss xmm3, xmm3, xmm3
-       vbroadcastss xmm3, xmm3
-       vmulps xmm3, xmm3, xmm4
-       vaddps xmm1, xmm1, xmm3
+       vmulps xmm1, xmm1, xmm3
+       vxorps xmm3, xmm3, xmm3
+       vcvtsi2sd xmm3, xmm3, dword ptr [rsi+24H]
+       vmulsd xmm2, xmm3, xmm2
+       vsubsd xmm2, xmm6, xmm2
+       vxorps xmm2, xmm2, xmmword ptr [reloc @RWD16]
+       vmulsd xmm3, xmm3, xmm8
+       vdivsd xmm2, xmm2, xmm3
+       vmovsd xmm3, qword ptr [r15+20H]
+       vinsertps xmm3, xmm3, dword ptr [r15+28H], 40
+       vcvtsd2ss xmm2, xmm2, xmm2
+       vbroadcastss xmm2, xmm2
+       vmulps xmm2, xmm2, xmm3
+       vaddps xmm1, xmm1, xmm2
        vaddps xmm0, xmm0, xmm1
        vdpps xmm1, xmm0, xmm0, 127
        vcvtss2sd xmm1, xmm1, xmm1
        vsqrtsd xmm1, xmm1, xmm1
        vcvtsd2ss xmm1, xmm1, xmm1
-       vxorps xmm3, xmm3, xmm3
-       vucomiss xmm1, xmm3
+       vxorps xmm2, xmm2, xmm2
+       vucomiss xmm1, xmm2
        jp SHORT G_M31648_IG06
        je G_M31648_IG33
-       ;; size=209 bbWeight=208614.94 PerfScore 33708697.29
-G_M31648_IG06: ;; offset=0165H
-       vmovss xmm3, dword ptr [reloc @RWD32]
-       vdivss xmm9, xmm3, xmm1
+       ;; size=205 bbWeight=208614.94 PerfScore 33656543.55
+G_M31648_IG06: ;; offset=0161H
+       vmovss xmm2, dword ptr [reloc @RWD32]
+       vdivss xmm9, xmm2, xmm1
        ;; size=12 bbWeight=208614.94 PerfScore 2711994.21
-G_M31648_IG07: ;; offset=0171H
+G_M31648_IG07: ;; offset=016DH
        vcvtss2sd xmm1, xmm1, xmm9
        vcvtsd2ss xmm1, xmm1, xmm1
        vbroadcastss xmm1, xmm1
        vmulps xmm9, xmm1, xmm0
-       vmovaps xmm0, xmm9
-       vxorps xmm1, xmm1, xmm1
-       vmovdqu xmmword ptr [rsp+90H], xmm1
-       vmovdqu xmmword ptr [rsp+98H], xmm1
-       vmovsd qword ptr [rsp+90H], xmm2
-       vextractps dword ptr [rsp+98H], xmm2, 2
-       vmovsd qword ptr [rsp+9CH], xmm0
-       vextractps dword ptr [rsp+A4H], xmm0, 2
-       vmovdqu xmm0, xmmword ptr [rsp+90H]
-       vmovdqu xmmword ptr [rsp+68H], xmm0
-       mov rax, qword ptr [rsp+A0H]
-       mov qword ptr [rsp+78H], rax
        xor r13, r13
        mov rax, gword ptr [rbx+08H]
-       mov gword ptr [rsp+38H], rax
+       mov gword ptr [rsp+40H], rax
        xor edx, edx
        mov ecx, dword ptr [rax+08H]
-       mov dword ptr [rsp+44H], ecx
+       mov dword ptr [rsp+4CH], ecx
        test ecx, ecx
        jle G_M31648_IG22
-       ;; size=142 bbWeight=208614.94 PerfScore 7579676.13
-G_M31648_IG08: ;; offset=01FFH
-       mov dword ptr [rsp+64H], edx
+       ;; size=47 bbWeight=208614.94 PerfScore 4120145.05
+G_M31648_IG08: ;; offset=019CH
+       mov dword ptr [rsp+6CH], edx
        mov r8d, edx
        mov r8, gword ptr [rax+8*r8+10H]
        mov r9, 0x7FF8687322D8 ; Benchmarks.SIMD.RayTracer.Sphere
        cmp qword ptr [r8], r9
        jne G_M31648_IG15
        ;; size=31 bbWeight=621931.21 PerfScore 4664484.11
-G_M31648_IG09: ;; offset=021EH
-       mov gword ptr [rsp+30H], r8
+G_M31648_IG09: ;; offset=01BBH
+       mov gword ptr [rsp+38H], r8
        vmovsd xmm0, qword ptr [r8+14H]
        vinsertps xmm0, xmm0, dword ptr [r8+1CH], 40
-       vmovaps xmm1, xmm7
-       vsubps xmm10, xmm0, xmm1
-       vmovaps xmm0, xmm9
-       vdpps xmm11, xmm10, xmm0, 127
+       vsubps xmm10, xmm0, xmm7
+       vdpps xmm11, xmm10, xmm9, 127
        vxorps xmm0, xmm0, xmm0
        vucomiss xmm0, xmm11
        ja SHORT G_M31648_IG14
-       ;; size=48 bbWeight=385597.35 PerfScore 10346862.32
-G_M31648_IG10: ;; offset=024EH
+       ;; size=39 bbWeight=385597.35 PerfScore 10154063.64
+G_M31648_IG10: ;; offset=01E2H
        vcvtss2sd xmm0, xmm0, dword ptr [r8+10H]
        vmovaps xmm1, xmm8
        call
-       vmovsd qword ptr [rsp+A8H], xmm0
+       vmovsd qword ptr [rsp+98H], xmm0
        vcvtss2sd xmm0, xmm0, xmm11
        vmovaps xmm1, xmm8
        call
        vdpps xmm1, xmm10, xmm10, 127
        vcvtss2sd xmm1, xmm1, xmm1
        vsubsd xmm0, xmm1, xmm0
-       vmovsd xmm1, qword ptr [rsp+A8H]
+       vmovsd xmm1, qword ptr [rsp+98H]
        vsubsd xmm0, xmm1, xmm0
        vxorps xmm1, xmm1, xmm1
        vucomisd xmm1, xmm0
        ja SHORT G_M31648_IG12
        ;; size=77 bbWeight=327515.07 PerfScore 14028562.26
-G_M31648_IG11: ;; offset=029BH
+G_M31648_IG11: ;; offset=022FH
        vsqrtsd xmm0, xmm0, xmm0
        vcvtsd2ss xmm0, xmm0, xmm0
        vsubss xmm10, xmm11, xmm0
        jmp SHORT G_M31648_IG13
        ;; size=14 bbWeight=109987.30 PerfScore 2309733.36
-G_M31648_IG12: ;; offset=02A9H
+G_M31648_IG12: ;; offset=023DH
        vxorps xmm10, xmm10, xmm10
        ;; size=5 bbWeight=217527.77 PerfScore 72509.26
-G_M31648_IG13: ;; offset=02AEH
+G_M31648_IG13: ;; offset=0242H
        vxorps xmm0, xmm0, xmm0
        vucomiss xmm10, xmm0
        jp SHORT G_M31648_IG18
        jne SHORT G_M31648_IG18
        ;; size=12 bbWeight=385597.35 PerfScore 1670921.87
-G_M31648_IG14: ;; offset=02BAH
+G_M31648_IG14: ;; offset=024EH
        xor r8, r8
        jmp SHORT G_M31648_IG16
        ;; size=5 bbWeight=287322.78 PerfScore 646476.25
-G_M31648_IG15: ;; offset=02BFH
-       vmovdqu xmm0, xmmword ptr [rsp+68H]
-       vmovdqu xmmword ptr [rsp+48H], xmm0
-       mov r9, qword ptr [rsp+78H]
-       mov qword ptr [rsp+58H], r9
+G_M31648_IG15: ;; offset=0253H
+       vmovsd qword ptr [rsp+70H], xmm7
+       vextractps dword ptr [rsp+78H], xmm7, 2
+       vmovsd qword ptr [rsp+7CH], xmm9
+       vextractps dword ptr [rsp+84H], xmm9, 2
+       vmovdqu xmm0, xmmword ptr [rsp+70H]
+       vmovdqu xmmword ptr [rsp+50H], xmm0
+       mov r9, qword ptr [rsp+80H]
+       mov qword ptr [rsp+60H], r9
        mov rcx, r8
-       lea rdx, [rsp+48H]
+       lea rdx, [rsp+50H]
        mov r8, qword ptr [r8]
        mov r8, qword ptr [r8+48H]
        call [r8+20H]
        mov r8, rax
-       ;; size=44 bbWeight=236333.86 PerfScore 3308674.06
-G_M31648_IG16: ;; offset=02EBH
+       ;; size=78 bbWeight=236333.86 PerfScore 5199344.96
+G_M31648_IG16: ;; offset=02A1H
        test r8, r8
        je SHORT G_M31648_IG21
        ;; size=5 bbWeight=621931.21 PerfScore 777414.02
-G_M31648_IG17: ;; offset=02F0H
+G_M31648_IG17: ;; offset=02A6H
        jmp SHORT G_M31648_IG19
        ;; size=2 bbWeight=80248.55 PerfScore 160497.11
-G_M31648_IG18: ;; offset=02F2H
+G_M31648_IG18: ;; offset=02A8H
        mov rcx, 0x7FF86873C618 ; Benchmarks.SIMD.RayTracer.ISect
        call CORINFO_HELP_NEWSFAST
        mov r8, rax
-       mov gword ptr [rsp+28H], r8
+       mov gword ptr [rsp+30H], r8
        lea rcx, bword ptr [r8+08H]
-       mov rdx, gword ptr [rsp+30H]
+       mov rdx, gword ptr [rsp+38H]
        call CORINFO_HELP_ASSIGN_REF
-       mov r8, gword ptr [rsp+28H]
-       vmovdqu xmm0, xmmword ptr [rsp+68H]
-       vmovdqu xmmword ptr [r8+18H], xmm0
-       mov rcx, qword ptr [rsp+78H]
-       mov qword ptr [r8+28H], rcx
+       mov r8, gword ptr [rsp+30H]
+       lea rcx, bword ptr [r8+18H]
+       vmovsd qword ptr [rcx], xmm7
+       vextractps dword ptr [rcx+08H], xmm7, 2
+       vmovsd qword ptr [rcx+0CH], xmm9
+       vextractps dword ptr [rcx+14H], xmm9, 2
        vcvtss2sd xmm0, xmm0, xmm10
        vmovsd qword ptr [r8+10H], xmm0
        jmp SHORT G_M31648_IG16
-       ;; size=76 bbWeight=385597.35 PerfScore 8097544.42
-G_M31648_IG19: ;; offset=033EH
+       ;; size=82 bbWeight=385597.35 PerfScore 10218329.86
+G_M31648_IG19: ;; offset=02FAH
        test r13, r13
        jne SHORT G_M31648_IG25
        ;; size=5 bbWeight=80248.55 PerfScore 100310.69
-G_M31648_IG20: ;; offset=0343H
+G_M31648_IG20: ;; offset=02FFH
        mov r13, r8
        ;; size=3 bbWeight=78703.22 PerfScore 19675.80
-G_M31648_IG21: ;; offset=0346H
-       mov edx, dword ptr [rsp+64H]
+G_M31648_IG21: ;; offset=0302H
+       mov edx, dword ptr [rsp+6CH]
        inc edx
-       mov eax, dword ptr [rsp+44H]
+       mov eax, dword ptr [rsp+4CH]
        cmp eax, edx
        jg SHORT G_M31648_IG24
        ;; size=14 bbWeight=621931.21 PerfScore 2176759.25
-G_M31648_IG22: ;; offset=0354H
+G_M31648_IG22: ;; offset=0310H
        mov r8, r13
        test r8, r8
        jne SHORT G_M31648_IG26
        ;; size=8 bbWeight=208614.94 PerfScore 312922.41
-G_M31648_IG23: ;; offset=035CH
+G_M31648_IG23: ;; offset=0318H
        vxorps xmm0, xmm0, xmm0
-       vmovaps xmmword ptr [rsp+80H], xmm0
+       vmovups xmmword ptr [rsp+88H], xmm0
        jmp SHORT G_M31648_IG27
        ;; size=15 bbWeight=79255.46 PerfScore 264184.86
-G_M31648_IG24: ;; offset=036BH
-       mov rax, gword ptr [rsp+38H]
+G_M31648_IG24: ;; offset=0327H
+       mov rax, gword ptr [rsp+40H]
        jmp G_M31648_IG08
        ;; size=10 bbWeight=310965.61 PerfScore 932896.82
-G_M31648_IG25: ;; offset=0375H
+G_M31648_IG25: ;; offset=0331H
        vmovsd xmm0, qword ptr [r13+10H]
        vucomisd xmm0, qword ptr [r8+10H]
        jbe SHORT G_M31648_IG21
        jmp SHORT G_M31648_IG20
        ;; size=16 bbWeight=1546.37 PerfScore 18556.45
-G_M31648_IG26: ;; offset=0385H
+G_M31648_IG26: ;; offset=0341H
        xor edx, edx
        mov dword ptr [rsp+20H], edx
-       lea rdx, [rsp+80H]
+       lea rdx, [rsp+88H]
        mov rcx, rsi
        mov r9, rbx
        call []
        ;; size=26 bbWeight=129359.48 PerfScore 679137.28
-G_M31648_IG27: ;; offset=039FH
-       vmovaps xmm0, xmmword ptr [rsp+80H]
+G_M31648_IG27: ;; offset=035BH
+       vmovups xmm0, xmmword ptr [rsp+88H]
        vunpckhps xmm1, xmm0, xmm0
        vmovss xmm2, dword ptr [reloc @RWD36]
        vmulss xmm1, xmm1, xmm2
@@ -376,14 +369,14 @@ G_M31648_IG27: ;; offset=039FH
        cmp eax, 255
        jg G_M31648_IG34
        ;; size=40 bbWeight=208614.94 PerfScore 3598607.70
-G_M31648_IG28: ;; offset=03C7H
+G_M31648_IG28: ;; offset=0383H
        vmovshdup xmm1, xmm0
        vmulss xmm1, xmm1, xmm2
        vcvttss2si edx, xmm1
        cmp edx, 255
        jg G_M31648_IG35
        ;; size=24 bbWeight=208614.94 PerfScore 2346918.07
-G_M31648_IG29: ;; offset=03DFH
+G_M31648_IG29: ;; offset=039BH
        shl edx, 8
        or edx, eax
        vmulss xmm2, xmm0, xmm2
@@ -391,7 +384,7 @@ G_M31648_IG29: ;; offset=03DFH
        cmp eax, 255
        jg G_M31648_IG36
        ;; size=24 bbWeight=208614.94 PerfScore 2294764.33
-G_M31648_IG30: ;; offset=03F7H
+G_M31648_IG30: ;; offset=03B3H
        lea ecx, [r12+r14]
        cmp ecx, dword ptr [rdi+08H]
        jae G_M31648_IG37
@@ -403,19 +396,19 @@ G_M31648_IG30: ;; offset=03F7H
        cmp r12d, dword ptr [rsi+20H]
        jl G_M31648_IG05
        ;; size=41 bbWeight=208614.94 PerfScore 2242610.60
-G_M31648_IG31: ;; offset=0420H
+G_M31648_IG31: ;; offset=03DCH
        inc ebp
        cmp ebp, dword ptr [rsi+24H]
        jl G_M31648_IG03
        ;; size=11 bbWeight=999.00 PerfScore 4245.75
-G_M31648_IG32: ;; offset=042BH
-       vmovaps xmm6, xmmword ptr [rsp+100H]
-       vmovaps xmm7, xmmword ptr [rsp+F0H]
-       vmovaps xmm8, xmmword ptr [rsp+E0H]
-       vmovaps xmm9, xmmword ptr [rsp+D0H]
-       vmovaps xmm10, xmmword ptr [rsp+C0H]
-       vmovaps xmm11, xmmword ptr [rsp+B0H]
-       add rsp, 280
+G_M31648_IG32: ;; offset=03E7H
+       vmovaps xmm6, xmmword ptr [rsp+F0H]
+       vmovaps xmm7, xmmword ptr [rsp+E0H]
+       vmovaps xmm8, xmmword ptr [rsp+D0H]
+       vmovaps xmm9, xmmword ptr [rsp+C0H]
+       vmovaps xmm10, xmmword ptr [rsp+B0H]
+       vmovaps xmm11, xmmword ptr [rsp+A0H]
+       add rsp, 264
        pop rbx
        pop rbp
        pop rsi
@@ -426,23 +419,23 @@ G_M31648_IG32: ;; offset=042BH
        pop r15
        ret
        ;; size=74 bbWeight=1.00 PerfScore 29.25
-G_M31648_IG33: ;; offset=0475H
+G_M31648_IG33: ;; offset=0431H
        vmovss xmm9, dword ptr [reloc @RWD40]
        jmp G_M31648_IG07
        ;; size=13 bbWeight=0 PerfScore 0.00
-G_M31648_IG34: ;; offset=0482H
+G_M31648_IG34: ;; offset=043EH
        mov eax, 255
        jmp G_M31648_IG28
        ;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG35: ;; offset=048CH
+G_M31648_IG35: ;; offset=0448H
        mov edx, 255
        jmp G_M31648_IG29
        ;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG36: ;; offset=0496H
+G_M31648_IG36: ;; offset=0452H
        mov eax, 255
        jmp G_M31648_IG30
        ;; size=10 bbWeight=0 PerfScore 0.00
-G_M31648_IG37: ;; offset=04A0H
+G_M31648_IG37: ;; offset=045CH
        call CORINFO_HELP_RNGCHKFAIL
        int3
        ;; size=6 bbWeight=0 PerfScore 0.00
@@ -454,11 +447,11 @@ RWD36 dd 437F0000h ; 255
 RWD40 dd 7F800000h ; inf
 
-; Total bytes of code 1190, prolog size 103, PerfScore 105089852.53, instruction count 255, allocated bytes for code 1190 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this
+; Total bytes of code 1122, prolog size 103, PerfScore 105396818.57, instruction count 245, allocated bytes for code 1122 (MethodHash=adae845f) for method Benchmarks.SIMD.RayTracer.RayTracer:RenderSequential(Benchmarks.SIMD.RayTracer.Scene,int[]):this
```

This is a case where our lack of handling for call args shows up. We end up with an extra struct copy in `G_M31648_IG15` because physical promotion inserts a writeback into the struct local, and then call args morphing creates a copy of it since it isn't a last use.
One simple fix in physical promotion for the implicit byref case would be to create a new local to ensure that it is a last use; we can then handle it by our smarter decomposition. That might be a good short-term solution with large benefit. There is also a redundant `lea` instruction in `G_M31648_IG18`; we are not able to peel the address because it is a `FIELD_ADDR` node. Also, the copy itself needs to use `vextractps` since it is `TYP_SIMD12`.
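
To illustrate the last point, here is a small byte-level model of the 24-byte struct (two 12-byte vectors at offsets 0 and 12, matching the `V36.[000..012)` and `V36.[012..024)` fields in the dump). It is only a sketch of the presumable reason for the split store: a full 16-byte store of the first field would overwrite the first lane of the second, so the writeback is done as an 8-byte store plus a 4-byte store, which is the `vmovsd` + `vextractps` pair in the diff. The helper names are hypothetical and exist only for this example.

```python
import struct


def write_simd12(buf: bytearray, offset: int, xyz: tuple) -> None:
    """Store a 3-float vector as an 8-byte store followed by a 4-byte store."""
    x, y, z = xyz
    struct.pack_into("<2f", buf, offset, x, y)   # analogous to vmovsd qword ptr [...]
    struct.pack_into("<f", buf, offset + 8, z)   # analogous to vextractps dword ptr [...+8], xmm, 2


def write_simd16(buf: bytearray, offset: int, xyz: tuple) -> None:
    """Naive full-width store: the fourth lane lands past the 12-byte field."""
    x, y, z = xyz
    struct.pack_into("<4f", buf, offset, x, y, z, 0.0)


def read_fields(buf: bytearray) -> tuple:
    """Read back both 12-byte fields of the 24-byte struct."""
    return struct.unpack_from("<3f", buf, 0), struct.unpack_from("<3f", buf, 12)
```

Writing both fields with `write_simd12` leaves them intact, whereas rewriting the first field with `write_simd16` zeroes the first lane of the second field.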
