Potentially bad loop invariant code motion for vectors

kg commented 1 week ago

Description

If you have a loop that does Vector128.Create(scalar) each iteration on purpose, RyuJIT can LICM the vector constant out of the loop. This sounds great in theory, but if you look at the generated code, this can move the cached vpbroadcastb result into xmm6 or a similar nonvolatile register, which can force your method to spill the previous value of xmm6 to the stack on entry and restore it on exit. In my real-world scenario, without this LICM the stack isn't used other than some regular register pushes at entry.

Simple method calls or method-call-free loop bodies don't seem to trigger this, RyuJIT will happily use the volatile xmm registers instead and everything is good. An IEqualityComparer.Equals call inside the loop body is enough to force use of xmm6, though.

Reproduction Steps

This demonstrates the LICM in a toy scenario. For a real-world scenario, https://github.com/kg/SimdDictionary/blob/e42d42f7158d2552c97e1a517b82d4bb32750456/DisasmHarness/DisasmHarness.cs#L14 when disasmo'd will demonstrate it.

    public static IEqualityComparer<byte> Comparer = EqualityComparer<byte>.Default;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void VectorLicm (byte scalar) {
        int result = 0;
        for (int i = 0; i < 256; i++) {
            var mask = Vector128.Equals(Vector128.Create(scalar), Vector128.Create(unchecked((byte)i)));
            if (!Comparer.Equals(mask.ToScalar(), 0))
                result++;
        }
    }

Expected behavior

Neither of the vpbroadcastb's should be LICM'd out of the loop if they end up in a nonvolatile register IMO, since doing that forces a stack spill at entry and a stack restore at exit. LICMing them into volatile registers feels fine.

Actual behavior

The provably-constant vpbroadcastb is LICM'd out of the loop, which is correct but potentially deleterious for performance

; Assembly listing for method DisasmHarness:VectorLicm(ubyte) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 48
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0x20], xmm6

G_M000_IG02:
       xor      ebx, ebx
       xor      esi, esi
       movzx    rdx, cl
       vpbroadcastb  xmm6, edx
       mov      rdi, 0xD1FFAB1E

G_M000_IG03:
       movzx    rdx, sil
       vpbroadcastb  xmm0, edx
       vpcmpeqb xmm0, xmm6, xmm0
       vmovd    edx, xmm0
       movzx    rdx, dl
       mov      rcx, gword ptr [rdi]
       mov      r11, 0xD1FFAB1E
       xor      r8d, r8d
       call     [r11]System.Collections.Generic.IEqualityComparer`1[ubyte]:Equals(ubyte,ubyte):bool:this
       test     eax, eax
       jne      SHORT G_M000_IG05

G_M000_IG04:
       inc      ebx

G_M000_IG05:
       inc      esi
       cmp      esi, 256
       jl       SHORT G_M000_IG03

G_M000_IG06:
       vmovaps  xmm6, xmmword ptr [rsp+0x20]
       add      rsp, 48
       pop      rbx
       pop      rsi
       pop      rdi
       ret      

; Total bytes of code 109

Regression?

Don't know. Can't test since my real use case won't work on NET7.

Known Workarounds

EDIT: Changing the type of the scalar parameter to in byte prevents the broadcast from being LICM'd. Putting a Thread.MemoryBarrier doesn't stop the broadcast from moving out of the loop (past the barrier), which is a little surprising, but probably completely safe. You can also block LICM by falsely taking the address of the local containing the scalar, i.e.

...
ref var temp = ref scalar;
...
return temp;

EDIT 2: Inserting a Sse2.MemoryFence() right before the broadcast operation stops it from being LICM'd, but it's pretty clear that it isn't a good solution.

EDIT 3: A Sse.Prefetch0(...) also works as a LICM barrier but seems to generate extra wasted code to calculate the address to prefetch for some reason.

Configuration

Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.11.3

PS C:\Users\kg> dotnet --info
.NET SDK:
 Version:           8.0.400
 Commit:            36fe6dda56
 Workload version:  8.0.400-manifests.6c274a57
 MSBuild version:   17.11.3+0c8610977

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22631
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.400\

Other information

This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse. The code I'm optimizing typically runs its loop body 0-1 times, so the LICM causes a performance hit.

EgorBo commented 1 week ago

LICMing them into volatile registers feels fine.

My understanding that you can't use volatile registers here since you have a call in the loop, so you will have to do spill inside the loop.

This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse.

Well, you're trading an instruction with Latency=3 (e.g. Tiger Lake) in a loop vs a single stack spill/reload, I am not so sure it's cheaper, especially, for a long-running loop. cc @dotnet/jit-contrib

PS: if you check codegen for Linux (SysV 64 ABI) it will emit what you expect due to lack of callee-saved float regs

AndyAyersMS commented 1 week ago

With PGO we GVD the call, but don't clone the loop because the GCV test is not loop invariant, the dispatch is via a static. Seems like a missed opportunity, if we cloned, the fast loop would have no synchronization points and we could then CSE the GDV and end up with a fast loop with no calls.

However we'd still have the slow loop with the call; this would likely lead to the same spill unless we're less aggressive with "cheap" LICM in the slow loop (which seems possible/reasonable). If we are too aggressive this might make a case for thinking more seriously about shrink-wrapping (deferring prolog saves until "needed"). For FP/Vector perhaps it is not too bad?

I suspect the full example is different and we may in fact clone, so would be good to study that more closely.

kg commented 1 week ago

Here's the tier 0 / tier 1 disasmo from the full example on NET8 with tiering + PGO enabled. In this case, it's SimdDictionary<string, long>. Some various optimizations I did have managed to get the stack spills out of the loop itself, so the only real use of the stack is the xmm6 save/restore upon entry and exit.

; Assembly listing for method DisasmHarness:TryGetValue():bool (Instrumented Tier0)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Instrumented Tier0 code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       xor      eax, eax
       mov      qword ptr [rbp-0x08], rax

G_M000_IG02:                ;; offset=0x0010
       mov      rcx, 0x1C6D4C01D60
       mov      rcx, gword ptr [rcx]
       mov      rdx, 0x1C6D4C01D68
       mov      rdx, gword ptr [rdx]
       lea      r8, [rbp-0x08]
       cmp      dword ptr [rcx], ecx
       call     [SimdDictionary.SimdDictionary`2[System.__Canon,long]:TryGetValue(System.__Canon,byref):bool:this]
       nop      

G_M000_IG03:                ;; offset=0x0037
       add      rsp, 48
       pop      rbp
       ret      

; Total bytes of code 61

; Assembly listing for method DisasmHarness:TryGetValue():bool (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 5 inlinees with PGO data; 10 single block inlinees; 3 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 56
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0x20], xmm6 ;; making space for the search vector that got LICM'd

G_M000_IG02:                ;; offset=0x0019
       mov      rcx, 0x1C6D4C01D60
       mov      rbx, gword ptr [rcx]
       mov      rcx, 0x1C6D4C01D68
       mov      rsi, gword ptr [rcx]
       mov      rdi, gword ptr [rbx+0x08]
       mov      rcx, rdi
       mov      rdx, rsi
       mov      r11, 0x7FFA3D2F0030
       call     [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:GetHashCode(System.__Canon):int:this
       mov      r8, gword ptr [rbx+0x10]
       test     r8, r8
       je       G_M000_IG17

G_M000_IG03:                ;; offset=0x0057
       lea      rbp, bword ptr [r8+0x10]
       mov      r14d, dword ptr [r8+0x08]

G_M000_IG04:                ;; offset=0x005F
       mov      r8d, eax
       imul     r8, qword ptr [rbx+0x18] ;; fastmod to turn hash code into bucket index
       shr      r8, 32
       inc      r8
       mov      ecx, r14d
       imul     r8, rcx
       shr      r8, 32
       sub      r14d, r8d
       dec      r14d
       movsxd   r8, r8d
       imul     r15, r8, 240 ;; turning bucket index into address of first bucket
       add      r15, rbp
       mov      r13, r15
       shr      eax, 24 ;; compute hash suffix
       movzx    r8, al
       mov      ecx, 255 ;; cmov to turn hash suffix into 255 if it's 0
       test     r8d, r8d
       cmovne   ecx, r8d
       vpbroadcastb  xmm6, ecx ;; compute search vector from suffix (this was LICM'd)

G_M000_IG05:                ;; offset=0x00A8
       movzx    r12, byte  ptr [r15+0x0E] ;; start of loop body; read count byte from bucket
       vpcmpeqb xmm0, xmm6, xmmword ptr [r15] ;; search bucket suffixes for suffix
       vpmovmskb r8d, xmm0
       tzcnt    r8d, r8d ;; convert search result vector to index of first (if any) result
       mov      rcx, r15
       sub      r12d, r8d
       test     r12d, r12d
       jg       SHORT G_M000_IG10

G_M000_IG06:                ;; offset=0x00C6
       xor      rbx, rbx

G_M000_IG07:                ;; offset=0x00C8
       test     rbx, rbx
       je       SHORT G_M000_IG13

G_M000_IG08:                ;; offset=0x00CD
       xor      eax, eax
       test     rbx, rbx
       setne    al

G_M000_IG09:                ;; offset=0x00D5
       vmovaps  xmm6, xmmword ptr [rsp+0x20] ;; restore xmm6 due to LICM
       add      rsp, 56
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       ret      

G_M000_IG10:                ;; offset=0x00EC
       movsxd   r8, r8d
       shl      r8, 4
       lea      rbx, bword ptr [rcx+r8+0x10]

G_M000_IG11:                ;; offset=0x00F8
       mov      r8, gword ptr [rbx]
       mov      rcx, rdi
       mov      rdx, rsi
       mov      r11, 0x7FFA3D2F0038 ;; the method code below appears to force use of xmm6, which makes sense
       call     [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:Equals(System.__Canon,System.__Canon):bool:this
       test     eax, eax
       je       SHORT G_M000_IG18

G_M000_IG12:                ;; offset=0x0112
       jmp      SHORT G_M000_IG07

G_M000_IG13:                ;; offset=0x0114
       cmp      byte  ptr [r15+0x0F], 0
       je       SHORT G_M000_IG16
       test     r14d, r14d
       jg       SHORT G_M000_IG14
       mov      r15, rbp
       mov      r14d, 0x7FFFFFFF
       jmp      SHORT G_M000_IG15

G_M000_IG14:                ;; offset=0x012B
       add      r15, 240
       dec      r14d

G_M000_IG15:                ;; offset=0x0135
       cmp      r15, r13
       jne      G_M000_IG05

G_M000_IG16:                ;; offset=0x013E
       xor      rbx, rbx
       jmp      SHORT G_M000_IG08

G_M000_IG17:                ;; offset=0x0142
       xor      rbp, rbp
       xor      r14d, r14d
       jmp      G_M000_IG04

G_M000_IG18:                ;; offset=0x014C
       dec      r12d
       je       G_M000_IG06
       add      rbx, 16
       jmp      SHORT G_M000_IG11

; Total bytes of code 347

AndyAyersMS commented 1 week ago

Still surprised we don't clone for the inner loop GDV.

Also don't like seeing the loop split by the epilog. This may be fixed in 9...

kg commented 1 week ago

I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there.

JulieLeeMSFT commented 5 days ago

@kg, please get back to us how it looks with 9.

I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there.

kg commented 5 days ago

I've been unable to get the LICM to happen inside a BDN benchmark harness w/DisassemblyDiagnoser, and I couldn't figure out how to get disasmo working with the .NET 9 preview. Is there an alternate method I should be using to get accurate disassembly from the 9 preview to compare with 8?

Regardless, it looks like 9's codegen for this scenario might be a little better, as the performance is improved. However, I can't get disassembly captured for the 9 preview unless I set AggressiveInlining for all of the code, which means the lookups get woven in to the benchmark loop and the result integrity check, which may be what's throwing things off. It's also possible PGO/tiering is the issue here, and the codegen is being influenced by the fact that the benchmark always finds the key it's looking for.

Attaching a ZIP that contains the DisassemblyDiagnoser output for a <string, long> lookup w/default comparer on 8.0 and 9.0p1. Benchmarks.StringLookup-report.zip

The code size is much smaller than I expected, so I'm not sure what's going on with that either. In disasmo for net8 I get 562 bytes of generated code, vs the 472 reported here.

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
FindExistingSIMD	.NET 8.0	511.6 μs	9.33 μs	8.73 μs	1.00	0.02	472 B
FindExistingSIMD	.NET 9.0	487.8 μs	8.96 μs	8.38 μs	0.95	0.02	467 B

kg commented 4 days ago

Still surprised we don't clone for the inner loop GDV.

Also don't like seeing the loop split by the epilog. This may be fixed in 9...

Re: the loop split, it looks like this still happens in 9. I tried changing the disassembly harness so that the TryGetValue call alternates between succeeding and failing, and that seems to have moved the epilog further down towards the end of the method in NET8, so I think it comes down to the tiering/PGO having previously seen that the path taken through the method was always the same (since I was always looking up the same key in the disasmo harness, unlike the BDN benchmarks.)

I'll note that once I changed it to call TryGetValue twice per iteration, it added a message like this to the disassembly. I don't know if it means anything or not but I figured I'd cite it here. "edge weights are invalid" seems weird.

; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 2110
; 10 inlinees with PGO data; 20 single block inlinees; 6 inlinees without PGO data

AndyAyersMS commented 4 days ago

The "edge weights are invalid" is a .NET 8 and earlier thing. Until recently the JIT would try and derive flow graph edge weights from block weights and (if things were sufficiently messy) might fail to find a suitable set of weights and so declare them invalid. This is gone in .NET 9.

It's possible that the PGO data here is a bit thin. Since your benchmark method has a loop it may inspire OSR and depending on how BDN stages things the run may not get too many cycles in Tier0+instr and/or not see profile data for inlinees. OSR+PGO interaction is not as smooth as we'd like.

If you're on windows you can (as admin) run BDN with -p ETW (assuming you're using BenchmarkSwitcher) and pass the resulting file to jitutils/instructionsretiredeexplorer with the -benchmark option to see if any BDN measurement intervals involved OSR code.

dotnet-policy-service[bot] commented 3 days ago

This issue has been marked needs-author-action and may be missing some important information.

dotnet / runtime