Open kg opened 1 week ago
LICMing them into volatile registers feels fine.
My understanding is that you can't use volatile registers here since you have a call in the loop, so you would have to spill inside the loop.
This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse.
Well, you're trading an instruction with Latency=3 (e.g. Tiger Lake) in a loop vs a single stack spill/reload; I am not so sure it's cheaper, especially for a long-running loop. cc @dotnet/jit-contrib
PS: if you check codegen for Linux (SysV 64 ABI) it will emit what you expect due to lack of callee-saved float regs
With PGO we GDV the call, but don't clone the loop because the GDV test is not loop invariant: the dispatch is via a static. Seems like a missed opportunity; if we cloned, the fast loop would have no synchronization points and we could then CSE the GDV and end up with a fast loop with no calls.
However we'd still have the slow loop with the call; this would likely lead to the same spill unless we're less aggressive with "cheap" LICM in the slow loop (which seems possible/reasonable). If we are too aggressive this might make a case for thinking more seriously about shrink-wrapping (deferring prolog saves until "needed"). For FP/Vector perhaps it is not too bad?
I suspect the full example is different and we may in fact clone, so would be good to study that more closely.
Here's the tier 0 / tier 1 disasmo from the full example on NET8 with tiering + PGO enabled. In this case, it's SimdDictionary<string, long>. Some optimizations I made have managed to get the stack spills out of the loop itself, so the only real use of the stack is the xmm6 save/restore upon entry and exit.
; Assembly listing for method DisasmHarness:TryGetValue():bool (Instrumented Tier0)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Instrumented Tier0 code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
push rbp
sub rsp, 48
lea rbp, [rsp+0x30]
xor eax, eax
mov qword ptr [rbp-0x08], rax
G_M000_IG02: ;; offset=0x0010
mov rcx, 0x1C6D4C01D60
mov rcx, gword ptr [rcx]
mov rdx, 0x1C6D4C01D68
mov rdx, gword ptr [rdx]
lea r8, [rbp-0x08]
cmp dword ptr [rcx], ecx
call [SimdDictionary.SimdDictionary`2[System.__Canon,long]:TryGetValue(System.__Canon,byref):bool:this]
nop
G_M000_IG03: ;; offset=0x0037
add rsp, 48
pop rbp
ret
; Total bytes of code 61
; Assembly listing for method DisasmHarness:TryGetValue():bool (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 5 inlinees with PGO data; 10 single block inlinees; 3 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp, 56
vzeroupper
vmovaps xmmword ptr [rsp+0x20], xmm6 ;; making space for the search vector that got LICM'd
G_M000_IG02: ;; offset=0x0019
mov rcx, 0x1C6D4C01D60
mov rbx, gword ptr [rcx]
mov rcx, 0x1C6D4C01D68
mov rsi, gword ptr [rcx]
mov rdi, gword ptr [rbx+0x08]
mov rcx, rdi
mov rdx, rsi
mov r11, 0x7FFA3D2F0030
call [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:GetHashCode(System.__Canon):int:this
mov r8, gword ptr [rbx+0x10]
test r8, r8
je G_M000_IG17
G_M000_IG03: ;; offset=0x0057
lea rbp, bword ptr [r8+0x10]
mov r14d, dword ptr [r8+0x08]
G_M000_IG04: ;; offset=0x005F
mov r8d, eax
imul r8, qword ptr [rbx+0x18] ;; fastmod to turn hash code into bucket index
shr r8, 32
inc r8
mov ecx, r14d
imul r8, rcx
shr r8, 32
sub r14d, r8d
dec r14d
movsxd r8, r8d
imul r15, r8, 240 ;; turning bucket index into address of first bucket
add r15, rbp
mov r13, r15
shr eax, 24 ;; compute hash suffix
movzx r8, al
mov ecx, 255 ;; cmov to turn hash suffix into 255 if it's 0
test r8d, r8d
cmovne ecx, r8d
vpbroadcastb xmm6, ecx ;; compute search vector from suffix (this was LICM'd)
G_M000_IG05: ;; offset=0x00A8
movzx r12, byte ptr [r15+0x0E] ;; start of loop body; read count byte from bucket
vpcmpeqb xmm0, xmm6, xmmword ptr [r15] ;; search bucket suffixes for suffix
vpmovmskb r8d, xmm0
tzcnt r8d, r8d ;; convert search result vector to index of first (if any) result
mov rcx, r15
sub r12d, r8d
test r12d, r12d
jg SHORT G_M000_IG10
G_M000_IG06: ;; offset=0x00C6
xor rbx, rbx
G_M000_IG07: ;; offset=0x00C8
test rbx, rbx
je SHORT G_M000_IG13
G_M000_IG08: ;; offset=0x00CD
xor eax, eax
test rbx, rbx
setne al
G_M000_IG09: ;; offset=0x00D5
vmovaps xmm6, xmmword ptr [rsp+0x20] ;; restore xmm6 due to LICM
add rsp, 56
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
G_M000_IG10: ;; offset=0x00EC
movsxd r8, r8d
shl r8, 4
lea rbx, bword ptr [rcx+r8+0x10]
G_M000_IG11: ;; offset=0x00F8
mov r8, gword ptr [rbx]
mov rcx, rdi
mov rdx, rsi
mov r11, 0x7FFA3D2F0038 ;; the method code below appears to force use of xmm6, which makes sense
call [r11]System.Collections.Generic.IEqualityComparer`1[System.__Canon]:Equals(System.__Canon,System.__Canon):bool:this
test eax, eax
je SHORT G_M000_IG18
G_M000_IG12: ;; offset=0x0112
jmp SHORT G_M000_IG07
G_M000_IG13: ;; offset=0x0114
cmp byte ptr [r15+0x0F], 0
je SHORT G_M000_IG16
test r14d, r14d
jg SHORT G_M000_IG14
mov r15, rbp
mov r14d, 0x7FFFFFFF
jmp SHORT G_M000_IG15
G_M000_IG14: ;; offset=0x012B
add r15, 240
dec r14d
G_M000_IG15: ;; offset=0x0135
cmp r15, r13
jne G_M000_IG05
G_M000_IG16: ;; offset=0x013E
xor rbx, rbx
jmp SHORT G_M000_IG08
G_M000_IG17: ;; offset=0x0142
xor rbp, rbp
xor r14d, r14d
jmp G_M000_IG04
G_M000_IG18: ;; offset=0x014C
dec r12d
je G_M000_IG06
add rbx, 16
jmp SHORT G_M000_IG11
; Total bytes of code 347
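For readers following the annotations in IG04, here is a minimal C# sketch of the arithmetic they describe. It is reconstructed from the listing's comments (the imul / shr 32 / inc / imul / shr 32 sequence and the cmovne that remaps a zero suffix), not taken from SimdDictionary's source. The names `BucketIndexSketch`, `GetFastModMultiplier`, `FastMod`, and `GetSuffix` are hypothetical.

```csharp
using System;

public static class BucketIndexSketch
{
    // Assumption: the multiplier is precomputed once per table size.
    // Valid for 0 < divisor <= int.MaxValue.
    public static ulong GetFastModMultiplier(uint divisor) =>
        ulong.MaxValue / divisor + 1;

    // Computes value % divisor without a division instruction.
    // Shape mirrors the multiply/shift/increment/multiply/shift sequence in IG04.
    public static uint FastMod(uint value, uint divisor, ulong multiplier) =>
        (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);

    // Takes the top 8 bits of the hash code as the suffix and remaps 0 to 255,
    // as the "shr eax, 24" and "cmovne" annotations describe.
    public static byte GetSuffix(uint hashCode)
    {
        byte suffix = (byte)(hashCode >> 24);
        return suffix == 0 ? (byte)255 : suffix;
    }
}
```

The suffix is the value that gets broadcast into the search vector, so it is computed once per lookup and is invariant across the bucket loop.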
Still surprised we don't clone for the inner loop GDV.
Also don't like seeing the loop split by the epilog. This may be fixed in 9...
I'll run some tests with 9 soon, and also run some tests on ARM64 to see how our codegen looks there.
@kg, please let us know how it looks with 9.
I've been unable to get the LICM to happen inside a BDN benchmark harness w/DisassemblyDiagnoser, and I couldn't figure out how to get disasmo working with the .NET 9 preview. Is there an alternate method I should be using to get accurate disassembly from the 9 preview to compare with 8?
Regardless, it looks like 9's codegen for this scenario might be a little better, as the performance is improved. However, I can't get disassembly captured for the 9 preview unless I set AggressiveInlining for all of the code, which means the lookups get woven into the benchmark loop and the result integrity check, which may be what's throwing things off. It's also possible PGO/tiering is the issue here, and the codegen is being influenced by the fact that the benchmark always finds the key it's looking for.
Attaching a ZIP that contains the DisassemblyDiagnoser output for a <string, long> lookup w/default comparer on 8.0 and 9.0p1.
Benchmarks.StringLookup-report.zip
The code size is much smaller than I expected, so I'm not sure what's going on with that either. In disasmo for net8 I get 562 bytes of generated code, vs the 472 reported here.
| Method | Runtime | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|---|---|---|---|---|---|---|---|
| FindExistingSIMD | .NET 8.0 | 511.6 μs | 9.33 μs | 8.73 μs | 1.00 | 0.02 | 472 B |
| FindExistingSIMD | .NET 9.0 | 487.8 μs | 8.96 μs | 8.38 μs | 0.95 | 0.02 | 467 B |
Still surprised we don't clone for the inner loop GDV.
Also don't like seeing the loop split by the epilog. This may be fixed in 9...
Re: the loop split, it looks like this still happens in 9. I tried changing the disassembly harness so that the TryGetValue call alternates between succeeding and failing, and that seems to have moved the epilog further down towards the end of the method in NET8, so I think it comes down to the tiering/PGO having previously seen that the path taken through the method was always the same (since I was always looking up the same key in the disasmo harness, unlike the BDN benchmarks.)
I'll note that once I changed it to call TryGetValue twice per iteration, it added a message like this to the disassembly. I don't know if it means anything or not but I figured I'd cite it here. "edge weights are invalid" seems weird.
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 2110
; 10 inlinees with PGO data; 20 single block inlinees; 6 inlinees without PGO data
The "edge weights are invalid" is a .NET 8 and earlier thing. Until recently the JIT would try and derive flow graph edge weights from block weights and (if things were sufficiently messy) might fail to find a suitable set of weights and so declare them invalid. This is gone in .NET 9.
It's possible that the PGO data here is a bit thin. Since your benchmark method has a loop it may inspire OSR and depending on how BDN stages things the run may not get too many cycles in Tier0+instr and/or not see profile data for inlinees. OSR+PGO interaction is not as smooth as we'd like.
If you're on Windows you can (as admin) run BDN with -p ETW (assuming you're using BenchmarkSwitcher) and pass the resulting file to jitutils/instructionsretiredeexplorer with the -benchmark option to see if any BDN measurement intervals involved OSR code.
This issue has been marked needs-author-action and may be missing some important information.
Description
If you have a loop that does Vector128.Create(scalar) each iteration on purpose, RyuJIT can LICM the vector constant out of the loop. This sounds great in theory, but if you look at the generated code, this can move the cached vpbroadcastb result into xmm6 or a similar nonvolatile register, which can force your method to spill the previous value of xmm6 to the stack on entry and restore it on exit. In my real-world scenario, without this LICM the stack isn't used other than some regular register pushes at entry.
Simple method calls or method-call-free loop bodies don't seem to trigger this; RyuJIT will happily use the volatile xmm registers instead and everything is good. An IEqualityComparer.Equals call inside the loop body is enough to force use of xmm6, though.
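To make the pattern concrete, here is a minimal sketch of the kind of loop described above. It is not the repro from this issue; `LicmSketch` and `CountMatches` are hypothetical names, and whether the JIT actually hoists the broadcast into xmm6 depends on the runtime version and target ABI.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Intrinsics;

public static class LicmSketch
{
    // Hypothetical helper: scans 16-byte buckets for `scalar` and confirms each
    // candidate bucket with a comparer call. Assumes keys has one entry per bucket.
    public static int CountMatches<T>(
        ReadOnlySpan<byte> buckets, byte scalar, T key, T[] keys,
        IEqualityComparer<T> comparer)
    {
        int hits = 0;
        for (int i = 0; i + 16 <= buckets.Length; i += 16)
        {
            // Created inside the loop on purpose. The value is loop invariant,
            // so the JIT may hoist the broadcast above the loop.
            Vector128<byte> search = Vector128.Create(scalar);
            Vector128<byte> bucket = Vector128.Create(buckets.Slice(i, 16));
            uint mask = Vector128.Equals(search, bucket).ExtractMostSignificantBits();

            // The interface call is the part that matters: once hoisted, `search`
            // has to stay live across this call on every iteration.
            if (mask != 0 && comparer.Equals(keys[i / 16], key))
                hits++;
        }
        return hits;
    }
}
```

On Windows x64, a value that is live across a call cannot sit in a volatile xmm register, which is why a hoisted broadcast would end up in a callee-saved register such as xmm6.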
Reproduction Steps
This demonstrates the LICM in a toy scenario. For a real-world scenario, https://github.com/kg/SimdDictionary/blob/e42d42f7158d2552c97e1a517b82d4bb32750456/DisasmHarness/DisasmHarness.cs#L14 when disasmo'd will demonstrate it.
Expected behavior
Neither of the vpbroadcastb's should be LICM'd out of the loop if they end up in a nonvolatile register IMO, since doing that forces a stack spill at entry and a stack restore at exit. LICMing them into volatile registers feels fine.
Actual behavior
The provably-constant vpbroadcastb is LICM'd out of the loop, which is correct but potentially deleterious for performance.
Regression?
Don't know. Can't test since my real use case won't work on NET7.
Known Workarounds
EDIT: Changing the type of the scalar parameter to in byte prevents the broadcast from being LICM'd. Putting a Thread.MemoryBarrier doesn't stop the broadcast from moving out of the loop (past the barrier), which is a little surprising, but probably completely safe. You can also block LICM by falsely taking the address of the local containing the scalar.
EDIT 2: Inserting a Sse2.MemoryFence() right before the broadcast operation stops it from being LICM'd, but it's pretty clear that it isn't a good solution.
EDIT 3: A Sse.Prefetch0(...) also works as a LICM barrier but seems to generate extra wasted code to calculate the address to prefetch for some reason.
Configuration
Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.11.3
Other information
This vector LICM seems like it is probably profitable - even with the stack spill - for more expensive vector operations. vpbroadcastb has extremely low latency/cost from what I know though, so in this case it's potentially worse. The code I'm optimizing typically runs its loop body 0-1 times, so the LICM causes a performance hit.
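For reference, two of the workarounds listed under Known Workarounds could look roughly like the sketch below. `LicmWorkaroundSketch` and both method names are hypothetical, the loop bodies are simplified (no comparer call), and the claim that these forms block hoisting is the observation reported in this issue for .NET 8, not a guarantee.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class LicmWorkaroundSketch
{
    // "EDIT" workaround: take the scalar as `in byte`, so the broadcast reads
    // through a reference instead of from a local value.
    public static int CountWithInParameter(ReadOnlySpan<byte> buckets, in byte scalar)
    {
        int hits = 0;
        for (int i = 0; i + 16 <= buckets.Length; i += 16)
        {
            Vector128<byte> search = Vector128.Create(scalar);
            if (Vector128.EqualsAny(search, Vector128.Create(buckets.Slice(i, 16))))
                hits++;
        }
        return hits;
    }

    // "EDIT 2" workaround: a fence right before the broadcast. This adds a real
    // mfence to every iteration, which is why it is not a good solution.
    public static int CountWithFence(ReadOnlySpan<byte> buckets, byte scalar)
    {
        int hits = 0;
        for (int i = 0; i + 16 <= buckets.Length; i += 16)
        {
            if (Sse2.IsSupported)
                Sse2.MemoryFence();
            Vector128<byte> search = Vector128.Create(scalar);
            if (Vector128.EqualsAny(search, Vector128.Create(buckets.Slice(i, 16))))
                hits++;
        }
        return hits;
    }
}
```

Both variants return the same counts as the plain loop; only the placement of the broadcast in the generated code is expected to differ.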