Open matthias314 opened 1 month ago
I'm seeing ~8 seconds on 1.10, so I'm not fully sure what's going on here.
What processor do you have? I should have pointed out that on x86-64 and i686 with BMI2 the code uses the PDEP instruction (via llvm.x86.bmi.pdep.64
). Without BMI2 and for other architectures it is emulated in software. This instruction is used heavily, so results could be quite different.
(I've just checked: PDEP does get used with 1.11.0-rc1.)
I have a Zen 2 (3600) which iirc also has BMI2.
oh, Zen2 only has micro-coded PDEP that LLVM is likely avoiding since it is slow. So the issue seems to be whether or not LLVM is emitting PDEP instructions where it should.
Actually looking at the code shows that I'm getting pdeps emitted in both 1.10 and 1.11. The difference seems to be that 1.11 is spilling and reloading a ton of variables. e.g:
.LBB0_15: # %guard_pass1430
# in Loop: Header=BB0_2 Depth=1
mov r13, qword ptr [rbp - 792] # 8-byte Reload
mov rax, qword ptr [rbp - 120] # 8-byte Reload
mov rcx, qword ptr [rbp - 568] # 8-byte Reload
mov rsi, qword ptr [rbp - 536] # 8-byte Reload
mov r8, qword ptr [rbp - 560] # 8-byte Reload
mov r11, qword ptr [rbp - 408] # 8-byte Reload
pdep rdi, r13, rax
mov qword ptr [rbp - 656], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 544] # 8-byte Reload
mov qword ptr [rbp - 176], rsi # 8-byte Spill
mov rsi, qword ptr [rbp - 472] # 8-byte Reload
mov qword ptr [rbp - 648], r8 # 8-byte Spill
mov r8, qword ptr [rbp - 528] # 8-byte Reload
mov qword ptr [rbp - 160], rsi # 8-byte Spill
mov rsi, qword ptr [rbp - 552] # 8-byte Reload
mov qword ptr [rbp - 616], rcx # 8-byte Spill
mov qword ptr [rbp - 88], rcx # 8-byte Spill
mov qword ptr [rbp - 632], r8 # 8-byte Spill
mov qword ptr [rbp - 640], rsi # 8-byte Spill
mov rsi, qword ptr [rbp - 520] # 8-byte Reload
mov qword ptr [rbp - 624], rsi # 8-byte Spill
mov rsi, qword ptr [rbp - 504] # 8-byte Reload
mov qword ptr [rbp - 600], rsi # 8-byte Spill
mov qword ptr [rbp - 184], rsi # 8-byte Spill
mov r12, rdi
xor r12, rax
mov al, 1
mov dword ptr [rbp - 48], eax # 4-byte Spill
mov rax, rcx
mov rcx, qword ptr [rbp - 496] # 8-byte Reload
mov rax, qword ptr [rbp - 576] # 8-byte Reload
mov qword ptr [rbp - 152], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 488] # 8-byte Reload
mov qword ptr [rbp - 208], rax # 8-byte Spill
mov rax, rcx
mov qword ptr [rbp - 608], rcx # 8-byte Spill
mov qword ptr [rbp - 128], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 424] # 8-byte Reload
mov rax, qword ptr [rbp - 480] # 8-byte Reload
mov qword ptr [rbp - 376], rcx # 8-byte Spill
mov rcx, rsi
mov rcx, qword ptr [rbp - 416] # 8-byte Reload
mov rsi, qword ptr [rbp - 456] # 8-byte Reload
mov qword ptr [rbp - 168], rax # 8-byte Spill
mov rax, qword ptr [rbp - 400] # 8-byte Reload
mov qword ptr [rbp - 592], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 512] # 8-byte Reload
mov qword ptr [rbp - 200], rsi # 8-byte Spill
mov rsi, qword ptr [rbp - 440] # 8-byte Reload
mov qword ptr [rbp - 584], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 448] # 8-byte Reload
mov qword ptr [rbp - 192], rsi # 8-byte Spill
mov qword ptr [rbp - 672], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 432] # 8-byte Reload
mov qword ptr [rbp - 680], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 392] # 8-byte Reload
mov qword ptr [rbp - 112], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 384] # 8-byte Reload
mov qword ptr [rbp - 136], rcx # 8-byte Spill
mov rcx, qword ptr [rbp - 464] # 8-byte Reload
mov qword ptr [rbp - 664], rcx # 8-byte Spill
@topolarity or @gbaraldi any idea why the codegen would be this bad?
I don't know if this adds to the mystery or clears it up, but here is another function that I would expect to be equivalent to the f
from above:
function g(sh)
s = Ref(0)
@inline foreach(sh) do (t, _)
s[] += sum(length, t)
end
s[]
end
With this g
, however, the difference disappears:
julia> @b g($sh)
1.709 s (without a warmup) # 1.10.4
1.718 s (without a warmup) # 1.11.0-rc1
Potentially https://github.com/llvm/llvm-project/issues/78506 ? Or excessive unrolling?
8e4221f676cc0cc242b93389b2f65084931e581c is the first bad commit Date: Wed Jan 10 03:47:02 2024 +0100 Bump Julia to LLVM 16 (#51720)
@gbaraldi In my example, iterate(sh)
is implemented recursively over shuffles
of smaller length. With 7 parameters, one therefore has many state variables:
julia> sizeof(iterate(sh)[2])
240
This sounds indeed like the "lots of temporary variables" mentioned in llvm/llvm-project#78506.
The problem is also present in master (LLVM 18), but to a smaller extent: There I see a 2.4x slowdown instead of 5.4x for 1.11.0-rc2.
Julia Version 1.12.0-DEV.989
Commit 40ecf69019 (2024-08-05 12:54 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
As mentioned in the updated OP, one can now use the latest published version (v0.3.0) of SmallCollections.jl to reproduce the issue.
possible dup: #52933
While working on my package SmallCollections.jl, I've noticed a significant slowdown in 1.11.0-rc1:
EDIT:
sh = shuffles(2, 2, 2, 2, 2, 2)
(6 arguments) is enough for a 5x slowdown.The functionality is not yet in the published version, nor in master.You can use the latest published version (v0.3.0) of SmallCollections.jl.