5x slowdown in 1.11.0-rc1 compared to 1.10.4

matthias314 commented 1 month ago

While working on my package SmallCollections.jl, I've noticed a significant slowdown in 1.11.0-rc1:

using SmallCollections, Chairmarks

sh = shuffles(2, 2, 2, 2, 2, 2, 2)
# this is an iterator over all ordered partitions of 1:14
# into 7 2-element sets, plus a permutation sign

function f(sh)
    s = 0
    for (t, _) in sh
        # "t" is the ordered partition, "_" would be the permutation sign
        s += sum(length, t)   # the right-hand side is just 7*2 = 14
    end
    s
end

EDIT: sh = shuffles(2, 2, 2, 2, 2, 2) (6 arguments) is enough for a 5x slowdown.

julia> @b f($sh)
1.718 s (without a warmup)   # 1.10.4
9.277 s (without a warmup)   # 1.11.0-rc1

~~The functionality is not yet in the published version, nor in master.~~ You can use the latest published version (v0.3.0) of SmallCollections.jl.

Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

oscardssmith commented 1 month ago

I'm seeing ~8 seconds on 1.10, so I'm not fully sure what's going on here.

matthias314 commented 1 month ago

What processor do you have? I should have pointed out that on x86-64 and i686 with BMI2 the code uses the PDEP instruction (via llvm.x86.bmi.pdep.64). Without BMI2 and for other architectures it is emulated in software. This instruction is used heavily, so results could be quite different.

(I've just checked: PDEP does get used with 1.11.0-rc1.)

oscardssmith commented 1 month ago

I have a Zen 2 (3600) which iirc also has BMI2.

oscardssmith commented 1 month ago

oh, Zen2 only has micro-coded PDEP that LLVM is likely avoiding since it is slow. So the issue seems to be whether or not LLVM is emitting PDEP instructions where it should.

oscardssmith commented 1 month ago

Actually looking at the code shows that I'm getting pdeps emitted in both 1.10 and 1.11. The difference seems to be that 1.11 is spilling and reloading a ton of variables. e.g:

.LBB0_15:                               # %guard_pass1430
                                        #   in Loop: Header=BB0_2 Depth=1
    mov r13, qword ptr [rbp - 792]      # 8-byte Reload
    mov rax, qword ptr [rbp - 120]      # 8-byte Reload
    mov rcx, qword ptr [rbp - 568]      # 8-byte Reload
    mov rsi, qword ptr [rbp - 536]      # 8-byte Reload
    mov r8, qword ptr [rbp - 560]       # 8-byte Reload
    mov r11, qword ptr [rbp - 408]      # 8-byte Reload
    pdep    rdi, r13, rax
    mov qword ptr [rbp - 656], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 544]      # 8-byte Reload
    mov qword ptr [rbp - 176], rsi      # 8-byte Spill
    mov rsi, qword ptr [rbp - 472]      # 8-byte Reload
    mov qword ptr [rbp - 648], r8       # 8-byte Spill
    mov r8, qword ptr [rbp - 528]       # 8-byte Reload
    mov qword ptr [rbp - 160], rsi      # 8-byte Spill
    mov rsi, qword ptr [rbp - 552]      # 8-byte Reload
    mov qword ptr [rbp - 616], rcx      # 8-byte Spill
    mov qword ptr [rbp - 88], rcx       # 8-byte Spill
    mov qword ptr [rbp - 632], r8       # 8-byte Spill
    mov qword ptr [rbp - 640], rsi      # 8-byte Spill
    mov rsi, qword ptr [rbp - 520]      # 8-byte Reload
    mov qword ptr [rbp - 624], rsi      # 8-byte Spill
    mov rsi, qword ptr [rbp - 504]      # 8-byte Reload
    mov qword ptr [rbp - 600], rsi      # 8-byte Spill
    mov qword ptr [rbp - 184], rsi      # 8-byte Spill
    mov r12, rdi
    xor r12, rax
    mov al, 1
    mov dword ptr [rbp - 48], eax       # 4-byte Spill
    mov rax, rcx
    mov rcx, qword ptr [rbp - 496]      # 8-byte Reload
    mov rax, qword ptr [rbp - 576]      # 8-byte Reload
    mov qword ptr [rbp - 152], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 488]      # 8-byte Reload
    mov qword ptr [rbp - 208], rax      # 8-byte Spill
    mov rax, rcx
    mov qword ptr [rbp - 608], rcx      # 8-byte Spill
    mov qword ptr [rbp - 128], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 424]      # 8-byte Reload
    mov rax, qword ptr [rbp - 480]      # 8-byte Reload
    mov qword ptr [rbp - 376], rcx      # 8-byte Spill
    mov rcx, rsi
    mov rcx, qword ptr [rbp - 416]      # 8-byte Reload
    mov rsi, qword ptr [rbp - 456]      # 8-byte Reload
    mov qword ptr [rbp - 168], rax      # 8-byte Spill
    mov rax, qword ptr [rbp - 400]      # 8-byte Reload
    mov qword ptr [rbp - 592], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 512]      # 8-byte Reload
    mov qword ptr [rbp - 200], rsi      # 8-byte Spill
    mov rsi, qword ptr [rbp - 440]      # 8-byte Reload
    mov qword ptr [rbp - 584], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 448]      # 8-byte Reload
    mov qword ptr [rbp - 192], rsi      # 8-byte Spill
    mov qword ptr [rbp - 672], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 432]      # 8-byte Reload
    mov qword ptr [rbp - 680], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 392]      # 8-byte Reload
    mov qword ptr [rbp - 112], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 384]      # 8-byte Reload
    mov qword ptr [rbp - 136], rcx      # 8-byte Spill
    mov rcx, qword ptr [rbp - 464]      # 8-byte Reload
    mov qword ptr [rbp - 664], rcx      # 8-byte Spill

@topolarity or @gbaraldi any idea why the codegen would be this bad?

matthias314 commented 1 month ago

I don't know if this adds to the mystery or clears it up, but here is another function that I would expect to be equivalent to the f from above:

function g(sh)
    s = Ref(0)
    @inline foreach(sh) do (t, _)
        s[] += sum(length, t)
    end
    s[]
end

With this g, however, the difference disappears:

julia> @b g($sh)
1.709 s (without a warmup)   # 1.10.4
1.718 s (without a warmup)   # 1.11.0-rc1

gbaraldi commented 1 month ago

Potentially https://github.com/llvm/llvm-project/issues/78506 ? Or excessive unrolling?

KristofferC commented 1 month ago

8e4221f676cc0cc242b93389b2f65084931e581c is the first bad commit Date: Wed Jan 10 03:47:02 2024 +0100 Bump Julia to LLVM 16 (#51720)

matthias314 commented 1 month ago

@gbaraldi In my example, iterate(sh) is implemented recursively over shuffles of smaller length. With 7 parameters, one therefore has many state variables:

julia> sizeof(iterate(sh)[2])
240

This sounds indeed like the "lots of temporary variables" mentioned in llvm/llvm-project#78506.

matthias314 commented 4 weeks ago

The problem is also present in master (LLVM 18), but to a smaller extent: There I see a 2.4x slowdown instead of 5.4x for 1.11.0-rc2.

Julia Version 1.12.0-DEV.989
Commit 40ecf69019 (2024-08-05 12:54 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

As mentioned in the updated OP, one can now use the latest published version (v0.3.0) of SmallCollections.jl to reproduce the issue.

nsajko commented 4 weeks ago

possible dup: #52933

JuliaLang / julia

5x slowdown in 1.11.0-rc1 compared to 1.10.4 #55227