codegen regression for simple loops

stevengj commented 10 years ago

From the mailing list, the following code:

function f{T}(x::T)
    result::T = zero(T)
    for n = one(T):x
        result += x
    end
    result
end

generates a fairly straightforward loop with code_native(f, (Int,)) for Julia 0.2:

Source line: 4
    push    RBP
    mov RBP, RSP
    xor EAX, EAX
    test    RDI, RDI
    jle 20
    mov ECX, 1
Source line: 4
    add RAX, RDI
    inc RCX
    cmp RCX, RDI
Source line: 3
    jle -15
Source line: 6
    pop RBP
    ret

but generates an insane mess for the current master:

Source line: 4
    push    RBP
    mov RBP, RSP
    push    R15
    push    R14
    push    R12
    push    RBX
    mov R15, RDI
    xor EBX, EBX
    test    R15, R15
Source line: 4
    cmovns  RBX, R15
    movabs  RAX, 4320451144
Source line: 3
    mov R14, QWORD PTR [RAX]
    mov R12D, 1
Source line: 4
    movabs  RAX, 4329045984
    mov EDI, 1
    mov RSI, RBX
    call    RAX
    inc RBX
    cmp RAX, RBX
    je  83
    lea RCX, QWORD PTR [R15 + 1]
    test    R15, R15
    cmovg   R12, RCX
    mov RCX, R12
    sub RCX, RAX
    mov RDX, RCX
    and RDX, -4
    je  13
    add RAX, RDX
    add RDX, -4
    jne -10
    cmp R12, RAX
    je  24
    xor EDX, EDX
    test    R15, R15
    cmovns  RDX, R15
    inc RDX
    sub RDX, RAX
Source line: 5
    dec RDX
    jne -9
Source line: 4
    imul    RCX, R15
    add R14, RCX
Source line: 7
    mov RAX, R14
    pop RBX
    pop R12
    pop R14
    pop R15
    pop RBP
    ret

More fallout from the Range changes?

stevengj commented 10 years ago

On the mailing list, it was also mentioned that gcc and clang can recognize that this code is computing f(n) = n^2 and compute it as such. It's curious that we aren't getting the same optimization out of LLVM but I'm not too concerned unless that is a symptom of a deeper problem. (I don't generally view it as the compiler's job to sum series for me, and I'm skeptical that this has much real-world benefit.)

jakebolewski commented 10 years ago

Is there an actual performance regression though?

JeffBezanson commented 10 years ago

I'm going to pin the blame on the loop vectorizer. Disabling FPM->add(createLoopVectorizePass()); yields much more compact code, which also seems to include the x^2 optimization. That is an improvement from 0.2 probably due to my change to lowering of loops to make them match LLVM's patterns better.

cc @archrobison

I suspect we should not run the loop vectorizer on every function.

ArchRobison commented 10 years ago

I agree that the loop vectorizer should be keeping its paws off that loop, particularly if it's not going to emit vector instructions. I'll look into why the loop vectorizer is affecting it.

cbecker commented 10 years ago

This is very interesting though

Julia v0.2

julia> @time f( int(1e9) )
elapsed time: 0.408083256 seconds (80 bytes allocated)
1000000000000000000

julia> versioninfo()
Julia Version 0.2.1
Commit e44b593 (2014-02-11 06:30 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

Julia v0.3 (master a few commits ago today)

julia> @time f(int(1e9))
elapsed time: 0.069002847 seconds (80 bytes allocated)
1000000000000000000

Julia Version 0.3.0-prerelease+2405
Commit d265f4d* (2014-04-02 07:05 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

@ArchRobison I wonder whether the loop vectorizer is making things worse or better here, but there is obviously a big performance kick.

carlobaldassi commented 10 years ago

I also see a 4-fold improvement from 0.2 to 0.3 (on a Intel i5).

StefanKarpinski commented 10 years ago

I'm seeing a 7.5x speedup from 0.2 to the crazy-ass 0.3 version.

jakebolewski commented 10 years ago

Always profile :-)

cbecker commented 10 years ago

Now, disabling the code vectorizer line in codegen.cpp I get 0.38 sec, so the vectorizer is indeed doing something amazing !

JeffBezanson commented 10 years ago

If I add a loop deletion pass after the vectorization pass, it does cut out a bunch of the code but there doesn't seem to be a performance difference.

StefanKarpinski commented 10 years ago

This is one of the few places where I think genetic programming would be really helpful.

ArchRobison commented 10 years ago

Adding the loop deletion pass sounds like the right answer. Clang (from LLVM trunk) appears to apply the loop deletion pass several passes before the loop vectorizer. I guess we need to experiment with where to place it (or used Stefan's genetic scheme).

From inspecting code_llvm(f, (Int,)), it appears that the vectorizer is happily vectorizing the code. Variable result is just an induction variable, thus its live-out value has a closed-form formula. So the vectorized loop body becomes empty except for the loop counter. The code is even weirder with LLVM 3.5, where some pass (I don't know which) applies Duff's device to the useless loop. [See correction further below - it's not Duff's Device.]

StefanKarpinski commented 10 years ago

Isn't Duff's device just loop unrolling aside from the peculiar/brilliant way of expressing it in C?

ArchRobison commented 10 years ago

The loop deletion pass is in the right place (where Clang has it) in src/codegen.cpp, just before the loop unroller. But it's commented out.

I tried enabling the loop deletion pass in both the commented-out place, and in the later position immediately after the vectorizer. Running in the earlier position generated slightly fewer instructions.

ArchRobison commented 10 years ago

@StefanKarpinski : Upon rereading the LLVM, I see it's not really Duff's device, since the switch statement does not branch into the loop. The LLVM code is just a switch with fall-throughs, to deal with the remainder portion of the chunked useless loop..

simonster commented 10 years ago

Huh, it also looks like start(UnitRange{Int}) isn't being inlined.

simonster commented 10 years ago

Now it's inlined, and the useless loop is 8x faster, because it moves in steps of 32 instead of 4. (EDIT: This is architecture-dependent. Sometimes it moves in steps of 128.) 8x performance boost from #3796, w00t.

cbecker commented 10 years ago

I can also see a huge speed gain, 120x compared to v0.2.1 (0.003sec vs 0.38sec) This is very good news!

carlobaldassi commented 10 years ago

With latest master I now see both a drastic reduction in codegen (w.r.t. yesterday) and a huge speedup (gain w.r.t. v0.2 went from 2-fold to 25-fold).

v0.3:

julia> code_native(f, (Int,))
    .text
Filename: none
Source line: 2
    push    RBP
    mov RBP, RSP
Source line: 2
    mov RAX, QWORD PTR [33337400]
    test    RDI, RDI
    jle 68
    mov RCX, RDI
    and RCX, -32
    lea RDX, QWORD PTR [RCX + 1]
    mov ESI, 1
    cmp RDX, 1
    je  13
    add RCX, -32
    jne -10
    mov RSI, RDX
Source line: 4
    lea RCX, QWORD PTR [RDI + 1]
    sub RCX, RSI
    je  9
    dec RCX
    jne -9
Source line: 3
    imul    RDI, RDI
    add RAX, RDI
Source line: 6
    pop RBP
    ret

julia> @time f(10^9)
elapsed time: 0.028972513 seconds (8688 bytes allocated)
1000000000000000000

v0.2:

julia> code_native(f, (Int,))
    .text
Filename: none
Source line: 4
    push    RBP
    mov RBP, RSP
    xor EAX, EAX
    test    RDI, RDI
    jle 20
    mov ECX, 1
Source line: 4
    add RAX, RDI
    inc RCX
    cmp RCX, RDI
Source line: 3
    jle -15
Source line: 6
    pop RBP
    ret

julia> @time f(10^9)
elapsed time: 0.810938788 seconds (8920 bytes allocated)
1000000000000000000

Seems like this particular issue is solved :)

stevengj commented 10 years ago

Isn't there still a loop that does nothing in the code, which bloats the code size if nothing else?

StefanKarpinski commented 10 years ago

It's unclear to me why LLVM is leaving that loop in there...

ArchRobison commented 10 years ago

I suspect it's because the fancy loop passes expect the loop deletion pass to deal with it, and it's commented it out. Here's the spot in src/codegen.cpp:

    FPM->add(createIndVarSimplifyPass());       // Canonicalize indvars
    //FPM->add(createLoopDeletionPass());         // Delete dead loops
    FPM->add(createLoopUnrollPass());           // Unroll small loops
    //FPM->add(createLoopStrengthReducePass());   // (jwb added)

#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR >= 3
    FPM->add(createLoopVectorizePass());        // Vectorize loops
#endif

Putting back the createLoopDeletionPass seems to have no noticeable impact on the time to build user/lib/julia/*.

Keno commented 10 years ago

Closed by @ArchRobison in #6402

JuliaLang / julia

codegen regression for simple loops #6382