Closed stevengj closed 10 years ago
On the mailing list, it was also mentioned that gcc
and clang
can recognize that this code is computing f(n) = n^2
and compute it as such. It's curious that we aren't getting the same optimization out of LLVM but I'm not too concerned unless that is a symptom of a deeper problem. (I don't generally view it as the compiler's job to sum series for me, and I'm skeptical that this has much real-world benefit.)
Is there an actual performance regression though?
I'm going to pin the blame on the loop vectorizer. Disabling FPM->add(createLoopVectorizePass());
yields much more compact code, which also seems to include the x^2
optimization. That is an improvement from 0.2 probably due to my change to lowering of loops to make them match LLVM's patterns better.
cc @archrobison
I suspect we should not run the loop vectorizer on every function.
I agree that the loop vectorizer should be keeping its paws off that loop, particularly if it's not going to emit vector instructions. I'll look into why the loop vectorizer is affecting it.
This is very interesting though
Julia v0.2
julia> @time f( int(1e9) )
elapsed time: 0.408083256 seconds (80 bytes allocated)
1000000000000000000
julia> versioninfo()
Julia Version 0.2.1
Commit e44b593 (2014-02-11 06:30 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
Julia v0.3 (master a few commits ago today)
julia> @time f(int(1e9))
elapsed time: 0.069002847 seconds (80 bytes allocated)
1000000000000000000
Julia Version 0.3.0-prerelease+2405
Commit d265f4d* (2014-04-02 07:05 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
@ArchRobison I wonder whether the loop vectorizer is making things worse or better here, but there is obviously a big performance kick.
I also see a 4-fold improvement from 0.2 to 0.3 (on a Intel i5).
I'm seeing a 7.5x speedup from 0.2 to the crazy-ass 0.3 version.
Always profile :-)
Now, disabling the code vectorizer line in codegen.cpp
I get 0.38 sec, so the vectorizer is indeed doing something amazing !
If I add a loop deletion pass after the vectorization pass, it does cut out a bunch of the code but there doesn't seem to be a performance difference.
This is one of the few places where I think genetic programming would be really helpful.
Adding the loop deletion pass sounds like the right answer. Clang (from LLVM trunk) appears to apply the loop deletion pass several passes before the loop vectorizer. I guess we need to experiment with where to place it (or used Stefan's genetic scheme).
From inspecting code_llvm(f, (Int,))
, it appears that the vectorizer is happily vectorizing the code. Variable result
is just an induction variable, thus its live-out value has a closed-form formula. So the vectorized loop body becomes empty except for the loop counter. The code is even weirder with LLVM 3.5, where some pass (I don't know which) applies Duff's device to the useless loop. [See correction further below - it's not Duff's Device.]
Isn't Duff's device just loop unrolling aside from the peculiar/brilliant way of expressing it in C?
The loop deletion pass is in the right place (where Clang has it) in src/codegen.cpp
, just before the loop unroller. But it's commented out.
I tried enabling the loop deletion pass in both the commented-out place, and in the later position immediately after the vectorizer. Running in the earlier position generated slightly fewer instructions.
@StefanKarpinski : Upon rereading the LLVM, I see it's not really Duff's device, since the switch statement does not branch into the loop. The LLVM code is just a switch with fall-throughs, to deal with the remainder portion of the chunked useless loop..
Huh, it also looks like start(UnitRange{Int})
isn't being inlined.
Now it's inlined, and the useless loop is 8x faster, because it moves in steps of 32 instead of 4. (EDIT: This is architecture-dependent. Sometimes it moves in steps of 128.) 8x performance boost from #3796, w00t.
I can also see a huge speed gain, 120x compared to v0.2.1 (0.003sec vs 0.38sec) This is very good news!
With latest master I now see both a drastic reduction in codegen (w.r.t. yesterday) and a huge speedup (gain w.r.t. v0.2 went from 2-fold to 25-fold).
v0.3:
julia> code_native(f, (Int,))
.text
Filename: none
Source line: 2
push RBP
mov RBP, RSP
Source line: 2
mov RAX, QWORD PTR [33337400]
test RDI, RDI
jle 68
mov RCX, RDI
and RCX, -32
lea RDX, QWORD PTR [RCX + 1]
mov ESI, 1
cmp RDX, 1
je 13
add RCX, -32
jne -10
mov RSI, RDX
Source line: 4
lea RCX, QWORD PTR [RDI + 1]
sub RCX, RSI
je 9
dec RCX
jne -9
Source line: 3
imul RDI, RDI
add RAX, RDI
Source line: 6
pop RBP
ret
julia> @time f(10^9)
elapsed time: 0.028972513 seconds (8688 bytes allocated)
1000000000000000000
v0.2:
julia> code_native(f, (Int,))
.text
Filename: none
Source line: 4
push RBP
mov RBP, RSP
xor EAX, EAX
test RDI, RDI
jle 20
mov ECX, 1
Source line: 4
add RAX, RDI
inc RCX
cmp RCX, RDI
Source line: 3
jle -15
Source line: 6
pop RBP
ret
julia> @time f(10^9)
elapsed time: 0.810938788 seconds (8920 bytes allocated)
1000000000000000000
Seems like this particular issue is solved :)
Isn't there still a loop that does nothing in the code, which bloats the code size if nothing else?
It's unclear to me why LLVM is leaving that loop in there...
I suspect it's because the fancy loop passes expect the loop deletion pass to deal with it, and it's commented it out. Here's the spot in src/codegen.cpp
:
FPM->add(createIndVarSimplifyPass()); // Canonicalize indvars
//FPM->add(createLoopDeletionPass()); // Delete dead loops
FPM->add(createLoopUnrollPass()); // Unroll small loops
//FPM->add(createLoopStrengthReducePass()); // (jwb added)
#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR >= 3
FPM->add(createLoopVectorizePass()); // Vectorize loops
#endif
Putting back the createLoopDeletionPass
seems to have no noticeable impact on the time to build user/lib/julia/*.
Closed by @ArchRobison in #6402
From the mailing list, the following code:
generates a fairly straightforward loop with
code_native(f, (Int,))
for Julia 0.2:but generates an insane mess for the current
master
:More fallout from the
Range
changes?