jwmerrill opened 9 years ago
On 0.2.1:
```
julia> code_llvm(sin, (Float64,))

define double @julia_sin(double) {
top:
  %1 = call double @sin(double %0), !dbg !3342
  %2 = fcmp ord double %1, 0.000000e+00, !dbg !3342
  %3 = fcmp uno double %0, 0.000000e+00, !dbg !3342
  %4 = or i1 %2, %3, !dbg !3342
  br i1 %4, label %pass, label %fail, !dbg !3342

fail:                       ; preds = %top
  %5 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !3342
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 282), !dbg !3342
  unreachable, !dbg !3342

pass:                       ; preds = %top
  ret double %1, !dbg !3342
}
```
On master:
```llvm
define double @julia_sin_43255(double) {
top:
  %1 = call double inttoptr (i64 4668253248 to double (double)*)(double %0), !dbg !265
  %2 = fcmp ord double %1, 0.000000e+00, !dbg !265
  %3 = fcmp uno double %0, 0.000000e+00, !dbg !265
  %4 = or i1 %2, %3, !dbg !265
  br i1 %4, label %pass, label %fail, !dbg !265

fail:                       ; preds = %top
  %5 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !265, !tbaa %jtbaa_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 123), !dbg !265
  unreachable, !dbg !265

pass:                       ; preds = %top
  ret double %1, !dbg !265
}
```
I'd guess that for LLVM to constant fold `sin`, it has to know that it's named `sin`, and somewhere along the line (precompilation?) we stopped giving it that information.
Correct. We now force LLVM to use the `sin` in libopenlibm, rather than giving it the freedom to pick any function named `sin`.
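A quick way to see whether folding happens is to inspect the IR for a call with a literal argument (a sketch; in a build where LLVM recognizes the call as the well-known `sin`, the body reduces to a single `ret double <constant>` with no call instruction):

```julia
# If LLVM can constant-fold the call, the emitted IR for `g` contains no
# call at all -- just `ret double -0x...` holding sin(3.4).
g() = sin(3.4)
code_llvm(g, Tuple{})
```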
I do not get the performance benefit from constant folding; here is the output from `code_llvm(sumofsins2, (Int,))`:
```
julia> code_llvm(sumofsins2, (Int,))

define double @julia_sumofsins2_1172(i64) {
top:
  %1 = icmp sgt i64 %0, 0, !dbg !3561
  br i1 %1, label %L, label %L3, !dbg !3561

L:                          ; preds = %top, %pass
  %r.0 = phi double [ %6, %pass ], [ 0.000000e+00, %top ]
  %"#s3.0" = phi i64 [ %5, %pass ], [ 1, %top ]
  %2 = call double inttoptr (i64 1752503104 to double (double)*)(double 3.400000e+00), !dbg !3562
  %3 = fcmp ord double %2, 0.000000e+00, !dbg !3562
  br i1 %3, label %pass, label %fail, !dbg !3562

fail:                       ; preds = %L
  %4 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !3562, !tbaa %jtbaa_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %4, i32 4), !dbg !3562
  unreachable, !dbg !3562

pass:                       ; preds = %L
  %5 = add i64 %"#s3.0", 1, !dbg !3561
  %6 = fadd double %r.0, %2, !dbg !3562
  %7 = icmp eq i64 %"#s3.0", %0, !dbg !3562
  br i1 %7, label %L3, label %L, !dbg !3562

L3:                         ; preds = %pass, %top
  %r.1 = phi double [ 0.000000e+00, %top ], [ %6, %pass ]
  ret double %r.1, !dbg !3565
}
```
My version info is:
```
julia> versioninfo()
Julia Version 0.4.0-dev+2847
Commit fc61385 (2015-01-21 18:34 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```
LLVM defines intrinsics for many operations (`sqrt`, `sin`, etc.). It seems we need to generate these intrinsics instead of libcalls to benefit from LLVM's optimizations. Does this sound reasonable? It looks straightforward but tedious.
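For experimentation, later Julia versions let you call an LLVM intrinsic directly via `ccall` with the `llvmcall` calling convention (a sketch; intrinsic names follow LLVM's `llvm.<op>.f64` scheme, and whether this is faster depends on how LLVM lowers the intrinsic on a given target):

```julia
# Call the llvm.sin.f64 intrinsic instead of the openlibm libcall; LLVM is
# then free to constant-fold the call or lower it however it prefers.
llvm_sin(x::Float64) = ccall("llvm.sin.f64", llvmcall, Float64, (Float64,), x)

llvm_sin(3.4)   # same value as sin(3.4), modulo lowering
```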
We would also need to make the LLVM intrinsics call openlibm instead of the system libm.
Can you remind me why we need to call openlibm? This prevents not only constant folding, but also the use of hardware instructions where they are available, e.g. `vsqrtpd` on Intel.
LLVM tries to intercept libcalls to certain well-known functions, such as `sqrt` in the global namespace. I assume (I couldn't find the logic yet) that it thus also replaces intrinsics with such libcalls if necessary. To ensure it calls openlibm instead of libm, we would need to declare these openlibm functions to LLVM's code generators.
Regarding making LLVM call openlibm instead of libm: if Julia is configured with UNTRUSTED_SYSTEM_LIBM, then it uses openlibm instead of libm -- and this includes code generated by LLVM. What am I missing?
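For context, that configuration is a build-time flag; a sketch of the relevant Make.user line (flag name as used in this thread; the exact spelling in Make.inc may differ by Julia version):

```make
# Make.user: distrust the system libm and route libm calls (including those
# in LLVM-generated code) through openlibm instead.
UNTRUSTED_SYSTEM_LIBM = 1
```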
Given this, adding new intrinsics is easy. I added the `sin` intrinsic for fun -- it's three lines in src/intrinsics.cpp, plus the respective change in base/math.jl.
See https://github.com/eschnett/julia/tree/math-intrinsics for the prototype implementation.
I looked at http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/ again with this branch, and there is unfortunately no speedup. LLVM correctly recognizes the `sin` intrinsic, but it still doesn't constant-fold it. For reference, here is the inner loop as generated on my system:
```asm
L78:  vmovsd   %xmm0, -24(%rbp)
Source line: 12
      vmovsd   -32(%rbp), %xmm0
      callq    *%r14
      vucomisd %xmm0, %xmm0
      jp       L132
      vmovsd   -24(%rbp), %xmm1
      vaddsd   %xmm0, %xmm1, %xmm1
      vmovaps  %xmm1, %xmm0
      decq     %rbx
      jne      L78
```
The `callq` is still there. Thus, to enable constant folding, it is likely necessary to address #5234.
Update: no speedup when building against LLVM 3.3 (the default), but with LLVM 3.6 the speedup is restored. Apparently LLVM 3.3 still lacks certain optimizations for `sin`.
Note that #14324 is also relevant here — this issue is not really specific to libm functions and LLVM intrinsics.
Any idea how to progress on this? This feels a bit sad (latest master):
```julia
f(x) = cos(x) + cos(x)

function f2(x)
    c = cos(x)
    c + c
end

@time for i = 1:10^6 f(2.0) end
# 0.021374 seconds

@time for i = 1:10^6 f2(2.0) end
# 0.010027 seconds
```
@KristofferC, that particular case would involve CSE (common subexpression elimination) for `@pure` functions, which is a bit different from the issue here.
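To illustrate the distinction: merging the two identical `cos` calls is CSE, which requires the compiler to know the call is pure. A hedged sketch using the internal `Base.@pure` annotation (easy to misuse; shown only as an illustration, not a recommendation):

```julia
# Base.@pure asserts the result depends only on the arguments and has no
# side effects, which licenses the compiler to merge duplicate calls.
Base.@pure pure_cos(x::Float64) = cos(x)

f_cse(x) = pure_cos(x) + pure_cos(x)   # the two calls may be merged into one
```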
At this point, we only use libm for `expm1`, `log2`, `log10`, and some rounding functions. `log2` and `log10` could probably both be moved to pure Julia pretty easily. I'll look into it.
https://groups.google.com/forum/#!topic/julia-users/Jndl9sYwj5Q reports a performance regression in some simple code from a blog post that was meant to illustrate the importance of type stability: http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/
The code is the `sumofsins2` example from that post, and the blog post gives the output (as of Julia 0.2.x) of `code_llvm(sumofsins2, (Int,))`. In that IR code, the `sin(3.4)` has been constant-folded to `0xBFD05AC910FF4C6C`. As of Julia 0.3.4, the new LLVM code shows that the call to `sin` is no longer being constant-folded. I haven't checked the generated code on master, but it sounds like the performance regression is still present there.
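The blog post's function is not quoted in this thread; judging from the IR earlier (a loop accumulating `sin(3.4)` into a Float64), it is presumably along these lines (a reconstruction; names assumed):

```julia
# Reconstruction of the type-stable example from the blog post: r is
# initialized as a Float64, so the loop accumulates without boxing.
function sumofsins2(n::Int)
    r = 0.0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

sumofsins2(100_000)
```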