JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Constant Folding openlibm functions #9942

Open jwmerrill opened 9 years ago

jwmerrill commented 9 years ago

https://groups.google.com/forum/#!topic/julia-users/Jndl9sYwj5Q reports a performance regression in some simple code from a blog post that was meant to illustrate the importance of type stability: http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/

The code is

function sumofsins2(n::Integer)  
    r = 0.0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end
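For readers following along without a Julia install, here is a minimal Python analogue of the function above (the port and its name are mine, not from the thread). The point is that `sin(3.4)` is loop-invariant, so a compiler that constant-folds it reduces the loop body to adding one precomputed constant:

```python
import math

def sumofsins2(n):
    """Python analogue of the Julia sumofsins2: adds sin(3.4) n times."""
    r = 0.0
    for _ in range(n):
        r += math.sin(3.4)
    return r

# Because the addend never changes, the result equals n * sin(3.4)
# up to floating-point rounding in the accumulation.
print(sumofsins2(10))
```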

and the blog post gives the output (as of Julia 0.2.x) of code_llvm(sumofsins2, (Int,))

define double @julia_sumofsins21068(i64) {  
top:  
  %1 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %1, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]  
  %r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]  
  %2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156  
  %3 = add i64 %"#s6.04", 1, !dbg !5156  
  %4 = icmp sgt i64 %3, %0, !dbg !5151  
  br i1 %4, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]  
  ret double %r.0.lcssa, !dbg !5157  
} 

In this IR code, the sin(3.4) has been constant folded to 0xBFD05AC910FF4C6C.
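That hex literal is the raw IEEE-754 bit pattern of the folded result. A quick Python check (a sketch of mine, not part of the issue) decodes it and compares against `sin(3.4)`; the values should agree up to the last ulp of whichever libm did the folding:

```python
import math
import struct

# 0xBFD05AC910FF4C6C is the 64-bit pattern LLVM emitted for sin(3.4).
bits = 0xBFD05AC910FF4C6C

# Reinterpret the integer bit pattern as a double.
folded = struct.unpack('<d', struct.pack('<Q', bits))[0]

print(folded)          # ~ -0.2555411...
print(math.sin(3.4))   # should match to within 1 ulp
```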

As of Julia 0.3.4, the new LLVM code is

define double @"julia_sumofsins2;20064"(i64) {
top:
  %1 = icmp sgt i64 %0, 0, !dbg !841
  br i1 %1, label %L, label %L3, !dbg !841

L:                                                ; preds = %top, %pass
  %r.0 = phi double [ %6, %pass ], [ 0.000000e+00, %top ]
  %"#s119.0" = phi i64 [ %5, %pass ], [ 1, %top ]
  %2 = call double inttoptr (i64 4551878240 to double (double)*)(double 3.400000e+00), !dbg !842
  %3 = fcmp ord double %2, 0.000000e+00, !dbg !842
  br i1 %3, label %pass, label %fail, !dbg !842

fail:                                             ; preds = %L
  %4 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !842, !tbaa %jtbaa_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %4, i32 4), !dbg !842
  unreachable, !dbg !842

pass:                                             ; preds = %L
  %5 = add i64 %"#s119.0", 1, !dbg !841
  %6 = fadd double %r.0, %2, !dbg !842
  %7 = icmp eq i64 %"#s119.0", %0, !dbg !842
  br i1 %7, label %L3, label %L, !dbg !842

L3:                                               ; preds = %pass, %top
  %r.1 = phi double [ 0.000000e+00, %top ], [ %6, %pass ]
  ret double %r.1, !dbg !845
}

so it looks like the call to sin is no longer being constant folded.

I haven't checked the generated code on master, but it sounds like the performance regression is still present there.

simonster commented 9 years ago

On 0.2.1:

julia> code_llvm(sin, (Float64,))

define double @julia_sin(double) {
top:
  %1 = call double @sin(double %0), !dbg !3342
  %2 = fcmp ord double %1, 0.000000e+00, !dbg !3342
  %3 = fcmp uno double %0, 0.000000e+00, !dbg !3342
  %4 = or i1 %2, %3, !dbg !3342
  br i1 %4, label %pass, label %fail, !dbg !3342

fail:                                             ; preds = %top
  %5 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !3342
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 282), !dbg !3342
  unreachable, !dbg !3342

pass:                                             ; preds = %top
  ret double %1, !dbg !3342
}

On master:

define double @julia_sin_43255(double) {
top:
  %1 = call double inttoptr (i64 4668253248 to double (double)*)(double %0), !dbg !265
  %2 = fcmp ord double %1, 0.000000e+00, !dbg !265
  %3 = fcmp uno double %0, 0.000000e+00, !dbg !265
  %4 = or i1 %2, %3, !dbg !265
  br i1 %4, label %pass, label %fail, !dbg !265

fail:                                             ; preds = %top
  %5 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !265, !tbaa %jtbaa_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 123), !dbg !265
  unreachable, !dbg !265

pass:                                             ; preds = %top
  ret double %1, !dbg !265
}

I'd guess that for LLVM to constant fold sin, it has to know that it's named sin, and somewhere along the line (precompilation?) we stopped giving it that information.

vtjnash commented 9 years ago

Correct. We now force LLVM to use the sin in libopenlibm, rather than giving it the freedom to pick any function named sin.

dhoegh commented 9 years ago

I do not get the performance benefit from constant folding either; here is the output of code_llvm(sumofsins2, (Int,)).

julia> code_llvm(sumofsins2, (Int,))

define double @julia_sumofsins2_1172(i64) {
top:
  %1 = icmp sgt i64 %0, 0, !dbg !3561
  br i1 %1, label %L, label %L3, !dbg !3561

L:                                                ; preds = %top, %pass
  %r.0 = phi double [ %6, %pass ], [ 0.000000e+00, %top ]
  %"#s3.0" = phi i64 [ %5, %pass ], [ 1, %top ]
  %2 = call double inttoptr (i64 1752503104 to double (double)*)(double 3.400000e+00), !dbg !3562
  %3 = fcmp ord double %2, 0.000000e+00, !dbg !3562
  br i1 %3, label %pass, label %fail, !dbg !3562

fail:                                             ; preds = %L
  %4 = load %jl_value_t** @jl_domain_exception, align 8, !dbg !3562, !tbaa %jtbaa_const
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %4, i32 4), !dbg !3562
  unreachable, !dbg !3562

pass:                                             ; preds = %L
  %5 = add i64 %"#s3.0", 1, !dbg !3561
  %6 = fadd double %r.0, %2, !dbg !3562
  %7 = icmp eq i64 %"#s3.0", %0, !dbg !3562
  br i1 %7, label %L3, label %L, !dbg !3562

L3:                                               ; preds = %pass, %top
  %r.1 = phi double [ 0.000000e+00, %top ], [ %6, %pass ]
  ret double %r.1, !dbg !3565
}

My version info is:

julia> versioninfo()
Julia Version 0.4.0-dev+2847
Commit fc61385 (2015-01-21 18:34 UTC)
Platform Info:
  System: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

eschnett commented 9 years ago

LLVM defines intrinsics for many math operations (sqrt, sin, etc.). It seems we need to generate these intrinsics instead of libcalls to benefit from LLVM's optimizations. Does this sound reasonable? It looks straightforward but tedious.

simonster commented 9 years ago

We would also need to make the LLVM intrinsics call openlibm instead of the system libm.

eschnett commented 9 years ago

Can you remind me why we need to call openlibm? This prevents not only constant folding, but also using hardware instructions if they are available, e.g. vsqrtpd on Intel.

LLVM tries to intercept libcalls to certain well-known functions, such as sqrt in the global namespace. I assume (couldn't find the logic yet) that it thus also replaces intrinsics by such libcalls if necessary. To ensure it calls openlibm instead of libm, we would need to declare these openlibm functions to LLVM's code generators.

eschnett commented 9 years ago

Regarding making LLVM call openlibm instead of libm: if Julia is configured with UNTRUSTED_SYSTEM_LIBM, then it uses openlibm instead of libm -- and this includes code generated by LLVM.

What am I missing?

Given this, adding new intrinsics is easy. I added the sin intrinsic for fun -- it's three lines in src/intrinsics.cpp, plus the corresponding change in base/math.jl.

eschnett commented 9 years ago

See https://github.com/eschnett/julia/tree/math-intrinsics for the prototype implementation.

eschnett commented 9 years ago

I looked at http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/ again with this branch, and there is unfortunately no speedup. LLVM correctly recognizes the sin intrinsic, but it still doesn't constant-fold. For reference, here is the inner loop as generated on my system:

L78:    vmovsd  %xmm0, -24(%rbp)
Source line: 12
    vmovsd  -32(%rbp), %xmm0
    callq   *%r14
    vucomisd    %xmm0, %xmm0
    jp  L132
    vmovsd  -24(%rbp), %xmm1
    vaddsd  %xmm0, %xmm1, %xmm1
    vmovaps %xmm1, %xmm0
    decq    %rbx
    jne L78

The call etc. is still there. Thus, to enable constant-folding, it is likely necessary to address #5234.

eschnett commented 9 years ago

Update: No speedup when building against LLVM 3.3 (the default), but with LLVM 3.6, the speedup is restored. Apparently LLVM 3.3 still lacks certain optimizations for sin.

stevengj commented 8 years ago

Note that #14324 is also relevant here — this issue is not really specific to libm functions and LLVM intrinsics.

KristofferC commented 8 years ago

Any idea how to make progress on this? This feels a bit sad (latest master):

f(x) = cos(x) + cos(x)

function f2(x)
    c = cos(x)
    c + c
end

@time for i = 1:10^6 f(2.0) end
#  0.021374 seconds

@time for i = 1:10^6 f2(2.0) end
#  0.010027 seconds
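The same transformation can be written out in any language; here is a Python sketch (names and port mine) of the manual common-subexpression elimination that f2 performs by hand, and which the compiler is not doing automatically for f:

```python
import math

def f(x):
    # Evaluates cos(x) twice; a compiler doing CSE on pure calls
    # would compute it only once.
    return math.cos(x) + math.cos(x)

def f2(x):
    # Manual common-subexpression elimination: hoist the shared
    # subexpression into a local and reuse it.
    c = math.cos(x)
    return c + c

# Both versions compute the identical value; only the number of
# cos evaluations differs.
print(f(2.0), f2(2.0))
```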

stevengj commented 8 years ago

@KristofferC, that particular case would involve CSE (common subexpression elimination) for @pure functions, which is a bit different than the issue here.

oscardssmith commented 3 years ago

At this point, we only use libm for expm1, log2, log10, and some rounding functions. log2 and log10 both could probably be moved to pure Julia pretty easily. I'll look into it.