JuliaLang / julia

The Julia Programming Language
https://julialang.org/

performance penalty with >(Int(1)) in sum(>(1), array) #49696

Open lmiq opened 1 year ago

lmiq commented 1 year ago

There is a performance penalty when using >(1) instead of a plain anonymous function here:

julia> using BenchmarkTools

julia> mat = randn(1000, 1000);

julia> @btime sum(>(1), $mat)
  557.740 μs (0 allocations: 0 bytes)
158600

julia> @btime sum(x -> x > 1, $mat)
  303.875 μs (0 allocations: 0 bytes)
158600

The penalty occurs only for an Int64 bound value; if >(1.0) is used instead, the good performance is recovered:

julia> @btime sum(>(1.0), $mat)
  276.998 μs (0 allocations: 0 bytes)
158862

This happens at least with Julia 1.9-rc3 and 1.8.5 (the example uses a matrix as input, but the same occurs with a vector).

(followup from: https://discourse.julialang.org/t/performance-penaly-of-1-vs-x-x-1/98547)
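For context, `>(1)` is sugar for `Base.Fix2(>, 1)`, a callable struct that stores the bound argument as a field. A minimal sketch of the equivalent (`MyFix2` is a hypothetical name for illustration, not Base's type):

```julia
# Sketch of what >(1) constructs: a Fix2-like callable struct that
# stores the bound comparison value as a runtime field (here an Int).
struct MyFix2{F,T}
    f::F
    x::T
end
(g::MyFix2)(y) = g.f(y, g.x)

gt1 = MyFix2(>, 1)        # analogous to >(1), i.e. Base.Fix2(>, 1)
@assert gt1(2.0)          # Float64 compared against the stored Int
@assert !gt1(0.5)
```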

lmiq commented 1 year ago

This is probably associated with the complexity of the code generated for >(1) when the argument is a float and the comparison value is an integer, which is quite different from what happens when the types are the same:

julia> @code_llvm 1>(1)
;  @ operators.jl:369 within `>`
define i8 @"julia_>_916"(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:83 within `<`
   %2 = icmp slt i64 %1, %0
; └
  %3 = zext i1 %2 to i8
  ret i8 %3
}

julia> @code_llvm 1.0>(1)
;  @ operators.jl:369 within `>`
define i8 @"julia_>_920"(double %0, i64 signext %1) #0 {
top:
; ┌ @ float.jl:576 within `<`
; │┌ @ float.jl:159 within `Float64`
    %2 = sitofp i64 %1 to double
; │└
; │ @ float.jl:577 within `<` @ float.jl:535
   %3 = fcmp olt double %2, %0
; │ @ float.jl:577 within `<`
; │┌ @ float.jl:533 within `==`
    %4 = fcmp oeq double %2, %0
    %5 = fcmp oeq double %2, 0x43E0000000000000
; │└
; │┌ @ float.jl:335 within `unsafe_trunc`
    %6 = fptosi double %2 to i64
    %7 = freeze i64 %6
; │└
; │ @ float.jl:577 within `<` @ int.jl:83
   %8 = icmp sgt i64 %7, %1
; │ @ float.jl:577 within `<`
; │┌ @ bool.jl:39 within `|`
    %9 = or i1 %5, %8
; │└
; │┌ @ bool.jl:38 within `&`
    %10 = and i1 %4, %9
; │└
; │┌ @ bool.jl:39 within `|`
    %11 = or i1 %3, %10
; └└
  %12 = zext i1 %11 to i8
  ret i8 %12
}

which is strikingly different from what this is lowered to:

julia> f(x) = x > 1
f (generic function with 1 method)

julia> @code_llvm f(1.0)
;  @ REPL[41]:1 within `f`
define i8 @julia_f_1050(double %0) #0 {
top:
; ┌ @ operators.jl:369 within `>`
; │┌ @ float.jl:577 within `<` @ float.jl:535
    %1 = fcmp ogt double %0, 1.000000e+00
; └└
  %2 = zext i1 %1 to i8
  ret i8 %2
}
JeffBezanson commented 1 year ago

Correct, x->x>1 is specialized on the constant 1, while in >(1) the 1 remains an argument, stored in the Fix2 object. I agree this is pretty unfortunate and unexpected. A great example of the problems with performance models of functional programming... I don't have an easy answer.
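As an aside, a hypothetical sketch (not anything Base does) of how the bound value could be specialized on again: lifting it into the type domain as a type parameter makes the compiler see it as a constant, like the literal in x -> x > 1:

```julia
# Hypothetical sketch: a Fix2-like callable that carries the bound
# value in its *type* rather than as a field, so the compiler can
# treat it as a compile-time constant again.
struct GtConst{V} end
(::GtConst{V})(y) where {V} = y > V

gt = GtConst{1}()         # the 1 is a type parameter, not runtime data
@assert gt(2.0)
@assert !gt(0.5)
```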

lmiq commented 1 year ago

Could that be handled if 1 is a constant in the local calling scope? I.e.:

julia> f1(mat) = sum(>(1), mat)
f1 (generic function with 1 method)

julia> f2(mat) = sum(x -> x > 1, mat)
f2 (generic function with 1 method)

julia> @btime f1($mat)
  238.681 μs (0 allocations: 0 bytes)
159024

julia> @btime f2($mat)
  77.437 μs (0 allocations: 0 bytes)
159024
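A workaround sketch (my suggestion, not from the thread): convert the bound value to the array's element type up front, so the comparison inside the reduction is Float64-vs-Float64 and no per-element conversion is needed. The helper name `sum_gt` is made up for illustration:

```julia
# Hedged workaround: bind a comparison value that already matches
# the element type before handing it to Fix2 via >(...).
sum_gt(mat, v) = sum(>(convert(eltype(mat), v)), mat)

mat = [0.5 1.5; 2.5 -1.0]
@assert sum_gt(mat, 1) == sum(x -> x > 1, mat) == 2
```

Note this changes semantics for integers not exactly representable as Float64, which is precisely the case the extra branches in the mixed-type `<` guard against.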
Seelengrab commented 8 months ago

Correct, x->x>1 is specialized on the constant 1, while in >(1) the 1 remains an argument, stored in the Fix2 object.

I don't think that's the difference; otherwise the >(1.0) version would be slow too, but it's just as fast as x -> x > 1 and x -> x > 1.0!

julia> @btime f1($mat)
  98.850 μs (0 allocations: 0 bytes)
158641

julia> @btime f2($mat)
  65.940 μs (0 allocations: 0 bytes)
158641

julia> f3(mat) = sum(>(1.0), mat)
f3 (generic function with 1 method)

julia> f4(mat) = sum(x -> x > 1.0, mat)
f4 (generic function with 1 method)

julia> @btime f3($mat)
  64.769 μs (0 allocations: 0 bytes)
158641

julia> @btime f4($mat)
  65.779 μs (0 allocations: 0 bytes)
158641

Note that f3 & f4 have the same performance - so IMO there's more going on here than just callable structs being inherently slower.

MasonProtter commented 8 months ago

Copying my comment from Zulip:

I don't think that's correct. What's happening here is that when you call (x -> x > 1)(::Float64), it specializes at compile time and realizes that the 1 can be replaced with a 1.0, which makes the conversion quicker.

However, if you have a non-inlined (>(x::Int))(::Float64), then the compiler can't do the conversion at compile time, so for every comparison it has to take x and turn it into a Float64.

Here's a simple demo replicating that by using a type asserted global:

julia> x::Int = 1
1

julia> @btime sum(y -> y > x, $mat);
  237.850 μs (0 allocations: 0 bytes)

julia> @btime sum(>(x), $mat);
  238.181 μs (0 allocations: 0 bytes)

In this situation both functions have the same performance because neither can convert to Float64 at compile time.

Seelengrab commented 8 months ago

To transfer what I've shared on Zulip: with this function:

julia> code_llvm(Tuple{Matrix{Float64}}; debuginfo=:none) do mat
                  sum(>(1.0), mat)
              end

I do see a constant 1.0 compare in LLVM IR on master:

  %26 = fcmp ogt <8 x double> %wide.load, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %27 = fcmp ogt <8 x double> %wide.load108, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %28 = fcmp ogt <8 x double> %wide.load109, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %29 = fcmp ogt <8 x double> %wide.load110, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>

I don't see the same with >(1). Since both of those are callable structs, I suspect it's not the fact that >(1) is a callable struct that causes the performance penalty relative to x -> x > 1.
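That distinction can be checked directly (a small illustrative check, relying on `Base.Fix2`'s documented `x` field): the zero-capture anonymous function is a singleton struct with the literal baked into its method body, while `Fix2` carries the value as runtime data:

```julia
gt_fix  = >(1)            # Base.Fix2{typeof(>), Int64}
gt_anon = x -> x > 1      # closure with no captures: a singleton struct

@assert gt_fix.x == 1                          # bound value stored as data
@assert isempty(fieldnames(typeof(gt_anon)))   # literal 1 lives in the code
@assert gt_fix(2.0) == gt_anon(2.0) == true
```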