Open lmiq opened 1 year ago
This is probably associated with the complexity of the lowered `>(1)` when the argument is a float and the comparison is against an integer, which is quite different from what happens when the types are the same:
julia> @code_llvm 1>(1)
; @ operators.jl:369 within `>`
define i8 @"julia_>_916"(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:83 within `<`
%2 = icmp slt i64 %1, %0
; └
%3 = zext i1 %2 to i8
ret i8 %3
}
julia> @code_llvm 1.0>(1)
; @ operators.jl:369 within `>`
define i8 @"julia_>_920"(double %0, i64 signext %1) #0 {
top:
; ┌ @ float.jl:576 within `<`
; │┌ @ float.jl:159 within `Float64`
%2 = sitofp i64 %1 to double
; │└
; │ @ float.jl:577 within `<` @ float.jl:535
%3 = fcmp olt double %2, %0
; │ @ float.jl:577 within `<`
; │┌ @ float.jl:533 within `==`
%4 = fcmp oeq double %2, %0
%5 = fcmp oeq double %2, 0x43E0000000000000
; │└
; │┌ @ float.jl:335 within `unsafe_trunc`
%6 = fptosi double %2 to i64
%7 = freeze i64 %6
; │└
; │ @ float.jl:577 within `<` @ int.jl:83
%8 = icmp sgt i64 %7, %1
; │ @ float.jl:577 within `<`
; │┌ @ bool.jl:39 within `|`
%9 = or i1 %5, %8
; │└
; │┌ @ bool.jl:38 within `&`
%10 = and i1 %4, %9
; │└
; │┌ @ bool.jl:39 within `|`
%11 = or i1 %3, %10
; └└
%12 = zext i1 %11 to i8
ret i8 %12
}
which is strikingly different from what this is lowered to:
julia> f(x) = x > 1
f (generic function with 1 method)
julia> @code_llvm f(1.0)
; @ REPL[41]:1 within `f`
define i8 @julia_f_1050(double %0) #0 {
top:
; ┌ @ operators.jl:369 within `>`
; │┌ @ float.jl:577 within `<` @ float.jl:535
%1 = fcmp ogt double %0, 1.000000e+00
; └└
%2 = zext i1 %1 to i8
ret i8 %2
}
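For context, the extra `fcmp`/`fptosi` sequence in the mixed-type version above is what makes Julia's `Float64`/`Int64` comparison exact even for integers that `Float64` cannot represent; naively converting the `Int64` to `Float64` first would give wrong answers. A small sketch of why:

```julia
# 2^53 + 1 is the first Int64 that Float64 cannot represent exactly:
# it rounds down to 2.0^53 on conversion.
big = 2^53 + 1
Float64(big) == 2.0^53   # true: the +1 is lost in the conversion
2.0^53 < Float64(big)    # false: naive convert-then-compare is wrong
2.0^53 < big             # true: Julia's mixed comparison is exact
```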
Correct, `x -> x > 1` is specialized on the constant `1`, while in `>(1)` the `1` remains an argument, stored in the `Fix2` object. I agree this is pretty unfortunate and unexpected. A great example of the problems with performance models of functional programming... I don't have an easy answer.
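For reference, `>(1)` is just sugar for constructing a `Base.Fix2` callable struct whose second operand is stored as a runtime field, so the `1` is an ordinary `Int` value rather than a compile-time constant. A quick sketch:

```julia
cmp = >(1)             # constructs Base.Fix2(>, 1)
cmp isa Base.Fix2      # true
cmp.x                  # the stored operand: 1, an Int field, not a constant
cmp(2.0)               # calls 2.0 > 1
```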
Could that be handled if the `1` is constant in the local calling scope? I.e.:
julia> f1(mat) = sum(>(1), mat)
f1 (generic function with 1 method)
julia> f2(mat) = sum(x -> x > 1, mat)
f2 (generic function with 1 method)
julia> @btime f1($mat)
238.681 μs (0 allocations: 0 bytes)
159024
julia> @btime f2($mat)
77.437 μs (0 allocations: 0 bytes)
159024
Correct, `x -> x > 1` is specialized on the constant `1`, while in `>(1)` the `1` remains an argument, stored in the `Fix2` object.
I don't think that's the difference, otherwise the `>(1.0)` version would be slow too, but it's just as fast as `x -> x > 1` and `x -> x > 1.0`!
julia> @btime f1($mat)
98.850 μs (0 allocations: 0 bytes)
158641
julia> @btime f2($mat)
65.940 μs (0 allocations: 0 bytes)
158641
julia> f3(mat) = sum(>(1.0), mat)
f3 (generic function with 1 method)
julia> f4(mat) = sum(x -> x > 1.0, mat)
f4 (generic function with 1 method)
julia> @btime f3($mat)
64.769 μs (0 allocations: 0 bytes)
158641
julia> @btime f4($mat)
65.779 μs (0 allocations: 0 bytes)
158641
Note that f3 & f4 have the same performance - so IMO there's more going on here than just callable structs being inherently slower.
Copying my comment from Zulip:
I don't think that's correct. What's happening here is that when you call `(x -> x > 1)(::Float64)`, it specializes at compile time and realizes that the `1` can be replaced with a `1.0`, which makes the conversion quicker. However, if you have a non-inlined `(>(x::Int))(::Float64)`, then the compiler can't do the conversion at compile time, so for every comparison it has to take `x` and turn it into a `Float64`.

Here's a simple demo replicating that by using a type-asserted global:
julia> x::Int = 1
1

julia> @btime sum(y -> y > x, $mat);
  237.850 μs (0 allocations: 0 bytes)

julia> @btime sum(>(x), $mat);
  238.181 μs (0 allocations: 0 bytes)
In this situation both functions have the same performance, because neither can do the conversion to `Float64` at compile time.
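One way to see the per-element conversion directly is to inspect the code generated for the `Fix2` call itself (a sketch; `code_llvm` lives in InteractiveUtils, which the REPL loads by default):

```julia
using InteractiveUtils  # for code_llvm outside the REPL

# Int operand stored in the Fix2: expect a sitofp (Int64 -> Float64)
# conversion in the emitted IR on every call:
code_llvm(Base.Fix2(>, 1), Tuple{Float64})

# Float64 operand: the body should be a single fcmp, no conversion:
code_llvm(Base.Fix2(>, 1.0), Tuple{Float64})
```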
To transfer what I've shared on Zulip: with this function:
julia> code_llvm(Tuple{Matrix{Float64}}; debuginfo=:none) do mat
sum(>(1.0), mat)
end
I do see a constant `1.0` compare in the LLVM IR on master:
%26 = fcmp ogt <8 x double> %wide.load, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
%27 = fcmp ogt <8 x double> %wide.load108, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
%28 = fcmp ogt <8 x double> %wide.load109, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
%29 = fcmp ogt <8 x double> %wide.load110, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
While I don't see the same with `>(1)`. Since both of those are callable structs, I suspect it's not the fact that `>(1)` is a callable struct that leads to the performance penalty relative to `x -> x > 1`.
There is a performance penalty in using `>(1)` instead of a plain anonymous function here:

The penalty occurs only for `Int64`; if `>(1.0)` is used, the good performance is recovered:

This happens with Julia 1.9-rc3 or 1.8.5, at least (the example has a matrix as input, but it is the same with a vector).
(followup from: https://discourse.julialang.org/t/performance-penaly-of-1-vs-x-x-1/98547)
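For anyone reproducing: the thread never shows how `mat` was built, so the setup below is purely an assumption (a 1000×1000 standard-normal matrix is consistent with the reported counts of roughly 158,600 elements above 1 out of 10^6):

```julia
using BenchmarkTools

# Hypothetical setup: the original `mat` is not shown in the issue.
mat = randn(1000, 1000)

f1(mat) = sum(>(1), mat)        # Fix2 with an Int operand: converted per element
f2(mat) = sum(x -> x > 1, mat)  # literal 1 can be constant-folded to 1.0

@btime f1($mat)
@btime f2($mat)
```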