JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

specialize loops inside functions #5654

Open johnmyleswhite opened 10 years ago

johnmyleswhite commented 10 years ago

While trying to optimize the performance of operations on DataFrames, I've found that a recurring pattern underlies many of the performance bottlenecks people encounter: functions that take DataFrames as input don't have enough type information to specialize their implementations for the specific types of the columns stored in a DataFrame. This means that a two-part function which (a) selects a subset of columns to compute on and then (b) performs an intensive computation on those columns can't take advantage of the type information available at the end of step (a) to improve the performance of step (b).

It's possible to articulate this pattern without any reference to DataFrames: you get the exact same effect if you try pulling whole columns out of a Vector{Any} and then computing on them. The only workaround I've found is to use an inner function defined inside the two-part function. This inner function defers type specialization until a point at which it can be carried out effectively.

The three examples below highlight this problem:

function dot1(vals::Vector{Any})
    a, b = vals[1], vals[2]
    x = 0.0
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

function dot2(a::Vector, b::Vector)
    x = 0.0
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

function dot3(vals::Vector{Any})
    a, b = vals[1], vals[2]
    function inner(a, b)
        x = 0.0
        for i in 1:length(a)
            x += a[i] * b[i]
        end
        return x
    end
    return inner(a, b)
end

If you run these functions with the following data, you get very substantial performance differences between the initial naive implementation and the other two:

n = 1_000_000
a, b = randn(n), randn(n)
vals = {a, b}
dot1(vals)
dot2(a, b)
dot3(vals)

@elapsed dot1(vals)
@elapsed dot2(a, b)
@elapsed dot3(vals)

On my machine, I get relative timings of the following orders of magnitude:

It would be a huge gain to future users of DataFrames if there were a way to make the compiler recognize that a little bit of second-stage type inference could give major performance gains.
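One way to see the missing type information (an addition to the original post, assuming the `dot1`/`dot2` definitions and data above) is `@code_warntype`, which highlights locals that inference can only type as `Any`:

```julia
# In dot1, the locals a and b (and hence x after the loop) are
# inferred as Any, so every getindex, *, and + becomes a dynamic
# dispatch; in dot2 they are concrete Vector{Float64} / Float64.
@code_warntype dot1(vals)
@code_warntype dot2(a, b)
```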

cdsousa commented 10 years ago

The dot3(a, b) warm-up must be dot3(vals). On my system dot3 is as performant as dot2; maybe you got that time for dot3 because it wasn't correctly warmed up.

johnmyleswhite commented 10 years ago

You're right. Thanks for catching that!

quinnj commented 10 years ago

Superseded by #7311.

simonster commented 10 years ago

@quinnj I'm not sure how staged functions solve this. Can you elaborate?

JeffBezanson commented 10 years ago

Yeah, this is not the same issue.

quinnj commented 10 years ago

Sorry. I thought that

if there were a way to make the compiler recognize that a little bit of second-stage type inference could give major performance gains

was exactly what staged functions were getting us.

JeffBezanson commented 10 years ago

The work item here is to identify poorly-typed loops, and carve them out into separate functions that can be specialized at run time. It's an extra compiler optimization pass.

timholy commented 10 years ago

@quinnj, a staged function will receive that ::Vector{Any} as its input, so it won't have any more type information than the parent.
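A minimal sketch (not from the thread; `gdot` is a hypothetical name) of Tim's point, in the modern `@generated` spelling of staged functions: at generation time the function sees only the declared argument type, never the runtime element types.

```julia
# At generation time, the name `vals` is bound to the argument's
# *type* -- Vector{Any} -- which is all we ever get to specialize on.
@generated function gdot(vals::Vector{Any})
    @assert vals === Vector{Any}
    return quote
        a, b = vals[1], vals[2]    # still inferred as Any in the body
        x = 0.0
        for i in 1:length(a)
            x += a[i] * b[i]
        end
        x
    end
end
```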

Keno commented 10 years ago

While that is true, it can actually throw an error in that case, in which case it will be called again at runtime with the correct types. That's part of the magic :).

timholy commented 10 years ago

Not sure I understand. What if I explicitly feed it an Any[[1,2,3],[1.0,2.0,3.0]]? At no point will you get good type information about this object.

Now, if you wrote it as

function dot4{T}(vals::T)
...

and passed it ([1,2,3],[1.0,2.0,3.0]) then it would be a different story. But you don't need a staged function for that.
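A complete sketch of that dot4, filling in the elided body for illustration (the thread predates the `where` syntax, so this uses the modern spelling): a tuple type records the concrete type of each element, so the loop specializes without any staged function.

```julia
# A Tuple{Vector{Int64},Vector{Float64}} argument carries the
# concrete type of each element, unlike Vector{Any}, so inference
# can specialize the loop body for each column's element type.
function dot4(vals::Tuple{AbstractVector,AbstractVector})
    a, b = vals
    x = 0.0
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

dot4(([1, 2, 3], [1.0, 2.0, 3.0]))  # → 14.0
```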

Keno commented 10 years ago

I think the issue is a case where the runtime type information is there but the compile-time information isn't, so the loop won't be optimized. If there's never good type information, then we can't do anything anyway.
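A small sketch (not in the thread; `barrier` is a hypothetical name) of that distinction: inference sees `vals[1]` as `Any`, but the runtime object still carries its concrete type, and dispatching through a function barrier recovers it.

```julia
# Compile-time inference of vals[i] gives Any, but the runtime
# values know their concrete types; a call through a separate
# function dispatches on those runtime types.
vals = Any[[1, 2, 3], [1.0, 2.0, 3.0]]

barrier(a) = eltype(a)  # compiled fresh for each runtime type of a

typeof(vals[1])   # Vector{Int64}: known at runtime despite Any inference
barrier(vals[2])  # Float64
```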

barucden commented 2 years ago

Is this still an issue? On 1.7.2, I see this result on my machine:

julia> using BenchmarkTools

julia> n = 1_000_000
1000000

julia> a, b = randn(n), randn(n);

julia> vals = Any[a, b];

julia> function dot1(vals::Vector)
           a, b = vals[1], vals[2]
           x = 0.0
           for i in 1:length(a)
               x += a[i] * b[i]
           end
           return x
       end
dot1 (generic function with 1 method)

julia> function dot2(a::Vector, b::Vector)
           x = 0.0
           for i in 1:length(a)
               x += a[i] * b[i]
           end
           return x
       end
dot2 (generic function with 1 method)

julia> function dot3(vals::Vector)
           a, b = vals[1], vals[2]
           function inner(a, b)
               x = 0.0
               for i in 1:length(a)
                   x += a[i] * b[i]
               end
               return x
           end
           return inner(a, b)
       end
dot3 (generic function with 1 method)

julia> @btime dot1(vals);
  158.501 ms (6998980 allocations: 122.05 MiB)

julia> @btime dot2(a, b);
  1.528 ms (1 allocation: 16 bytes)

julia> @btime dot3(vals);
  1.529 ms (1 allocation: 16 bytes)

EDIT: Updated the measurement according to @Keno's remark. The issue still stands.

Keno commented 2 years ago

The issue is for vals = Any[a, b]. The {} syntax was removed many years ago; it corresponds to Any[].
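For readers running the original snippet today, the removed `{a, b}` vector literal can be translated as follows (a trivial sketch, not from the thread):

```julia
# The old {a, b} literal built a Vector{Any}; its modern
# equivalent is the typed array literal Any[a, b].
a, b = randn(3), randn(3)
vals = Any[a, b]
eltype(vals)  # Any
```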