JuliaArrays / AxisArrays.jl

Performant arrays where each dimension can have a named axis with values
http://JuliaArrays.github.io/AxisArrays.jl/latest/
Other
200 stars 42 forks source link

Faster indexing with Categorical axes #10

Open tshort opened 9 years ago

tshort commented 9 years ago

With named columns, indexing based on a symbol slows down tight loops. DataFrames has the same problem. It's tough to get fast, type-stable code for this. Here is an example:

using AxisArrays

N = 10_000_000
A = AxisArray(rand(10000000,2), (1:N,[:a,:b]));

function f(x, j)
    res = 0.0
    for i in 1:size(x,1)
        res += x[i,j]
    end
    res
end

@time f(A, 2)  #  elapsed time: 0.028395385 seconds (117 kB allocated)

@time f(A, :b) #  elapsed time: 0.147722792 seconds (364 kB allocated)

I tried using types as column names to speed things up, but the resulting code isn't type stable, and it's even slower. Here's my attempt:


## Bunch of kludgy stuff
AxisArrays.checkaxis{T}(::Type{Val{T}}) = true
stagedfunction AxisArrays.axisindexes{T,S}(::Type{Val{T}}, ::Type{Val{S}})
    i = findfirst(T, S)
    :($i)
end
Base.length{T}(::Type{Val{T}}) = length(T)
Base.size{T}(::Type{Val{T}}) = size(T)
Base.size{T}(::Type{Val{T}},i) = size(T,i)

B = AxisArray{Float64,2,Array{Float64,2},(:row,:col),(UnitRange{Int64},DataType)}(rand(N,2), (1:N,Val{(:a,:b)}));

@time f(B, Val{:b}) # elapsed time: 0.982083383 seconds (461 MB allocated, 3.26% gc time in 20 pauses with 0 full sweep)
@time f(B, 2)       # elapsed time: 0.019993164 seconds (88 kB allocated)
tshort commented 9 years ago

I'm having trouble figuring out why the code above that uses Val{} is not type stable. The problem is with axisindexes used inside getindex:

julia> @code_warntype B[1, Val{:b}]
Variables:
  A::AxisArrays.AxisArray{Float64,2,Array{Float64,2},(:row,:col),(UnitRange{Int64},DataType)}
  idxs::(Int64,Type{Val{:b}})

Body:
  begin  # /home/tshort/.julia/v0.4/AxisArrays/src/indexing.jl, line 135:
      GenSym(1) = (top(tupleref))(idxs::(Int64,Type{Val{:b}}),1)::Int64
      GenSym(0) = (top(tupleref))(idxs::(Int64,Type{Val{:b}}),2)::Type{Val{:b}}
      return getindex(A::AxisArrays.AxisArray{Float64,2,Array{Float64,2},(:row,:col),(UnitRange{Int64},DataType)},GenSym(1),
        axisindexes((top(tupleref))((top(getfield))(A::AxisArrays.AxisArray{Float64,2,Array{Float64,2},(:row,:col),(UnitRange{Int64},DataType)},:axes)::(UnitRange{Int64},DataType),2)::DataType,GenSym(0))::Any)::Any
  end::Any

But, if I call axisindexes directly, it's type stable:

julia> @code_warntype AxisArrays.axisindexes(Val{(:a,:b)}, Val{:b})
Variables:
  #s60::Type{Val{(:a,:b)}}
  #s59::Type{Val{:b}}

Body:
  begin 
      return 2
  end::Int64
mbauman commented 9 years ago

Yeah, this is tough. Since the axis values aren't in the type domain, we can't do the same sort of clever compile-time lookup like we can with the axis names. As the code currently stands the slowdown doesn't come from a type-instability, it just comes from searching for the column whose name matches :b in every single loop. We have an advantage over DataFrames in that all columns will always be of the same type, so even though we don't know which column will be selected, we know that the result will be a Vector{Float64}.

This gets exponentially worse as you add more columns, too, so I agree there needs to be a solution here. One option would be to simply expose a nicer interface to AxisArrays.axisindexes(A.axes[2], :b), and strongly recommend that in tight loops the user hoist the lookup out of the loop.

A slightly more attractive option would be to use custom cartesian iterators. I think this f could be expressed like this:

function f(x, j)
    res = 0.0
    for i in eachindex(x, Axis{:col}(:b)) # == eachindex(x[Axis{:col}(:b)]) == eachindex(x[:,:b])
        res += x[i]
    end
    res
end

This could be written to be type-stable, and the column lookup would occur at iterator construction before the loop.

It'd also be interesting to see why Julia thinks this value can change, and maybe see if we can get the compiler to do the hoisting… but I'm afraid that this computation is a few orders of magnitude too complicated for that.

mbauman commented 9 years ago

The type instability in your Val code is because you're not fully putting the axis values into the parameters of B - the parameter is simply DataType. Try using

B = AxisArray{Float64,2,Array{Float64,2},(:row,:col),(UnitRange{Int64},Type{Val{(:a,:b)}})}(rand(N,2), (1:N,Val{(:a,:b)}))

Julia still isn't able to fully inline the result of the computation, though. I'm not sure why. It may just be that this is too complicated for inference to follow in its limited number of passes.

kmsquire commented 5 years ago

This is quite old, but recent timings of the original code aren't quite as bad (2x vs 5x)

using BenchmarkTools

using AxisArrays

N = 10_000_000
A = AxisArray(rand(10000000,2), (1:N,[:a,:b]));

function f(x, j)
    res = 0.0
    for i in 1:size(x,1)
        res += x[i,j]
    end
    res
end

julia> @btime f(A,2)
  9.891 ms (1 allocation: 16 bytes)
4.998882741998753e6

julia> @btime f(A,:b)
  18.286 ms (1 allocation: 16 bytes)
4.998882741998753e6

Beyond encoding the column names as part of the type, is there much else to be done? Could constant propagation inline :b into the code (or is it doing so already)?