JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia
Other
147 stars 25 forks source link

`Base.getindex` over allocating #104

Open m-wells opened 1 year ago

m-wells commented 1 year ago

I have a pull request that addresses this issue.

I came across this while developing my package LazyTables.jl

This should greatly improve performance of TypedTables (although for complex row by row operations LazyTables should still outperform).

As you can see for a simply indexing operation TypedTables makes 24 allocations when it only needs to make one.

(@v1.8) pkg> free TypedTables
   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [9d95f2ec] ~ TypedTables v1.4.1 `~/.julia/dev/TypedTables` ⇒ v1.4.1
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [9d95f2ec] ~ TypedTables v1.4.1 `~/.julia/dev/TypedTables` ⇒ v1.4.1

julia> using TypedTables
[ Info: Precompiling TypedTables [9d95f2ec-7b3d-5a63-8d20-e2491e220bb9]
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:301 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:305 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:309 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:313 declares type variable N but does not use it.
WARNING: method definition for broadcasted at /home/mark/.julia/packages/TypedTables/dycVq/src/DictTable.jl:317 declares type variable N but does not use it.

julia> table = Table(NamedTuple(Symbol(s) => rand(rand((Bool, Float32, Int, Float64)), 1000) for s in Char.(97:122)));

julia> table[3]; @time table[3];
  0.000011 seconds (24 allocations: 1.250 KiB)
(@v1.8) pkg> dev TypedTables
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.8/Project.toml`
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`

julia> using TypedTables

julia> table = Table(NamedTuple(Symbol(s) => rand(rand((Bool, Float32, Int, Float64)), 1000) for s in Char.(97:122)));

julia> table[3]; @time table[3];
  0.000004 seconds (1 allocation: 144 bytes)
m-wells commented 1 year ago

Pinging @quinnj and @andyferris for visibility.

andyferris commented 1 year ago

Thanks for noticing this and the contribution @m-wells

andyferris commented 1 year ago

@m-wells LazyTables seems really cool - some great ideas there. I've been wanting a "lazy" row for a long time but never got around to it.

To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve. If you really want "All the good of TypedTables.jl but FASTER and without as many allocations!" that sounds really great to me - I'd probably use it myself! :) All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype of Table would likely be worth it in the long run (we have #105 to fix anyway)). Does that interest you? (If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).

m-wells commented 1 year ago

@andyferris

To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve.

I understand that 100%.

All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype of Table would likely be worth it in the long run (we have https://github.com/JuliaData/TypedTables.jl/issues/105 to fix anyway)). Does that interest you?

Sure.

(If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).

If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl can go away. That is only needed because of the generation of NamedTuples for each iteration. If we implemented a TypedRow (analogous to how I implemented the LazyRow) then we could substantially simplify the backend. This wouldn't be needed anymore https://github.com/JuliaData/TypedTables.jl/blob/0b86a6a6ea3f9a871f9fc16f3de09538cef083c4/src/TypedTables.jl#L16-L43 as the element type would be fully inferrable from compile time just from the column types. (This would also make LazyTables redundant, which is okay with me.)

The definition of Table could become

struct Table{K, N, V<:Tuple{Vararg{AbstractArray{<:Any,N}}}} <: AbstractArray{TypedRow{K, N, V}, N}
    data::NamedTuple{K, V}
end

I have other ideas and none of them involve changing how users interface with Table, just backend stuff.

andyferris commented 1 year ago

If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl can go away

True that. And yes, I am willing.

The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames DataFrameRow in future). Do you think it's worth having an AbstractRow{names, T <: Tuple} type? And in general would Union{NamedTuple{names, T}, AbstractRow{names{T}} be usable for inserting rows, comparing rows (things like ==, isequal, isless and hash), and share the named tuple interface (meaning propertynames + getproperty)?

andyferris commented 1 year ago

In the meantime @m-wells I have added you as a maintainer. Welcome, and thank you! :)

Please free to merge #103. You are also welcome to make releases (following semver) - let me know if you aren't sure how that works. As a general development process it would be nice to follow a convention of having pull requests reviewed/approved, but if people are not available for please continue to make progress rather than getting stuck (use your judegement how long is too long).

m-wells commented 1 year ago

The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames DataFrameRow in future). Do you think it's worth having an AbstractRow{names, T <: Tuple} type? And in general would Union{NamedTuple{names, T}, AbstractRow{names{T}} be usable for inserting rows, comparing rows (things like ==, isequal, isless and hash), and share the named tuple interface (meaning propertynames + getproperty)?

I don't see a particular use case for an AbstractRow (there is already the Tables.Row struct, which is essentially what you are describing). In fact any Tables.jl object can be called as a Tables.columntable which would return the columns as a NamedTuple. This is all you need to make a typed table. In fact because TypedRow has to be defined before Table

struct TypedRow{K, V} <: Tables.AbstractRow
    data::NamedTuple{K, V}
    row::Int
end

struct Table{K, N, V} <: AbstractArray{TypedRow{K, V}, N}
    data::NamedTuple{K, V}
    # inner constructor to check equal length columns perhaps
end

TypedRow can be reused throughout TypedTables.jl although if the underlying table (like FlexTable) is not type stable it might make more sense to have a generic Row.

struct FlexRow{A<:AbstractArray} <: Tables.AbstractRow
    table::A
    row::Int
end

I haven't played with Dictionaries enough to mess with DictTable but it looks like it could also use the TypedRow.

anirudh2 commented 1 year ago

I'm tacking onto this issue since it's related, but please let me know if you'd prefer that I open a new issue.

I noticed that CI fails for julia 1.0.5 with @m-wells's awesome updates to Base.getindex. In particular, it fails on the inferred type checking of @test @inferred(mapreduce(s, (acc, row) -> acc + row.sum, t; init = 0.0)) === 18.0.

I locally modified the Base.getindex implementations as follows:

@inline function Base.getindex(t::Table{T}, i::Int) where {T}
    @boundscheck checkbounds(t, i)
    old = convert(T, map(col -> @inbounds(getindex(col, i)), columns(t)))::T
    new = T(@inbounds(getindex(col, i)) for col in columns(t))
    @assert old === new
    return new
end

@inline function Base.getindex(t::Table{T}, i::Int...) where {T}
    @boundscheck checkbounds(t, i...)
    old = convert(T, map(col -> @inbounds(getindex(col, i...)), columns(t)))::T
    new = T(@inbounds(getindex(col, i...) for col in columns(t)))
    @assert old === new
    return new
end

Technically, for the failing test, only the first one matters, but I did both just for completeness.

Both @assert checks pass perfectly fine. However, when I run the test suite, returning old passes, but returning new fails.

@Select and @Compute on Tables: Error During Test at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169
  Test threw exception
  Expression: #= /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 =# @inferred(mapreduce(s, ((acc, row)->begin
                    acc + row.sum
                end), t; init=0.0)) === 18.0
  return type Float64 does not match inferred return type Any
  Stacktrace:
   [1] error(::String) at ./error.jl:33
   [2] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 [inlined]
   [3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [4] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:151 [inlined]
   [5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
   [6] top-level scope at /home/anirudh/projects/TypedTables.jl/test/Table.jl:2

Honestly, I'm a bit stuck here. My understanding of a === b is that a and b are indistinguishable from one another, so I'm not sure how the assert could pass but the test could fail in this case. If anyone understands what's going on here, I'd love some help!

anirudh2 commented 1 year ago

I'm still not sure why the changed submitted by @m-wells broke that test of Julia 1.0.5, but adding ::T as follows seems to have fixed the problem for me.

T(@inbounds(getindex(col, i)) for col in columns(t))::T

On Julia 1.0.5, this does add an allocation for some reason, but it doesn't add an allocation on Julia 1.8 (still just 1 allocation total as shown in the earlier example by @m-wells). If you think this is an acceptable solution, I can open a PR. I'd still be curious if someone has an explanation for what was wrong before though.