Open m-wells opened 1 year ago
Pinging @quinnj and @andyferris for visibility.
Thanks for noticing this and the contribution @m-wells
@m-wells LazyTables seems really cool - some great ideas there. I've been wanting a "lazy" row for a long time but never got around to it.
To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve. If you really want "All the good of TypedTables.jl but FASTER and without as many allocations!" that sounds really great to me - I'd probably use it myself! :) All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype
of Table
would likely be worth it in the long run (we have #105 to fix anyway)). Does that interest you? (If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).
@andyferris
To be honest I haven't had the bandwidth to give many of the packages I started the love they deserve.
I understand that 100%.
All that is to say, if you want contribute to and help maintain TypedTables you would be very welcome to work here (I don't have much attachment to the existing code and hopefully wouldn't get in your way, and a breaking change to the eltype of Table would likely be worth it in the long run (we have https://github.com/JuliaData/TypedTables.jl/issues/105 to fix anyway)). Does that interest you?
Sure.
(If on the other hand you want to make many deeply breaking changes, or are just exploring, that's totally understandable).
If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl
can go away. That is only needed because of the generation of NamedTuples for each iteration. If we implemented a TypedRow
(analogous to how I implemented the LazyRow
) then we could substantially simplify the backend.
This wouldn't be needed anymore https://github.com/JuliaData/TypedTables.jl/blob/0b86a6a6ea3f9a871f9fc16f3de09538cef083c4/src/TypedTables.jl#L16-L43 as the element type would be fully inferrable from compile time just from the column types. (This would also make LazyTables redundant, which is okay with me.)
The definition of Table
could become
struct Table{K, N, V<:Tuple{Vararg{AbstractArray{<:Any,N}}}} <: AbstractArray{TypedRow{K, N, V}, N}
data::NamedTuple{K, V}
end
I have other ideas and none of them involve changing how users interface with Table, just backend stuff.
If you are willing to drop NamedTuple as the eltype then most of the wrapper code in src/columnops.jl can go away
True that. And yes, I am willing.
The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames DataFrameRow
in future). Do you think it's worth having an AbstractRow{names, T <: Tuple}
type? And in general would Union{NamedTuple{names, T}, AbstractRow{names{T}}
be usable for inserting rows, comparing rows (things like ==
, isequal
, isless
and hash
), and share the named tuple interface (meaning propertynames
+ getproperty
)?
In the meantime @m-wells I have added you as a maintainer. Welcome, and thank you! :)
Please free to merge #103. You are also welcome to make releases (following semver) - let me know if you aren't sure how that works. As a general development process it would be nice to follow a convention of having pull requests reviewed/approved, but if people are not available for please continue to make progress rather than getting stuck (use your judegement how long is too long).
The only thing I might add is it would be nice if there were some kind of row type shared between different different tables (like the other tables here like the dictionary ones, but also it would be good to at least theoretically be able to support other Tables.jl tables or even DataFrames
DataFrameRow
in future). Do you think it's worth having anAbstractRow{names, T <: Tuple}
type? And in general wouldUnion{NamedTuple{names, T}, AbstractRow{names{T}}
be usable for inserting rows, comparing rows (things like==
,isequal
,isless
andhash
), and share the named tuple interface (meaningpropertynames
+getproperty
)?
I don't see a particular use case for an AbstractRow
(there is already the Tables.Row struct, which is essentially what you are describing). In fact any Tables.jl object can be called as a Tables.columntable
which would return the columns as a NamedTuple
. This is all you need to make a typed table. In fact because TypedRow
has to be defined before Table
struct TypedRow{K, V} <: Tables.AbstractRow
data::NamedTuple{K, V}
row::Int
end
struct Table{K, N, V} <: AbstractArray{TypedRow{K, V}, N}
data::NamedTuple{K, V}
# inner constructor to check equal length columns perhaps
end
TypedRow
can be reused throughout TypedTables.jl
although if the underlying table (like FlexTable
) is not type stable it might make more sense to have a generic Row
.
struct FlexRow{A<:AbstractArray} <: Tables.AbstractRow
table::A
row::Int
end
I haven't played with Dictionaries
enough to mess with DictTable
but it looks like it could also use the TypedRow
.
I'm tacking onto this issue since it's related, but please let me know if you'd prefer that I open a new issue.
I noticed that CI fails for julia 1.0.5 with @m-wells's awesome updates to Base.getindex
. In particular, it fails on the inferred type checking of @test @inferred(mapreduce(s, (acc, row) -> acc + row.sum, t; init = 0.0)) === 18.0
.
I locally modified the Base.getindex
implementations as follows:
@inline function Base.getindex(t::Table{T}, i::Int) where {T}
@boundscheck checkbounds(t, i)
old = convert(T, map(col -> @inbounds(getindex(col, i)), columns(t)))::T
new = T(@inbounds(getindex(col, i)) for col in columns(t))
@assert old === new
return new
end
@inline function Base.getindex(t::Table{T}, i::Int...) where {T}
@boundscheck checkbounds(t, i...)
old = convert(T, map(col -> @inbounds(getindex(col, i...)), columns(t)))::T
new = T(@inbounds(getindex(col, i...) for col in columns(t)))
@assert old === new
return new
end
Technically, for the failing test, only the first one matters, but I did both just for completeness.
Both @assert
checks pass perfectly fine. However, when I run the test suite, returning old
passes, but returning new
fails.
@Select and @Compute on Tables: Error During Test at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169
Test threw exception
Expression: #= /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 =# @inferred(mapreduce(s, ((acc, row)->begin
acc + row.sum
end), t; init=0.0)) === 18.0
return type Float64 does not match inferred return type Any
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:169 [inlined]
[3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[4] macro expansion at /home/anirudh/projects/TypedTables.jl/test/Table.jl:151 [inlined]
[5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Test/src/Test.jl:1083 [inlined]
[6] top-level scope at /home/anirudh/projects/TypedTables.jl/test/Table.jl:2
Honestly, I'm a bit stuck here. My understanding of a === b
is that a
and b
are indistinguishable from one another, so I'm not sure how the assert could pass but the test could fail in this case. If anyone understands what's going on here, I'd love some help!
I'm still not sure why the changed submitted by @m-wells broke that test of Julia 1.0.5, but adding ::T
as follows seems to have fixed the problem for me.
T(@inbounds(getindex(col, i)) for col in columns(t))::T
On Julia 1.0.5, this does add an allocation for some reason, but it doesn't add an allocation on Julia 1.8 (still just 1 allocation total as shown in the earlier example by @m-wells). If you think this is an acceptable solution, I can open a PR. I'd still be curious if someone has an explanation for what was wrong before though.
I have a pull request that addresses this issue.