JuliaData / TableOperations.jl

Common table operations on Tables.jl interface implementations

schema fails when selecting columns from a really wide table #20

Closed OkonSamuel closed 3 years ago

OkonSamuel commented 3 years ago

Hello. Due to a recent change in Tables.jl, developers now need to also consider specializing on Tables.Schema{nothing, nothing} in addition to Tables.Schema{names, types} (see here). Users will eventually run into this error when selecting a few columns from a really wide table, as shown below.

julia> ncols = 1000000
1000000

julia> df = DataFrame(rand(6, ncols), :auto);

julia> n = TableOperations.select(df, :x1, :x2);

julia> Tables.schema(n);
ERROR: MethodError: no method matching columntype(::Nothing, ::Nothing, ::Symbol)

OkonSamuel commented 3 years ago

The following naive solution should do the trick. Maybe someone has a better fix?

function typesubset(sch::Tables.Schema{nothing, nothing}, nms::NTuple{N, Symbol}) where {N}
    names = sch.names
    types = sch.types
    return Tuple{Any[Tables.columntype(names, types, nm) for nm in nms]...}
end

function typesubset(sch::Tables.Schema{nothing, nothing}, inds::NTuple{N, Int}) where {N}
    types = sch.types
    return Tuple{Any[types[i] for i in inds]...}
end
typesubset(::Tables.Schema{nothing, nothing}, ::Tuple{}) = Tuple{}

namesubset(::Tables.Schema{nothing, nothing}, nms::NTuple{N, Symbol}) where {N} = nms
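# `sch` must be a named argument to read `sch.names`; `Base.@pure` is dropped because the names are runtime data here
namesubset(sch::Tables.Schema{nothing, nothing}, inds::NTuple{N, Int}) where {N} = (names = sch.names; ntuple(i -> names[inds[i]], N))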
namesubset(::Tables.Schema{nothing, nothing}, ::Tuple{}) = ()

quinnj commented 3 years ago

> Due to a recent change in Tables.jl, developers now need to also consider specializing on Tables.Schema{nothing, nothing} in addition to Tables.Schema{names, types} (see here).

It should be noted that the threshold at which Tables.jl switches to this alternative Schema representation is pretty high: 67_000 columns. This was chosen specifically because many operations on tables this wide were failing anyway; there are fundamental limits in the compiler right now that mean creating tuples/namedtuples that large starts breaking in weird ways.

Anyway, that aside, yes, we should fix the case here. I just wanted to clarify that this kind of operation didn't work before anyway.
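
As a rough illustration of the two Schema representations discussed above, the sketch below looks up a column type through the schema's names and types properties instead of its type parameters, so the same code covers both cases. The helper name coltype is hypothetical and is not part of Tables.jl or TableOperations.jl. The stored=true keyword is assumed to force the stored representation without needing tens of thousands of columns; if your Tables.jl version does not accept it, a sufficiently wide schema would be needed instead.

```julia
using Tables

# Hypothetical helper (not part of Tables.jl or TableOperations.jl):
# look up a column's type via the schema's `names`/`types` properties,
# which are readable for both schema representations.
function coltype(sch::Tables.Schema, nm::Symbol)
    i = findfirst(==(nm), sch.names)
    i === nothing && throw(ArgumentError("no column named $nm"))
    return sch.types[i]
end

# Usual case: names and types are encoded as type parameters.
compiled = Tables.Schema((:a, :b), (Int, Float64))

# Assumption: `stored=true` requests the Schema{nothing, nothing} form,
# where names and types are kept in vectors inside the schema object.
stored = Tables.Schema((:a, :b), (Int, Float64); stored=true)

# The same lookup works on either representation.
coltype(compiled, :b)
coltype(stored, :b)
```

Reading sch.names and sch.types at run time gives up the compile-time specialization that Schema{names, types} offers, which seems an acceptable trade-off for tables wide enough to hit the threshold.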

quinnj commented 3 years ago

PR up: https://github.com/JuliaData/TableOperations.jl/pull/22