JuliaData / TableOperations.jl

Common table operations on Tables.jl interface implementations

[BUG] TableOperations.select does not support categorical columns #25

Closed juliohm closed 6 months ago

juliohm commented 3 years ago

MWE:

using DataDeps
using RData
using Tables
using TableOperations

register(DataDep("juraset",
  "A geochemical dataset from the Swiss Jura",
  "https://github.com/cran/compositions/raw/master/data/juraset.rda",
  "0fcf23fbca1d6fcb58ae0de6f11365f39fa3df02828128708185ebf45b002382"))
rda  = joinpath(datadep"juraset", "juraset.rda")
jura = RData.load(rda)["juraset"]

# select non-categorical columns
t = TableOperations.select(jura, (:Cd, :Cu, :Pb, :Co, :Cr, :Ni, :Zn)...)

# this line crashes even though t doesn't have categorical data
Tables.matrix(t)
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  convert(::Type{T}, ::AbstractChar) where T<:Number at char.jl:180
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at multidimensional.jl:136
  ...
Stacktrace:
  [1] convert(#unused#::Type{Float64}, x::CategoricalArrays.CategoricalValue{String, UInt8})
    @ CategoricalArrays ~/.julia/packages/CategoricalArrays/w8EcO/src/value.jl:92
  [2] setindex!
    @ ./array.jl:841 [inlined]
  [3] macro expansion
    @ ./multidimensional.jl:903 [inlined]
  [4] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [5] _unsafe_setindex!(::IndexLinear, ::Matrix{Float64}, ::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}}, ::Base.Slice{Base.OneTo{Int64}}, ::Int64)
    @ Base ./multidimensional.jl:898
  [6] _setindex!
    @ ./multidimensional.jl:887 [inlined]
  [7] setindex!
    @ ./abstractarray.jl:1267 [inlined]
  [8] matrix(table::TableOperations.Select{DataFrames.DataFrameColumns{DataFrames.DataFrame}, true, (:Cd, :Cu, :Pb, :Co, :Cr, :Ni, :Zn)}; transpose::Bool)
    @ Tables ~/.julia/packages/Tables/i6z2B/src/matrix.jl:81
  [9] matrix(table::TableOperations.Select{DataFrames.DataFrameColumns{DataFrames.DataFrame}, true, (:Cd, :Cu, :Pb, :Co, :Cr, :Ni, :Zn)})
    @ Tables ~/.julia/packages/Tables/i6z2B/src/matrix.jl:74
 [10] top-level scope
    @ REPL[39]:1

quinnj commented 3 years ago

cc: @nalimilan ?

nalimilan commented 3 years ago

The example given above works here on Julia master.

Also the comment # select non-categorical columns is wrong, as :Rock and :Land columns are CategoricalArrays. And indeed the backtrace clearly indicates that a CategoricalVector is involved. What's weird is that Tables.matrix seems to be trying to create a Matrix{Float64}, which doesn't fit the type.

Would you be able to provide a simpler reproducer?

juliohm commented 3 years ago

How is the comment wrong? The selection only mentions columns that hold floats. It shouldn't involve the other, categorical columns, and yet Tables.matrix crashes. I will try to reproduce with the master branch; maybe the issue has been fixed and a release is missing?

nalimilan commented 3 years ago

Sorry, I was confused by the fact that printing t prints the DataFrameColumns object from which it's extracted.

I'm able to reproduce the problem. It seems that TableOperations.select incorrectly returns the first columns of its parent table instead of the selected ones:

julia> collect(Tables.Columns(t)) == collect(Tables.Columns(jura[!, 1:7]))
true
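
A smaller check in the same spirit, as a sketch only. The table `tbl` and its column names are hypothetical stand-ins for the Jura data. The idea that position-based lookup is the path that goes wrong is an assumption drawn from the stack trace (Tables.matrix fills the matrix one column at a time), not something confirmed in this thread. The sketch compares the columns the selection returns by name with those it returns by position:

```julia
using Tables
using TableOperations

# Hypothetical stand-in for the Jura data: a non-numeric column
# comes first, followed by two Float64 columns.
tbl = (Rock = ["a", "b", "c"], Cd = [1.0, 2.0, 3.0], Cu = [4.0, 5.0, 6.0])

# Select only the Float64 columns, as in the MWE.
sel  = TableOperations.select(tbl, :Cd, :Cu)
cols = Tables.columns(sel)
nms  = Tables.columnnames(cols)

# Look the selected columns up by name, then by position.
byname = [Tables.getcolumn(cols, nm) for nm in nms]
bypos  = [Tables.getcolumn(cols, i) for i in 1:length(nms)]

println("selected names: ", nms)
println("name and position lookups agree: ", byname == bypos)
```

If the last line prints false, the position-based lookup is returning columns of the parent table, which would match the comparison above.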

juliohm commented 6 months ago

Closing, as we now have TableTransforms.Select, which is more flexible and actively maintained.