JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)
MIT License

Feature request: Support for tables #154

Closed ParadaCarleton closed 11 months ago

ParadaCarleton commented 11 months ago

For example:

julia> using DataFrames, CategoricalArrays

julia> x = hcat(DataFrame(randn(10, 5), :auto), DataFrame(CategoricalArray.(eachcol(rand(["1", "2", "3", "4"], 10, 5))), :auto); makeunique=true)
10×10 DataFrame
 Row │ x1           x2           x3          x4          x5          x6          x7          ⋯
     │ Float64      Float64      Float64     Float64     Float64     Float64     Float64     ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │  0.0076124    2.21962     -1.89752     0.24856    -0.185547   -1.18036     0.177196   ⋯
   2 │ -0.425131    -1.95286     -2.08625     0.480588    1.72549    -0.28748    -0.711898
   3 │ -0.678378     0.956257    -0.426269   -0.740123    1.94817     0.0582993   0.814919
   4 │  0.815181     0.882876    -0.527539   -0.769075    0.401716   -1.25234     0.216388
   5 │ -0.430265     0.74117      0.0932157  -0.80661     1.83201    -1.00751     0.0808424  ⋯
   6 │ -0.953549     0.921323    -0.192622   -0.152674   -0.829379    0.629351    0.719016
   7 │ -1.55753      0.580445     0.428604   -0.423595    1.187      -0.730763   -1.19092
   8 │ -1.93545      0.120406     0.898218    0.629203   -0.164727    0.121863   -0.46737
   9 │ -3.16131     -2.60021      0.0405212  -0.635231    1.09621     0.09391     2.50053    ⋯
  10 │ -1.43519      0.240422    -0.0817438  -0.0991257  -0.122359   -0.243555    1.09018

julia> config = LinearRegressor()
LinearRegressor(
  fit_intercept = true, 
  solver = nothing)

julia> tuned_machine = machine(config, x[:, Not(1)], x[:, 1]) |> fit!
ERROR: MethodError: no method matching fit(::GeneralizedLinearRegression{L2Loss, NoPenalty}, ::Matrix{Any}, ::Vector{Float64}; solver::Analytical)

Closest candidates are:
  fit(::GeneralizedLinearRegression, ::AbstractMatrix{<:Real}, ::AbstractVector{<:Real}; data, solver)
   @ MLJLinearModels ~/.julia/packages/MLJLinearModels/yYgtO/src/fit/default.jl:36
  fit(::GeneralizedLinearRegression; kwargs...)
   @ MLJLinearModels ~/.julia/packages/MLJLinearModels/yYgtO/src/fit/default.jl:50

Stacktrace:
  [1] fit(m::LinearRegressor, verb::Int64, X::DataFrame, y::Vector{Float64})
    @ MLJLinearModels ~/.julia/packages/MLJLinearModels/yYgtO/src/mlj/interface.jl:29
  [2] fit_only!(mach::Machine{LinearRegressor, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
  [3] fit_only!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
  [4] #fit!#63
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
  [5] fit!
    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
  [6] |>(x::Machine{LinearRegressor, true}, f::typeof(fit!))
    @ Base ./operators.jl:917
  [7] top-level scope
    @ REPL[24]:1

Similar issue in EvoTrees.jl.

tlienart commented 11 months ago

This is unrelated to MLJLinearModels. Please ask on Discourse for help, or possibly open an issue in MLJ directly. The data that gets passed through is not properly typed. You can see that here:

julia> tuned_machine = machine(config, x[:, Not(1)], x[:, 1]) |> fit!
ERROR: MethodError: no method matching fit(::GeneralizedLinearRegression{L2Loss, NoPenalty}, ::Matrix{Any}, ::Vector{Float64}; solver::Analytical)

It should be a `Matrix{<:Real}`; this suggests that you missed an encoding step.
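For reference, the missing encoding step can be sketched with MLJ's `OneHotEncoder` (the column names and sizes below are illustrative, not from the original report):

```julia
using MLJ, DataFrames, CategoricalArrays

# a small table mixing Continuous and Multiclass columns,
# mirroring the one in the issue
X = DataFrame(x1 = randn(10),
              x2 = categorical(rand(["1", "2", "3", "4"], 10)))

# one-hot encode the categorical column(s)
enc = machine(OneHotEncoder(), X) |> fit!
Xenc = MLJ.transform(enc, X)

# the encoded table now converts to a properly typed matrix
eltype(MLJ.matrix(Xenc))   # Float64, not Any
```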

ParadaCarleton commented 11 months ago

The data that gets passed through is not properly typed. You can see that here:

Right, sorry, I was under the impression that MLJ models were expected to accept arbitrary tables as inputs, rather than just accepting Matrix{<:Real}. I'll edit this issue, then.

tlienart commented 11 months ago

The issue name is incorrect: MLJ handles tables just fine, and MLJLM handles matrices as it should; the interface between the two is handled by MLJ. The issue here is that you did not encode the categorical features.

julia> using DataFrames, CategoricalArrays, ScientificTypes, MLJModelInterface, MLJBase

julia> X = hcat(DataFrame(randn(10, 5), :auto), DataFrame(CategoricalArray.(eachcol(rand(["1", "2", "3", "4"], 10, 5))), :auto); makeunique=true);

julia> schema(X)
┌───────┬───────────────┬──────────────────────────────────┐
│ names │ scitypes      │ types                            │
├───────┼───────────────┼──────────────────────────────────┤
│ x1    │ Continuous    │ Float64                          │
│ x2    │ Continuous    │ Float64                          │
│ x3    │ Continuous    │ Float64                          │
│ x4    │ Continuous    │ Float64                          │
│ x5    │ Continuous    │ Float64                          │
│ x1_1  │ Multiclass{4} │ CategoricalValue{String, UInt32} │
│ x2_1  │ Multiclass{4} │ CategoricalValue{String, UInt32} │
│ x3_1  │ Multiclass{3} │ CategoricalValue{String, UInt32} │
│ x4_1  │ Multiclass{4} │ CategoricalValue{String, UInt32} │
│ x5_1  │ Multiclass{4} │ CategoricalValue{String, UInt32} │
└───────┴───────────────┴──────────────────────────────────┘

julia> typeof(MLJModelInterface.matrix(X))
Matrix{Any} (alias for Array{Any, 2})

`MLJModelInterface.matrix(X)` is how MLJ converts the training data it passes over to MLJLinearModels; as you can see, the output is an untyped matrix because the table contains columns of strings ("1", "2", etc.).

TL;DR: apply an encoder first, then pass the result to the linear regressor.
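The encoder-then-regressor recipe can be written as an MLJ pipeline, which applies the encoding automatically at `fit!` and `predict` time (a minimal sketch reproducing the issue's data; the `@load` line assumes MLJLinearModels is registered with MLJ's model registry, as it is in current releases):

```julia
using MLJ, DataFrames, CategoricalArrays

# reproduce the mixed Continuous/Multiclass table from the issue
X = hcat(DataFrame(randn(10, 5), :auto),
         DataFrame(CategoricalArray.(eachcol(rand(["1", "2", "3", "4"], 10, 5))), :auto);
         makeunique=true)

# load the MLJLinearModels regressor through MLJ's model registry
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels verbosity=0

# chain an encoder in front of the regressor so categorical columns
# are one-hot encoded before reaching the linear model
pipe = OneHotEncoder() |> LinearRegressor()

mach = machine(pipe, X[:, Not(:x1)], X.x1) |> fit!
```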