Evovest / EvoTrees.jl

Boosted trees in Julia
https://evovest.github.io/EvoTrees.jl/dev/
Apache License 2.0

MLJ table #262

Closed: jeremiedb closed this 8 months ago

jeremiedb commented 8 months ago

Fix #260 @ablaom I moved the MLJ wrapper from the original Matrix API to the new Tables one, which provides native support for richer inputs, including Ordered and Unordered (Multiclass) CategoricalValues.
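To illustrate the new capability, here is a minimal sketch of the kind of mixed input the Tables-based wrapper is meant to accept (column names and data here are illustrative only):

```julia
using MLJBase, MLJModels, CategoricalArrays

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

# a column table mixing continuous, ordered categorical, and
# unordered (Multiclass) categorical features
X = (
    num = rand(100),
    ord = categorical(rand(["low", "mid", "high"], 100), ordered=true),
    cat = categorical(rand(["a", "b", "c"], 100)),
)
y = rand(100)

mach = machine(booster, X, y)
fit!(mach, verbosity=0)
predict(mach, X)
```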

codecov[bot] commented 8 months ago

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (535b0c4) 51.05% compared to head (a4c9dcb) 51.90%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   51.05%   51.90%   +0.84%
==========================================
  Files          22       22
  Lines        1986     1944      -42
==========================================
- Hits         1014     1009       -5
+ Misses        972      935      -37
```

| Files | Coverage Δ | |
|---|---|---|
| src/fit-utils.jl | `90.36% <100.00%> (-1.36%)` | :arrow_down: |
| src/fit.jl | `96.39% <100.00%> (+0.03%)` | :arrow_up: |
| src/predict.jl | `92.24% <100.00%> (+0.06%)` | :arrow_up: |
| src/MLJ.jl | `88.88% <92.30%> (+2.88%)` | :arrow_up: |
| ext/EvoTreesCUDAExt/predict.jl | `0.00% <0.00%> (ø)` | |
| ext/EvoTreesCUDAExt/fit-utils.jl | `0.00% <0.00%> (ø)` | |


ablaom commented 8 months ago

Happy to review this shortly.

ablaom commented 8 months ago

Okay, I think I was mistaken in thinking that dropping the model-specific data pre-processing (reformat) and model-specific resampling (selectrows) would slow down IteratedModel. As you point out, we only call update after the first train, so the table -> matrix conversion happens just once. I have also confirmed this with a benchmark: this PR actually runs slightly faster for IteratedModel, for Step sizes of 1, 5, and 30 and a total of 300 iterations (details at the end).

On the other hand, I do expect some impact on TunedModel wrapping, because there each new value of the tuned parameter triggers a call to fit, not update. A benchmark suggests this is indeed the case, but the slowdown is small (assuming it is even statistically significant):


```julia
# on EvoTrees master branch:

@btime fit!($mach, verbosity=0)
#  15.437 μs (100 allocations: 4.55 KiB)

# on this PR:

@btime fit!($mach, verbosity=0)
#  17.439 μs (100 allocations: 4.55 KiB)
```

Benchmark details are attached below.

So, I won't push back against your reluctance to update the reformat/selectrows implementation; I'll assume you have good reasons for not doing so.

I'm not sure I fully understand all the objections to the current API, but I suggest we have this discussion in a different forum. (Still happy to take a call to discuss this.)

API changes to MLJModelInterface, if they are breaking, will not happen in any hurry.

I will proceed with the rest of my review shortly.

```julia
using MLJModels
using MLJIteration
using MLJBase
using BenchmarkTools
using StatisticalMeasures
using MLJTuning

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)

# # CONTROLLED ITERATION

# number of iterations:
N = 300
steps = [1, 5, 30]

iterated_boosters = [
    IteratedModel(
        model=booster,
        resampling=Holdout(fraction_train=0.8),
        controls=[Step(s), NumberLimit(round(Int, N/s))],
        measure=l1,
    ) for s in steps
]

for ib in iterated_boosters
    mach = machine(ib, X, y)
    print("step=", ib.controls[1].n)
    @btime fit!($mach, verbosity=0)
    @assert report(mach).n_iterations == N
end

# on EvoTrees master branch:
# step=1  14.351 μs (101 allocations: 4.61 KiB)
# step=5  14.301 μs (101 allocations: 4.61 KiB)
# step=30 14.342 μs (101 allocations: 4.61 KiB)

# on this PR:
# step=1  13.701 μs (101 allocations: 4.61 KiB)
# step=5  13.726 μs (101 allocations: 4.61 KiB)
# step=30 13.776 μs (101 allocations: 4.61 KiB)

# # TUNING

r = range(booster, :lambda, lower=0, upper=0.5)

tuned_booster = TunedModel(
    booster,
    tuning=RandomSearch(),
    range=r,
    measure=l2,
    n=500,
)

mach = machine(tuned_booster, X, y)

# on EvoTrees master branch:
@btime fit!($mach, verbosity=0)
# 15.437 μs (100 allocations: 4.55 KiB)

# on this PR:
@btime fit!($mach, verbosity=0)
# 17.439 μs (100 allocations: 4.55 KiB)
```
ablaom commented 8 months ago

@jeremiedb There seems to be a problem with handling tables that are row-based. Perhaps you are calling Tables.columnnames(X) directly on a table X (a common gotcha)? You can only call this on tables implementing the AbstractColumns interface, or on rows of a table implementing the AbstractRow interface.

```julia
Pkg.activate("explore002", shared=true)

using MLJBase
using MLJModels
using Tables

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)

# this works:
mach = machine(booster, X, y) |> fit!

# this doesn't:
X = Tables.rowtable(X)
mach = machine(booster, X, y) |> fit!

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
# Stacktrace:
#  [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)
#    @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
#  [2] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64}, w::Nothing)
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [3] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64})
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [4] fit_only!(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
#    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
#  [5] fit_only!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
#  [6] #fit!#63
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
#  [7] fit!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
#  [8] |>(x::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, f::typeof(fit!))
#    @ Base ./operators.jl:907
#  [9] top-level scope
#    @ REPL[12]:1
```
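For reference, a minimal sketch of the generic access pattern that avoids the gotcha (plain Tables.jl usage, not EvoTrees code):

```julia
using Tables

# a row table: a Vector of NamedTuples
X = [(x1 = 1.0, x2 = 2.0), (x1 = 3.0, x2 = 4.0)]

# columnnames/getcolumn are only guaranteed for AbstractColumns objects
# (or individual rows), so materialize a columns view first:
cols = Tables.columns(X)       # valid for any Tables.jl-compatible source
Tables.columnnames(cols)       # (:x1, :x2)
Tables.getcolumn(cols, :x1)    # [1.0, 3.0]
```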
jeremiedb commented 8 months ago

> Perhaps you are calling Tables.columnnames(X) directly on a table X

Guilty :) The Tables implementation effectively assumed column tables. Since several internals depend on getcolumn, I've added a conversion to Tables.columntable. It adds no overhead when tables are already in that format, which I suppose is the most common case (e.g. a DataFrame).
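A minimal sketch of that kind of guard (the helper name is illustrative; the actual EvoTrees internals may differ):

```julia
using Tables

# Normalize any Tables.jl input to a column table before calling getcolumn.
# Tables.columntable is a cheap no-op for data already stored column-wise,
# so the common case (e.g. a DataFrame's columns) pays no conversion cost.
function to_columntable(df)
    Tables.istable(df) || throw(ArgumentError("expected a Tables.jl-compatible table"))
    return Tables.columntable(df)
end

rowtab = [(x1 = 1.0, x2 = 2.0), (x1 = 3.0, x2 = 4.0)]
coltab = to_columntable(rowtab)    # (x1 = [1.0, 3.0], x2 = [2.0, 4.0])
Tables.getcolumn(coltab, :x1)      # now always valid
```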

Regarding the impact on TunedModel, I'd expect it to be very small: each time a new hyper-parameter configuration is tested, a full model initialisation must be performed, and that initialisation time is essentially the same for the original Matrix and the new Tables-based internal APIs. The bulk of the computing time during EvoTrees training is normally spent iterating over the n trees, and those steps are identical regardless of whether the input was originally a Matrix or a table.

Thanks a lot for taking the time to look at the PR, and sorry for the digression about the iterated API. I'm trying to put together a coherent proposal so as to offer a constructive take; I'll ping back once it's done!

ablaom commented 8 months ago

I think that row-table problem persists with predict:

```julia
EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)
X = Tables.rowtable(X);

# smoke tests:
mach = machine(booster, X, y) |> fit!    # this now works, thanks
predict(mach, X)                         # fails

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
```
Stack trace

```
Stacktrace:
 [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)
   @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
 [2] binarize(df::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}; fnames::NTuple{5, Symbol}, edges::Vector{Any})
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/fit-utils.jl:74
 [3] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU}; ntree_limit::Int64)
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:91
 [4] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU})
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [5] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [6] predict(#unused#::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, fitresult::EvoTrees.EvoTree{EvoTrees.MSE, 1}, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/MLJ.jl:45
 [7] predict(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, Xraw::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/operations.jl:133
 [8] top-level scope
   @ REPL[42]:1
```
jeremiedb commented 8 months ago

> I think that row-table problem persists with predict:

Thanks for the catch. I had neglected to add a test for row tables; one has now been added to the MLJ test suite (covering both fit! and predict). Tests are now passing following the patch that converts to Tables.columntable in EvoTrees.predict.
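Something along the lines of this sketch (the actual test in the suite may differ):

```julia
using MLJBase, MLJModels, Tables

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)
Xrow = Tables.rowtable(X)    # Vector of NamedTuples: a row-oriented table

# both operations must now accept row tables without erroring
mach = machine(booster, Xrow, y)
fit!(mach, verbosity=0)
@assert length(predict(mach, Xrow)) == length(y)
```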

Regarding @spawn vs @threads: I've followed the discussions around it. The post you referred to prompted the migration to :static that I did in v0.15, which turned out to be ill-advised, as it caused issues for a user running EvoTrees inside an already-threaded routine. My understanding is that problems arise when one relies on threadid() as a stable reference to the thread doing the work; to the best of my knowledge, that pattern has not been used in this codebase. I simply wanted to move all threaded operations to a common approach, as the histogram-building routines already use @threads and show some performance improvement over @spawn.
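For context, a minimal sketch of the threadid() pitfall mentioned above (generic Julia, not EvoTrees code): under the default :dynamic scheduler a task may migrate between threads, so threadid() is not a stable buffer index; indexing by the task's own chunk avoids the problem.

```julia
using Base.Threads

n = 1_000
nchunks = nthreads()

# Risky: threadid() can change across yield points under :dynamic
# scheduling, so per-"thread" accumulators may be corrupted.
acc_bad = zeros(nchunks)
@threads for i in 1:n
    acc_bad[threadid()] += i    # not guaranteed correct
end

# Safe: index by the chunk, which each iteration exclusively owns.
acc = zeros(nchunks)
@threads for c in 1:nchunks
    for i in c:nchunks:n
        acc[c] += i
    end
end
@assert sum(acc) == sum(1:n)
```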