Closed jeremiedb closed 8 months ago
Attention: 8 lines
in your changes are missing coverage. Please review.
Comparison is base (
535b0c4
) 51.05% compared to head (a4c9dcb
) 51.90%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Happy to review this shortly.
Okay, I think I was mistaken to think dumping the model-specific data pre-processing (reformat
) and model-specific resampling (selectrows
) would slow down IteratedModel
. As you point out, we are only calling update
after the first train, so the conversion table -> matrix only happens once. I have also confirmed this with a benchmark, where this PR actually runs slightly faster for IteratedModel
, for Step
sizes of 1, 5, 30 and a total of 300 iterations (details at the end).
On the other hand, I do expect some impact on TunedModel
wrapping, for here changing the parameter to be tuned means we call fit
and not update
for each new parameter value. A benchmark suggest this is indeed the case, but the slowdown is small (assuming it's even statistically significant):
# on EvoTrees master branch:
@btime fit!($mach, verbosity=0)
# 15.437 μs (100 allocations: 4.55 KiB)
# on this PR:
@btime fit!($mach, verbosity=0)
# 17.439 μs (100 allocations: 4.55 KiB)
Benchmark details are attached below.
So, I won't push back against your reluctance to update the reformat/selectrows
implementation, and assume you have good reasons for not doing so.
I'm not sure I fully understand all the objections to the current API, but I suggest we have this discussion in a different forum. (Still happy to take a call to discuss this.)
API changes to MLJModelInterface, if they are breaking, will not happen in any hurry.
I will proceed with the rest of my review shortly.
@jeremiedb There seems to be a problem with handling tables that are row-based. Perhaps you are calling Tables.columnnames(X)
directly on a table X
(a common gotcha)?? You can only call this on tables implementing the AbstractColumns
interface, or on rows of a table implementing the AbstractRow
interface.
Pkg.activate("explore002", shared=true)
using MLJBase
using MLJModels
using Tables
EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()
X, y = make_regression(1000, 5)
# this works:
mach = machine(booster, X, y) |> fit!
# this doesn't
X = Tables.rowtable(X)
mach = machine(booster, X, y) |> fit!
# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
# Stacktrace:
# [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)
# @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
# [2] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64}, w::Nothing)
# @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
# [3] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64})
# @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
# [4] fit_only!(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}; rows::Nothing,
# verbosity::Int64, force::Bool, composite::Nothing)
# @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
# [5] fit_only!
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
# [6] #fit!#63
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
# [7] fit!
# @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
# [8] |>(x::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, f::typeof(fit!))
# @ Base ./operators.jl:907
# [9] top-level scope
# @ REPL[12]:1
Perhaps you are calling
Tables.columnnames(X)
directly on a tableX
Guilty :) The Tables implementation effectively assumed working with columntables
. Since there were several dependencies to getcolumn
internally, I've added a conversion to Tables.columntable
. It doesn't add any overhead when Tables are already in such format which is the most common case I suppose (such as DataFrame).
Regarding the impact on TunedModel
I'd expect it to be very small, as when a new hyper-prams configuration is tested, a full model initialisation must be performed. Such initialization time is essentially the same for both the original Matrix and the new Tables based internal APIs. The bulk of computing time during an EvoTree training is normally spent iterating over n trees; steps which are identical in EvoTrees regardless of if the input was originally a Matrix of a Tables.
Thanks a lot for taking the time to look at the PR. And sorry about the digression about the iterated API. I'm trying to gather some coherent proposal so as to have a constructive take, will ping back once done!
I think that row-table problem persists with predict
:
EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()
X, y = make_regression(1000, 5)
X = Tables.rowtable(X);
# smoke tests:
mach = machine(booster, X, y) |> fit! # this now works, thanks
predict(mach, X) # fails
# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]
I think that row-table problem persists with
predict
:
Thanks for the catch. I had neglected adding a test for rowtables, which has now been added within the MLJ test suite (both fit! and predict). Tests are now passing followin the patch to convert to Tables.columntable
in EvoTrees.predict
.
Regarding the @spawn
vs @thread
, I've followed the discussions around it. The post you referred to caused the migration to :static
I did in v0.15, which turned out ill advised as it resulted in issues for a user running EvoTrees under an already threaded routine. My understanding of the issue is that problems can be caused when one relies on the threadid()
as a reference to the thread making the work. This is to best of my knowledge not a pattern that has been used in the codebase. I simply wanted to move all threaded operation to a common appraoch, as the histogram builing routines are already using @threads
and provide some performance improvements over @spawn
.
Fix #260 @ablaom I moved the MLJ wrapper from the orignal Matrix API to the new Tables one, which provides native support for richer inputs, incl. Ordered and Unordered (MultiClass) CategoricalValues.