Evovest / EvoTrees.jl

Boosted trees in Julia
https://evovest.github.io/EvoTrees.jl/dev/
Apache License 2.0

EvoTree's predict doesn't return a valid probability when max_depth is relatively high #84

Closed: pgagarinov closed this issue 3 years ago

pgagarinov commented 3 years ago

The last line of the following block throws an exception:

using Pkg;
Pkg.add(["DataFrames", "CSV", "TabularDisplay", "CategoricalArrays"]);
using DataFrames, CSV, TabularDisplay, CategoricalArrays
Pkg.add(["MLJ", "EvoTrees", "MLJScientificTypes"])
using MLJ, EvoTrees, MLJScientificTypes

num_cols = [
    "ClientPeriod",
    "MonthlySpending",
    "TotalSpent"
];

cat_cols = [
    "Sex",
    "IsSeniorCitizen",
    "HasPartner",
    "HasChild",
    "HasPhoneService",
    "HasMultiplePhoneNumbers",
    "HasInternetService",
    "HasOnlineSecurityService",
    "HasOnlineBackup",
    "HasDeviceProtection",
    "HasTechSupportAccess",
    "HasOnlineTV",
    "HasMovieSubscription",
    "HasContractPhone",
    "IsBillingPaperless",
    "PaymentMethod"
];
all_feature_cols = [num_cols; cat_cols];
target_col = "Churn";

df = DataFrame!(CSV.File("./train.csv", pool=0.1, missingstrings=[" "]))
categorical!(df,[cat_cols;target_col]);
describe(df,:eltype,:nunique, :nmissing)
dropmissing!(df);
describe(df,:eltype,:nunique, :nmissing)

X = df[!, all_feature_cols];
y = df[!,target_col];

mach_x = machine(ContinuousEncoder(), X)
fit!(mach_x)
X = MLJ.transform(mach_x, X)

tree_model = EvoTreeClassifier(max_depth=6, nrounds=2000, colsample=0.3)
mach = machine(tree_model, X, y)

train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split

fit!(mach, rows=train, verbosity=1)
pred_test = MLJ.predict(mach, selectrows(X, test))

The problem seems to be related to https://github.com/alan-turing-institute/MLJBase.jl/issues/525

Here is the exception:

DomainError with Probabilities must be in [0,1].:

Stacktrace:
  [1] _err_01()
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:42
  [2] _check_probs_01(probs::Vector{Float32})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:66
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:648 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:621 [inlined]
  [5] getindex
    @ ./broadcast.jl:575 [inlined]
  [6] copy
    @ ./broadcast.jl:922 [inlined]
  [7] materialize
    @ ./broadcast.jl:883 [inlined]
  [8] UnivariateFinite(::MLJModelInterface.FullInterface, prob_given_class::OrderedCollections.LittleDict{CategoricalValue{Int64, UInt8}, AbstractVector{Float32}, Vector{CategoricalValue{Int64, UInt8}}, Vector{AbstractVector{Float32}}}; kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:127
  [9] _UnivariateFinite(support::CategoricalVector{Int64, UInt8, Int64, CategoricalValue{Int64, UInt8}, Union{}}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, kwargs::Base.Iterators.Pairs{Symbol, Union{Missing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:pool, :ordered), Tuple{Missing, Bool}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:245
 [10] _UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}, N::Int64; augment::Bool, pool::Missing, ordered::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:287
 [11] #_UnivariateFinite#37
    @ ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:308 [inlined]
 [12] UnivariateFinite(::MLJModelInterface.FullInterface, support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/univariate_finite/types.jl:212
 [13] UnivariateFinite(support::Vector{Int64}, probs::LinearAlgebra.Transpose{Float32, Base.ReshapedArray{Float32, 2, Base.ReinterpretArray{Float32, 1, StaticArrays.SVector{2, Float32}, Vector{StaticArrays.SVector{2, Float32}}, false}, Tuple{}}}; kwargs::Base.Iterators.Pairs{Symbol, Missing, Tuple{Symbol}, NamedTuple{(:pool,), Tuple{Missing}}})
    @ MLJModelInterface ~/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:431
 [14] predict(#unused#::EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, fitresult::EvoTrees.GBTree{2, Float32, Int64}, A::NamedTuple{(:matrix, :names), Tuple{Matrix{Float64}, Vector{Symbol}}})
    @ EvoTrees ~/.julia/packages/EvoTrees/L5jFX/src/MLJ.jl:56
 [15] predict(mach::Machine{EvoTreeClassifier{Float32, EvoTrees.Softmax, Int64}, true}, Xraw::DataFrame)
    @ MLJBase ~/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
 [16] top-level scope
    @ In[22]:1
 [17] eval
    @ ./boot.jl:360 [inlined]
 [18] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

This behavior creates a real problem when doing a hyperparameter search, as in https://alan-turing-institute.github.io/MLJ.jl/stable/#Lightning-tour-1
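
For context, a minimal tuning setup in the spirit of that tour (the range and measure are illustrative); a single invalid probability inside predict aborts the whole search:

r = range(tree_model, :max_depth, lower=3, upper=8)
tuned = TunedModel(model=tree_model, range=r,
                   resampling=CV(nfolds=3), measure=log_loss)
tuned_mach = machine(tuned, X, y)
fit!(tuned_mach)  # throws as soon as any fold's predict hits the DomainError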

The data is attached: train.zip

jeremiedb commented 3 years ago

Thanks for reporting this! I realized that I hadn't incorporated epsilon buffers for numerical stability into some of the gradient calculations, which I suspect is the root cause of those NaNs. I should be able to validate that later today.
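
A minimal sketch of what such an epsilon guard can look like for a logistic loss, assuming binary targets in {0, 1}; the names and the epsilon value are illustrative, not EvoTrees' actual implementation:

const ϵ = 1e-8

sigmoid(x) = 1 / (1 + exp(-x))

function logistic_grads(pred, y)
    p = clamp(sigmoid(pred), ϵ, 1 - ϵ)  # keep p strictly inside (0, 1)
    grad = p - y                        # first-order gradient of the logloss
    hess = max(p * (1 - p), ϵ)          # guard the hessian against underflow
    return grad, hess
end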

jeremiedb commented 3 years ago

@pgagarinov I've just pushed a PR that I think should partly solve the issue you encountered. Epsilons were added to the gradient calculations and provide some remedy. However, I still encountered some cases where the model overflowed; I suspect the cause is the predictions accumulated over all the trees reaching an overflow. A mechanism to cap the total prediction remains to be considered.
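
For illustration, such a cap could clamp the running sum of raw scores before the softmax is applied; this is a hypothetical mechanism, not something the PR implements:

const MAX_MARGIN = 30.0  # exp(30) ≈ 1e13 stays comfortably finite in Float64

cap_margin(total, tree_pred) = clamp(total + tree_pred, -MAX_MARGIN, MAX_MARGIN)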

To limit the risk of running into such issues, I'd recommend using the algorithm with Float64, which can be set by specifying T=Float64 in the parameters. With v0.7.2, which I just pushed, Float64 is now the default instead of Float32.
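
For the reproduction above, that amounts to:

tree_model = EvoTreeClassifier(T=Float64, max_depth=6, nrounds=2000, colsample=0.3)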

Also, for classification with 2 outputs, I noticed that the logistic regression appears to converge and train faster than the softmax / multi-class approach. This requires, however, passing y as a numeric / float:

EvoTreeRegressor(T=Float64,
    loss=:logistic, metric=:logloss,
    nrounds=100, nbins=100,
    λ=0.5, γ=0.1, η=0.1,
    max_depth=6, min_weight=5.0,
    rowsample=0.5, colsample=1.0)
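
A sketch of how this configuration might be wired into the reproduction above, assuming the Churn target is a two-level categorical (the encoding line is illustrative):

# Encode the binary target as 0.0 / 1.0 floats, as the logistic regressor requires.
y_num = Float64.(y .== levels(y)[2])

reg_model = EvoTreeRegressor(T=Float64, loss=:logistic, metric=:logloss,
    nrounds=100, nbins=100, λ=0.5, γ=0.1, η=0.1,
    max_depth=6, min_weight=5.0, rowsample=0.5, colsample=1.0)
reg_mach = machine(reg_model, X, y_num)
fit!(reg_mach, rows=train)
pred_test = MLJ.predict(reg_mach, selectrows(X, test))  # predicted P(positive class)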
pgagarinov commented 3 years ago

@jeremiedb Thanks for fixing this. I'll test it and let you know whether it helps.

jeremiedb commented 3 years ago

@pgagarinov There have been further improvements to stability, speed, and memory consumption in version 0.8.0. I expect it to resolve the current issue; let me know otherwise.