Evovest / EvoTrees.jl

Boosted trees in Julia
https://evovest.github.io/EvoTrees.jl/dev/
Apache License 2.0

Increasing max_depth causes memory leak #121

Closed: john-waczak closed this issue 2 years ago

john-waczak commented 2 years ago

I have been able to train an EvoTreeRegressor with the default parameters successfully. When I try to increase the max_depth parameter beyond 10, my memory usage suddenly spikes and Julia dies.

Here's a snippet from the REPL

julia> evo = EvoTreeRegressor(max_depth=15, rng=42)
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 15,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(42),
    device = "cpu")

julia> mach = machine(evo, Xtrain, CDOM_train)
Machine{EvoTreeRegressor{Float64,…},…} trained 0 times; caches data
  args: 
    1:  Source @710 ⏎ `Table{AbstractVector{Continuous}}`
    2:  Source @134 ⏎ `AbstractVector{Continuous}`

julia> fit!(mach, verbosity=2)
[ Info: Training Machine{EvoTreeRegressor{Float64,…},…}.

Process julia killed
ablaom commented 2 years ago

@john-waczak Thanks for reporting! Good to know about this.

A complete minimal working example might speed up resolution, ideally without the MLJ wrapper.

john-waczak commented 2 years ago

Okay, here's an MWE. Julia crashes when running the following on an Ubuntu 21.04 machine with 16 GB RAM and a 4-core i7-7700HQ @ 2.80GHz.

using EvoTrees

# Simple Regression Demo
n=2000;
X = 2*(rand(n,2) .- 0.5);

y = X[:,1].^5 + X[:,2].^4 - X[:,1].^4 - X[:,2].^3

size(X)
size(y)

# train a first time with the default settings
params1 = EvoTreeRegressor()
model = fit_evotree(params1, X, y)

# train with increased max_depth
# this causes Julia to crash
params2 = EvoTreeRegressor(max_depth=20)
model = fit_evotree(params2, X, y) 

Here's the output of Pkg.status:

(evoTree_bug) pkg> status
      Status `~/gitRepos/evoTree_bug/Project.toml`
  [f6006082] EvoTrees v0.8.4

Here's a screenshot of my memory usage: [memory usage screenshot attached in the original issue]

jeremiedb commented 2 years ago

Thanks for reporting! From what I can tell, it doesn't seem to be an issue per se or a memory leak, but rather a consequence of design choices geared toward fitting speed, which result in significant memory pre-allocation. Specifically, histograms are pre-allocated for each tree node, and at a depth of 20 there are over 500K such nodes. What looks like a memory leak is actually a long pre-allocation process.
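To make the scale of that pre-allocation concrete, here is a rough back-of-envelope sketch. Only the exponential node count follows directly from the binary tree structure; the per-node histogram layout assumed below (nbins bins per feature, three Float64 accumulators per bin) is an illustrative assumption, not EvoTrees' exact internal representation.

# Rough estimate of histogram pre-allocation cost as max_depth grows.
# Node count is exact for a full binary tree; the per-node layout
# (nbins bins x nfeatures features x 3 Float64 accumulators) is an
# illustrative assumption, not EvoTrees' actual internals.
function histogram_footprint_gib(max_depth; nbins=64, nfeatures=2,
                                 values_per_bin=3, bytes_per_value=8)
    nnodes = 2^(max_depth - 1) - 1            # inner nodes of a tree of depth max_depth
    bytes_per_node = nbins * nfeatures * values_per_bin * bytes_per_value
    return nnodes * bytes_per_node / 2^30     # total size in GiB
end

histogram_footprint_gib(10)   # ~0.0015 GiB: negligible
histogram_footprint_gib(20)   # ~1.5 GiB even for the 2-feature toy data; grows with nfeatures

Even under these conservative assumptions the allocation roughly doubles with each extra level of depth, which matches the sudden jump in memory use reported above.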

However, in a gradient boosted model each tree acts as a weak learner, and as such I'm not aware of situations where a depth much greater than 10 is of any value. Typically, a depth in the 3-8 range performs best. Let me know if you are in a situation where greater depth is needed; I'm afraid, though, that a significantly different and potentially less efficient design would be needed to support such scenarios.
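For completeness, a small sketch of the suggested usage on the same toy X, y from the MWE above; max_depth=5 and nrounds=100 are arbitrary illustrative choices, and predict is used as in the EvoTrees README.

# Fit with a depth in the recommended 3-8 range; memory use stays modest.
using EvoTrees, Statistics

params = EvoTreeRegressor(max_depth=5, nrounds=100, rng=42)
model = fit_evotree(params, X, y)
pred = vec(EvoTrees.predict(model, X))
mean(abs2, pred .- y)   # training MSE on the toy data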

john-waczak commented 2 years ago

@jeremiedb Thanks for your reply! That makes a lot of sense. I think I should be more than fine with a smaller max_depth. I was trying some hyper-parameter variations just to see what would happen and noticed the script kept dying once max_depth got past 10 or so.