JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Excessive memory usage #123

Open CameronBieganek opened 4 years ago

CameronBieganek commented 4 years ago

I have a data set of dimensions (87390, 243). Most of the columns are categorical variables that have been one-hot encoded. The size of the data set in memory is ~160 MB. I compared the memory usage for DecisionTree.jl and R's ranger package.

DecisionTree.jl

using CSV, DataFrames
using DecisionTree

df = CSV.read("rf_training_data.csv", DataFrame)

y = string.(df.y)
X = Matrix(df[:, 2:end])

n_subfeatures = 15
n_trees = 600

# Default values:
# partial_sampling = 0.7
# max_depth = -1
# min_samples_leaf = 1

rf = build_forest(y, X, n_subfeatures, n_trees)

Memory consumption:

julia> varinfo(r"rf")
  name      size summary                 
  –––– ––––––––– ––––––––––––––––––––––––
  rf   1.417 GiB Ensemble{Float64,String}

ranger

library(readr)
library(ranger)

df <- read_csv('rf_training_data.csv')
df$y <- factor(df$y)

rf <- ranger(
    y ~ .,
    data = df,
    num.trees = 600,
    mtry = 15,
    min.node.size = 1,
    replace = FALSE,
    sample.fraction = 0.7
)

Memory consumption:

> print(object.size(rf), units = "MB")
585.2 Mb

Conclusion

Thus, it appears that DecisionTree.jl is using 2.4x as much memory as ranger for this model. Is it possible to reduce the memory footprint of DecisionTree.jl? I can provide a scrubbed version of my data set if that helps.
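For reference, the ratio can be re-derived from the two reported sizes. This is only a sketch of the arithmetic: it assumes that the "Mb" printed by R's object.size means 2^20 bytes (R's legacy units), so that it is comparable to the GiB figure from varinfo. The variable names are made up for illustration.

```julia
# Sizes as reported above.
julia_gib  = 1.417   # varinfo output for `rf`
ranger_mib = 585.2   # object.size output; ASSUMPTION: "Mb" means 2^20 bytes

# Bring both figures to the same unit before dividing.
julia_mib = julia_gib * 1024
ratio = julia_mib / ranger_mib

println("DecisionTree.jl: ", round(julia_mib; digits = 1), " MiB")
println("ranger:          ", ranger_mib, " MiB")
println("ratio:           ", round(ratio; digits = 2))
```

Under that unit assumption the ratio lands between 2.4 and 2.5, in line with the figure quoted above.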

bensadeghi commented 4 years ago

You could cast the features to a concrete type (i.e., X = Int.(X)) instead of using the Any type, which is quite heavy. That should help a little. Beyond that, we would need a new implementation of the Leaf type (see #90), which requires a significant amount of work.

CameronBieganek commented 4 years ago

> You could cast the features to a concrete type (ie X = Int.(X)) as opposed to using the Any type, which is quite heavy. That should help a little bit.

The features matrix in my example had typeof(X) == Array{Float64,2}, so I think I dodged that bullet.
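For anyone else landing here, a minimal sketch of how to check the element type and see what the cast does to the matrix itself. The matrix below is random stand-in data, not the data set from this issue, and the names X_any and X_f64 are hypothetical. It only measures the feature matrix; it does not show how much the resulting forest shrinks.

```julia
# Hypothetical stand-in for a feature matrix that ended up with element type Any.
X_any = Matrix{Any}(rand(1_000, 20))

# Broadcasting a constructor yields a concretely typed copy.
X_f64 = Float64.(X_any)

println(eltype(X_any), " is concrete: ", isconcretetype(eltype(X_any)))
println(eltype(X_f64), " is concrete: ", isconcretetype(eltype(X_f64)))

# Base.summarysize is the measure that varinfo reports.
println("Matrix{Any}:     ", Base.summarysize(X_any), " bytes")
println("Matrix{Float64}: ", Base.summarysize(X_f64), " bytes")
```

If eltype(X) already prints Float64, as in my case, the cast changes nothing and the memory has to be going elsewhere.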