JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Problems saving random forest model #44

Open opterix opened 7 years ago

opterix commented 7 years ago

Hi all,

I'm having a strange problem when saving a random forest model. When using the JLD module to save a model created by the DecisionTree module, it takes a huge amount of space on disk. For instance, a model that is 155 MB as a variable in Julia takes more than 2 GB on disk, and an eternity to save!

I worked around this by using the serialize and deserialize commands, which let me save the models. However, the drawback of this method is that it may not be possible to read the models back in other Julia versions. I don't know whether this is an issue with the DecisionTree module or the JLD module, but I'd like to know if you have another workaround.
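For reference, here is a minimal sketch of that workaround on current Julia, where serialize and deserialize live in the Serialization standard library (the file name and the training data are just illustrative):

using DecisionTree, Serialization

features = rand(1000, 4)                    # illustrative features
labels = rand(["a", "b", "c"], 1000)        # illustrative labels
model = build_forest(labels, features, 2, 100)

# The built-in serializer is fast and compact, but its on-disk format
# is not guaranteed to be readable across Julia versions
open(io -> serialize(io, model), "model.jls", "w")

model2 = open(deserialize, "model.jls")     # read the model back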

Cheers,

cstjean commented 7 years ago

Could you isolate the problem (e.g. by serializing parts of your object and finding a small one that demonstrates the issue)? Then it'll be easier to figure out whether JLD is at fault.

It could be because of the Leaf structure:

immutable Leaf
    majority::Any    # most frequent label among the samples in this leaf
    values::Vector   # every training label that reached this leaf
end
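A quick way to see the cost of that abstractly-typed storage (exact numbers vary by Julia version):

# Storing labels in a Vector{Any} boxes every element, so each leaf's
# values vector costs far more than the raw data it holds
v_any = Vector{Any}(collect(1.0:1000.0))   # pointers to boxed Float64s
v_f64 = collect(1.0:1000.0)                # inline Float64s, 8 bytes each
Base.summarysize(v_any), Base.summarysize(v_f64)   # roughly 3x more memory for v_any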

@bensadeghi I think we've discussed this before, but couldn't we replace values with a few summary statistics/counts? It seems like we're taking a lot more memory than strictly necessary, and we're doing computation at prediction time that could be done at training time.

bensadeghi commented 7 years ago

@cstjean Yes, changing the values field from an array of labels to, say, a Dict of counts is the right way to go. My concern is with the AdaBoost routines, which require a positional index of the labels. Then again, we could just drop the AdaBoost functionality altogether.

bensadeghi commented 7 years ago

@cstjean I started a new type which includes a Dict of counts, and updated build_tree() to take advantage of it:

immutable LeafC
    majority::Any             # most frequent label in this leaf
    counts::Dict{Any,Int64}   # label => sample count, replacing the raw values vector
end
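A hypothetical helper (the name is mine, not from the branch) shows how such a leaf would be built at training time, so prediction no longer needs to scan a raw label vector:

function leafc_from_labels(labels::Vector)
    # tally each label once, at training time
    counts = Dict{Any,Int64}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    # the majority is the key with the highest count
    majority, best = nothing, 0
    for (k, v) in counts
        if v > best
            majority, best = k, v
        end
    end
    LeafC(majority, counts)
end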

Surprisingly, in my preliminary results there wasn't any improvement in the size of the models in memory (measured via whos()), nor when written to disk via JLD. These were sizing runs on classifiers: iris trees and large forests (1000 trees).

See the compact_tree branch for the changes. Any thoughts?

cstjean commented 7 years ago

Was it significantly faster for prediction?

A Dict has a lot of memory overhead, and it's too bad that we have to store the keys, since they're always the same. If we only cared about the ScikitLearn interface, we could store the keys in DecisionTreeClassifier in order, like ("setosa", "versicolor", "virginica"), then store the counts as a tuple (12, 2, 30) instead of a Dict.
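Roughly, that layout could look like this (LeafT and its field names are hypothetical, written in current Julia syntax):

struct LeafT{N}
    majority_index::Int     # index into the classifier's ordered class list
    counts::NTuple{N,Int}   # per-class counts, one slot per class
end

classes = ("setosa", "versicolor", "virginica")   # stored once, in order
leaf = LeafT(3, (12, 2, 30))
classes[leaf.majority_index]                      # => "virginica"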

I haven't looked into JLD internals, so I'm not sure where it's wasteful... We could write a custom serializer?

bensadeghi commented 5 years ago

@opterix There has been some progress on this front with the DecisionTree v0.8.1 release (requires Julia v0.7-v1.0). The native data types Node and Leaf are now typed, so model writing to disk is much more efficient. JLD2.jl has been working well for this.

Note that even though features and labels of type Array{Any} are supported, it is highly recommended that data be cast to explicit types (i.e., with float.(), string.(), etc.). This significantly improves model training and prediction execution times, and also drastically reduces the size of saved models.
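Putting both recommendations together, a minimal sketch of the current workflow (file and variable names are illustrative):

using DecisionTree, JLD2

features = float.(rand(100, 4))               # explicit Float64 features
labels = string.(rand(["a", "b"], 100))       # explicit String labels
model = build_forest(labels, features, 2, 50)

@save "forest.jld2" model    # JLD2 writes the typed model compactly
@load "forest.jld2" model    # and reads it back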