JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Adding a new field to the `Leaf` API to support sample weights #72

Open Eight1911 opened 6 years ago

Eight1911 commented 6 years ago

I'm thinking of adding support for sample weights, which isn't compatible with the current `Leaf` struct. More specifically, the field `values` in the struct lists every label that falls into the leaf, with multiplicity, but does not give the weight of each label in the list. To add support for sample weights, I propose that we do either of the following:
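To make the gap concrete, here is a minimal sketch of the limitation and one hypothetical way a weighted leaf could look. The `WeightedLeaf` type and `weighted_majority` function are illustrative assumptions, not the actual proposal or the package's API:

```julia
# Roughly the existing layout: every label in the leaf is stored
# with multiplicity, so per-sample weights have nowhere to live.
struct Leaf{T}
    majority :: T
    values   :: Vector{T}
end

# Hypothetical weighted variant (illustrative only): a parallel
# weights vector, one weight per entry of `values`.
struct WeightedLeaf{T}
    majority :: T
    values   :: Vector{T}
    weights  :: Vector{Float64}
end

# A weighted majority vote over such a leaf: sum the weight of each
# distinct label and return the label with the largest total.
function weighted_majority(leaf::WeightedLeaf)
    totals = Dict{eltype(leaf.values), Float64}()
    for (v, w) in zip(leaf.values, leaf.weights)
        totals[v] = get(totals, v, 0.0) + w
    end
    return argmax(totals)   # key with the maximal total weight
end

leaf = WeightedLeaf(:a, [:a, :b, :b], [1.0, 0.4, 0.3])
weighted_majority(leaf)   # :a, since 1.0 > 0.4 + 0.3
```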

bensadeghi commented 6 years ago

Though I do like the approach mentioned, my main concern is that it would potentially make the already heavy `Leaf` (and, more generally, the tree) even heavier. Maybe a tuple would be lighter than a `Dict`. I would say that the top issue to be resolved is #44, where a simple tree takes up GBs (!!) of space on disk using JLD.jl or BSON.jl. I've been meaning to test how well trees made of the `NodeMeta` type write to disk, as they employ compact counts of the labels. If they do write well, then we should consider modifying the current `Leaf` and `Node` types, or dropping them altogether and adopting `NodeMeta` instead. It would be good to experiment with the approaches you mentioned and see which better reduces the size on disk.
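A quick sketch of the space argument: a leaf that stores every label with multiplicity keeps one entry per sample, while `NodeMeta`-style compact counts keep one entry per distinct label. The data here is illustrative, not the package's internals:

```julia
# One leaf holding 10_000 samples drawn from three classes.
labels = rand([:a, :b, :c], 10_000)

# Compact representation: label => count, one entry per distinct label.
counts = Dict{Symbol,Int}()
for l in labels
    counts[l] = get(counts, l, 0) + 1
end

Base.summarysize(labels)   # tens of kilobytes, grows with sample count
Base.summarysize(counts)   # a few hundred bytes, independent of sample count
```

The same asymmetry shows up on disk, which is why compact counts matter for #44.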

Eight1911 commented 6 years ago

I see how classification trees can be made more compact using the data structure from `NodeMeta`. However, the case seems a little harder for regression trees, where there may be as many distinct labels as there are data points.

One property that may help in building more compact regression trees is the fact that, for `node::NodeMeta`, `indX[node.region]` already gives the index of every sample that falls into `node`. Considering that `Y[indX[node.region]] == Y[indX][node.region]`, we might store just the single array `tree.labels = Y[indX]` at the top level, and store `node.region` at the node level. With this, we can recover the labels for each node by taking `tree.labels[node.region]`. Since we only need a single array, this may cut the overhead of having one array per `Leaf`.
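The indexing identity behind this can be checked on toy data. `Y`, `indX`, and `region` below mirror the roles of the package's variables but are stand-in values:

```julia
Y      = [10.0, 20.0, 30.0, 40.0, 50.0]   # regression labels
indX   = [3, 1, 5, 2, 4]                  # permutation of sample indices
region = 2:4                              # slice of indX owned by one node

# Permute-then-slice equals slice-then-permute:
Y[indX[region]] == Y[indX][region]        # true

# So a single top-level array suffices:
tree_labels = Y[indX]                     # stored once per tree
node_labels = tree_labels[region]         # recovered per node from a range
```

Because `region` is just a range, each node only needs to store two integers rather than its own label array.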

bensadeghi commented 6 years ago

Sounds good. Would love to see how well it writes to disk using JLD and BSON.

baggepinnen commented 4 years ago

Has there been any progress on adding support for sample weights? It would be an awesome feature to have :) I can see weights appearing in the code, but there is no interface for the user to specify them; they seem to be used internally to build boosting stumps. Would it be possible to expose an API where the user can supply a vector of weights when building a tree or a forest?

Edit: I'm working on a PR to add support for this