forestry-labs / Rforestry

https://forestry-labs.github.io/Rforestry/

Output an Rforestry tree in `treelite` format so that inference is blazing fast #33

Open linanqiu opened 1 year ago

linanqiu commented 1 year ago

treelite is a format for serializing tree models for prediction only. It takes xgboost, lightgbm, and sklearn trees out of the box. It copies the internal structure of each tree and converts it into C code, which makes predicting through the final structure extremely fast. There's also a CUDA treelite wrapper that converts any treelite model into one that works on GPUs.

One can also construct a treelite model from a custom trained model. All one needs are the following details for each tree: split points, feature at split, numerical vs categorical, and leaf nodes.
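To make those four details concrete, here is a minimal sketch of a per-node record and a traversal over it. The `Node` class, its field names, and the "strictly less than goes left" convention are all assumptions for illustration; this is neither treelite's builder API nor Rforestry's internal layout.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class Node:
    """One node of a dumped tree (hypothetical layout, not treelite's API)."""
    feature: Optional[int] = None          # index of the feature at the split
    threshold: Optional[float] = None      # split point for a numerical split
    categories: Optional[Set[int]] = None  # categories sent left (categorical split)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None          # set only on leaf nodes

    @property
    def is_leaf(self) -> bool:
        return self.value is not None


def predict_tree(root: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the leaf value."""
    node = root
    while not node.is_leaf:
        if node.categories is not None:
            # Categorical split: go left if the category is in the left set.
            go_left = int(x[node.feature]) in node.categories
        else:
            # Numerical split: assumed convention is "x < threshold goes left".
            go_left = x[node.feature] < node.threshold
        node = node.left if go_left else node.right
    return node.value


# Toy tree: numerical split on feature 0, then categorical split on feature 1.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(value=1.0),
    right=Node(feature=1, categories={0, 2},
               left=Node(value=3.0), right=Node(value=5.0)),
)
```

Anything that can be dumped into this shape (or an equivalent JSON) should carry enough information to rebuild the tree elsewhere; the exact comparison operator and category encoding would need to match whatever Rforestry actually uses.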

If we can get Rforestry to dump its internal structure into a JSON or something like that, I can work with that to convert it into a treelite tree. That'd give us most of the prediction performance perks + sklearn perks.

@JasjeetSekhon @theo-s

linanqiu commented 1 year ago

Yup just confirmed that this works with model matrix as well if we basically (ab)use multiclass leaf vectors and make each leaf node an indicator vector of observation indices lol.
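A rough sketch of the indicator-vector idea, in plain Python rather than treelite calls. It assumes the matrix in question holds per-training-observation weights, and that each leaf's indicator is normalized by leaf size before averaging over trees; the function names and data are made up for illustration.

```python
def leaf_indicator(train_indices, n_train):
    """Indicator vector over training observations for one leaf."""
    members = set(train_indices)
    return [1.0 if i in members else 0.0 for i in range(n_train)]


def forest_weights(indicators_per_tree):
    """Average each tree's size-normalized leaf indicator into one weight vector."""
    n_trees = len(indicators_per_tree)
    n_train = len(indicators_per_tree[0])
    weights = [0.0] * n_train
    for indicator in indicators_per_tree:
        leaf_size = sum(indicator)
        for i, v in enumerate(indicator):
            weights[i] += v / leaf_size / n_trees
    return weights


y_train = [10.0, 20.0, 30.0, 40.0]
# Leaves that a single query point lands in, one per tree (made-up data).
hit = [leaf_indicator([0, 1], 4), leaf_indicator([1, 2, 3], 4)]
w = forest_weights(hit)
# Weighted sum of training outcomes equals the usual average of leaf means.
prediction = sum(wi * yi for wi, yi in zip(w, y_train))
```

Under these assumptions the weights sum to one and the weighted sum reproduces the ordinary forest prediction, which is why the leaf vectors can stand in for the regular leaf values.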

linanqiu commented 1 year ago

plottree.R has enough examples for me to get started on dumping the tree structure to exactly what treelite needs. Let me play with it.

linanqiu commented 1 year ago

@theo-s @JasjeetSekhon I just realized that in order to fully implement doubleOOB, treelite's predict method would have to take in an additional "treesToExclude" vector (or "treesToInclude"). Otherwise it will take the average of all the trees by default. That requires changing the API significantly in treelite and possibly the tree compilation.
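For reference, the behaviour I mean is roughly the following. This is a sketch that assumes per-tree predictions are available as a plain list; `predict_excluding` and `trees_to_exclude` are hypothetical names, not part of treelite.

```python
def predict_excluding(per_tree_preds, trees_to_exclude=()):
    """Average per-tree predictions, skipping the excluded tree indices."""
    excluded = set(trees_to_exclude)
    kept = [p for t, p in enumerate(per_tree_preds) if t not in excluded]
    if not kept:
        raise ValueError("all trees were excluded")
    return sum(kept) / len(kept)


per_tree = [1.0, 2.0, 6.0]
all_trees = predict_excluding(per_tree)          # default: average of every tree
without_last = predict_excluding(per_tree, [2])  # drop tree 2 for this observation
```

The exclusion set differs per observation, so it has to be an argument to predict rather than something baked into the compiled model.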