JuliaAI / CatBoost.jl

Julia wrapper of the python library CatBoost for boosted decision trees
MIT License

Port to MLJ? #9

Closed azev77 closed 1 year ago

azev77 commented 3 years ago

Hey and thank you for this package! I've been hoping for a CatBoost interface for a while!!

Have you considered porting this to MLJ.jl? This would be an awesome addition as they currently support XGBoost & LightGBM. @ablaom @tlienart

BTW, I noticed some Julia wrappers wrap ML models in high level code (like Python/R). Other wrappers wrap the underlying low level code (eg GLMNet.jl wraps the Fortran code from glmnet.R). Wrapping the underlying CatBoost code would prob be a pain, but would there be a performance difference?

ablaom commented 3 years ago

For the record, there is also an MLJ interface for EvoTrees.jl, a pure Julia implementation of gradient tree boosting. I would expect it to make a good template for adding an MLJ interface to CatBoost.jl. It includes, for example, an appropriate implementation of MLJ's update method, which makes "warm restarts" possible and allows one to wrap these models in an iterative control strategy (e.g., early stopping based on out-of-sample losses).

cc @jeremiedb
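
To illustrate the "warm restart" idea mentioned above, here is a minimal, self-contained sketch in plain Julia. It does not use MLJ or EvoTrees.jl: `ToyBooster`, `boost`, `fit`, and `update` are hypothetical names, and the signatures are simplified compared with MLJ's real model interface. The point is only that `update` can continue from the previous fit when nothing but the iteration count has grown.

```julia
# Toy stand-in for a boosted model: each round moves the prediction toward
# the mean of the current residuals. All names here are hypothetical.
Base.@kwdef mutable struct ToyBooster
    iterations::Int = 10
    learning_rate::Float64 = 0.5
end

# Run `n` more boosting rounds starting from the current prediction `pred`.
function boost(pred::Float64, y::Vector{Float64}, learning_rate::Float64, n::Int)
    for _ in 1:n
        residual_mean = sum(y .- pred) / length(y)
        pred += learning_rate * residual_mean
    end
    return pred
end

# Cold fit: train from scratch and remember the hyperparameters that were used.
function fit(model::ToyBooster, y::Vector{Float64})
    fitresult = boost(0.0, y, model.learning_rate, model.iterations)
    cache = (iterations=model.iterations, learning_rate=model.learning_rate)
    return fitresult, cache
end

# Warm restart: if only `iterations` grew, continue from the old fitresult;
# otherwise fall back to a cold fit.
function update(model::ToyBooster, old_fitresult::Float64, old_cache, y::Vector{Float64})
    extra = model.iterations - old_cache.iterations
    if model.learning_rate == old_cache.learning_rate && extra >= 0
        fitresult = boost(old_fitresult, y, model.learning_rate, extra)
        cache = (iterations=model.iterations, learning_rate=model.learning_rate)
        return fitresult, cache
    end
    return fit(model, y)
end

y = [1.0, 2.0, 3.0]
model = ToyBooster(iterations=5)
fitresult, cache = fit(model, y)
model.iterations = 10                          # ask for five more rounds
warm, _ = update(model, fitresult, cache, y)   # continues from round 5
cold, _ = fit(model, y)                        # trains all 10 rounds from scratch
```

An iterative control strategy such as early stopping can then raise `iterations` step by step and check an out-of-sample loss between calls, paying only for the extra rounds each time.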

femtomc commented 3 years ago

Hi everyone -- thanks for the interest.

I suspect wrapping the low level code will be a pain. In terms of performance, of course a native wrapper would be faster than calling through the Python runtime -- but the performance penalty incurred by calling through the runtime should be negligible versus the time it takes for a model to train, etc. So we have no intention of trying to wrap the native C++ code (if CatBoost offers a C API -- this may change, although IIRC I don't think they export a C API).

We considered implementing the MLJ interface previously -- but ultimately decided that the way CatBoost does things and the way that MLJ does things are different enough that the impedance mismatch was not worth seriously trying to fix given our priorities. Our perspective then changed: this CatBoost.jl package would be a pure wrapper package -- and if someone wants to implement an MLJCatBoost.jl package -- we would welcome it.

One point in particular: consider the process of fitting with MLJ, described at https://alan-turing-institute.github.io/MLJ.jl/dev/quick_start_guide_to_adding_models/#Model-type-and-constructor.

Compare this to the (essentially API-restricted) way of fitting CatBoost models:

```julia
# Create pools.
train = Pool(; data=x_train, label=y_train, group_id=queries_train)
test = Pool(; data=x_test, label=y_test, group_id=queries_test)

# small number of iterations to not slow down CI too much
default_parameters = Dict("iterations" => 10, "loss_function" => "RMSE",
                          "custom_metric" => ["MAP:top=10", "PrecisionAt:top=10",
                                              "RecallAt:top=10"], "verbose" => false,
                          "random_seed" => 314159)

function fit_model(params, train_pool, test_pool)
    model = catboost.CatBoost(params)
    model.fit(train_pool; eval_set=test_pool, plot=false)
    return model
end
```

Hyperparameters are passed over the line in Dict form to the Python runtime -- and there's a very large number of them available for customization by the user. So supporting a generic mutable CatBoostModel struct which satisfies the MLJ interfaces seemed more restrictive than just exposing this API to the user here.
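
To make the trade-off concrete, here is a minimal sketch of what such a struct could look like. Everything in it is an assumption made for illustration: `CatBoostModelSketch`, its fields, the catch-all `extra` field, and `to_params` are hypothetical names, not part of CatBoost.jl or MLJ. The idea is that a few hyperparameters get typed fields while the rest still travel as a `Dict`, and the whole thing is flattened back into the `Dict` form shown above.

```julia
# Hypothetical model struct: a few typed hyperparameters plus a catch-all
# `extra` Dict for anything that has no dedicated field.
Base.@kwdef mutable struct CatBoostModelSketch
    iterations::Int = 1000
    loss_function::String = "RMSE"
    random_seed::Union{Int,Nothing} = nothing
    extra::Dict{String,Any} = Dict{String,Any}()
end

# Flatten the struct into the Dict form; fields left as `nothing` are omitted.
function to_params(model::CatBoostModelSketch)
    params = Dict{String,Any}()
    for name in fieldnames(typeof(model))
        name === :extra && continue
        value = getfield(model, name)
        value === nothing || (params[String(name)] = value)
    end
    return merge(params, model.extra)   # entries in `extra` win on key clashes
end

model = CatBoostModelSketch(iterations=10, extra=Dict{String,Any}("verbose" => false))
params = to_params(model)   # could then be handed to something like fit_model above
```

The restriction described above shows up in the typed fields: every hyperparameter that gets its own field has to be declared and kept in sync with the Python library by hand, whereas the plain `Dict` accepts whatever CatBoost accepts.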

Again, if either of you are interested in creating an MLJCatBoost wrapper library -- we would welcome it! But we are not prioritizing it.

Thank you.

ablaom commented 3 years ago

Comment to self: This is not a pure Julia implementation, but a wrapper of Python code (which presumably wraps C).

ericphanson commented 3 years ago

Yep, a popular C++ library: https://github.com/catboost/catboost (if it were C, we might try to wrap it directly instead of its Python interface). This is a pretty minimal wrapper that just uses PyCall and tries to make it a bit more convenient to send/receive tabular data.

ericphanson commented 1 year ago

I’d be interested in adding an MLJ interface directly to CatBoost.jl here; I think it would add a lot of value. I bet we can find a way to make the interfaces work.

azev77 commented 1 year ago

Any progress?

ericphanson commented 1 year ago

No, sorry. I was writing a new model and was thinking about doing it with CatBoost but ended up going with XGBoost after a quick check showed similar perf in this case. (I have seen CatBoost do noticeably better in other cases though). Hopefully we can find time to do it at some point, but for now it's not a priority for me.

ablaom commented 1 year ago

BTW, it looks like XGBoost.jl is getting a much-needed rewrite: https://github.com/dmlc/XGBoost.jl/pull/111

🤞🏾

ericphanson commented 1 year ago

Closed by #16

v0.3.0 will have MLJ integration thanks to @tylerjthomas9 and @ablaom !

azev77 commented 1 year ago

It would be great if the MLJ docs were updated to reflect this.

ablaom commented 1 year ago

That will happen when I update the model registry shortly. I'll re-open this to flag that it hasn't happened yet.

ablaom commented 1 year ago

Oh, I can't reopen. I'll create the issue at MLJModels now instead.