EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Add CatBoost #822

Open annaveronika opened 5 years ago

annaveronika commented 5 years ago

CatBoost is a gradient boosting library that gives state-of-the-art results on datasets with categorical features, and on many datasets without them, so it makes sense to add it here. https://catboost.ai https://github.com/catboost/catboost

GinoWoz1 commented 5 years ago

Hey Anna, it would be great to add it. I'm not sure whether this solution works for CatBoost, but someone else found a way to add operators that are not in the default config:

https://github.com/EpistasisLab/tpot/issues/407

annaveronika commented 5 years ago

If you only need numerical features, it can be done the same way as XGBoost, which is already included. If you need categorical features, there is a little more to do: you have to pass a parameter with the categorical feature indices, either when the estimator is created or to the fit function.
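
A minimal sketch of the two options described above, assuming the `cat_features` argument of `CatBoostClassifier` and of its `fit` method. The helper names and the column layout are hypothetical and only serve as illustration; catboost is imported lazily so the index helper runs without it.

```python
from typing import List, Sequence


def categorical_indices(columns: Sequence[str], categorical: Sequence[str]) -> List[int]:
    """Return the positional indices of the categorical columns."""
    missing = set(categorical) - set(columns)
    if missing:
        raise ValueError(f"unknown categorical columns: {sorted(missing)}")
    wanted = set(categorical)
    return [i for i, name in enumerate(columns) if name in wanted]


def make_catboost_classifier(cat_idx: List[int]):
    """Build a CatBoostClassifier that knows which columns are categorical.

    Hypothetical helper: catboost is only imported when this is called.
    """
    from catboost import CatBoostClassifier

    # Option 1: pass the indices when the estimator is created.
    # Option 2 (alternative): create the model without them and call
    #   model.fit(X, y, cat_features=cat_idx)
    return CatBoostClassifier(cat_features=cat_idx, logging_level="Silent")


# Hypothetical column layout, for illustration only.
columns = ["age", "city", "income", "occupation"]
cat_idx = categorical_indices(columns, ["city", "occupation"])
print(cat_idx)  # [1, 3]
```

Passing the indices at construction time is the variant that fits a TPOT config dict, since TPOT only controls constructor parameters and not the arguments to `fit`.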

jhmenke commented 5 years ago

Is there an update on this? If not, I will probably look into it over the next few months.

weixuanfu commented 5 years ago

@jhmenke we don't have updates on this so far. Please let us know your findings. Thanks.

gregarious9612 commented 5 years ago

Hi guys, I've tried the approach from #407 for running TPOT over CatBoost only, and modified it for a classification problem. However, TPOT still goes through other models instead of only CatBoost. Not sure if anyone else has had the same issue.

jhmenke commented 5 years ago

> Hi guys, I've tried the approach from #407 for running TPOT over CatBoost only, and modified it for a classification problem. However, TPOT still goes through other models instead of only CatBoost. Not sure if anyone else has had the same issue.

Can you post your classifier dict and code sample? For me the method did work, but the issue right now is that there is no feasible way of passing the cat_columns to catboost.
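
For reference, a rough sketch of what a CatBoost-only classifier dict and the matching TPOT call might look like. It assumes TPOT's `config_dict` argument as used in #407. The function names, the search ranges for `iterations`, `learning_rate` and `depth`, and the TPOT settings are illustrative assumptions, not tested recommendations. Whether TPOT hands categorical columns to CatBoost unchanged has not been verified here, and depending on the installed versions `CatBoostClassifier` may first have to be made recognisable to TPOT as a classifier.

```python
from typing import Dict, List, Optional


def catboost_only_config(cat_features: Optional[List[int]] = None) -> Dict[str, dict]:
    """Build a TPOT config dict whose only operator is CatBoostClassifier."""
    params = {
        "logging_level": ["Silent"],
        # Illustrative search ranges (assumptions, not tuned values).
        "iterations": [100, 300, 500],
        "learning_rate": [0.03, 0.1, 0.3],
        "depth": [4, 6, 8],
    }
    if cat_features:
        # Wrapped in a list so the whole index list is one candidate value.
        params["cat_features"] = [cat_features]
    return {"catboost.CatBoostClassifier": params}


def fit_catboost_only(X, y, cat_features: Optional[List[int]] = None):
    """Run TPOT restricted to the CatBoost-only config (hypothetical helper)."""
    from tpot import TPOTClassifier  # imported lazily; requires tpot and catboost

    tpot = TPOTClassifier(
        generations=5,
        population_size=20,
        config_dict=catboost_only_config(cat_features),
        verbosity=2,
        random_state=42,
    )
    tpot.fit(X, y)
    return tpot
```

Because the dict contains no preprocessors or other estimators, every pipeline TPOT builds from it should consist of CatBoost alone. Leaving `cat_features` unset gives the purely numerical setup.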

annaveronika commented 5 years ago

It is now possible to pass cat_features together with other training parameters. So there should be no problem with them.

jhmenke commented 5 years ago

Then this should suffice. It would be added to the default regressor dict (and analogously for the classifier):

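import sys
# Column names of the categorical features, e.g.
# cat_features = list(features.select_dtypes(include=["category"]).columns)
cat_features = [...]

# Only register CatBoost if it has already been imported.
if 'catboost' in sys.modules:
    from sklearn.base import RegressorMixin
    from catboost import CatBoostRegressor
    # Attach the mixin so TPOT treats CatBoostRegressor as a regressor.
    CatBoostRegressor.__bases__ += (RegressorMixin,)
    regressor_config_dict['catboost.CatBoostRegressor'] = {
        'logging_level': ['Silent'],
        # Wrapped in a list so the whole list is a single candidate value.
        'cat_features': [cat_features],
    }

annaveronika commented 4 years ago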

So is there a plan to add CatBoost? It now supports text features along with categorical ones.

jhmenke commented 4 years ago

Do the CatBoost classes, e.g. CatBoostRegressor, now derive from the sklearn RegressorMixin?

Once they do, CatBoost could simply be added to the default configs in TPOT.

annaveronika commented 4 years ago

No, but I think all the needed methods should be in place.

jhmenke commented 4 years ago

Then I think a reasonable solution would be to make an example notebook with CatBoost, but not add it to the default configuration, since cat_features and the import need to be coded manually.

annaveronika commented 4 years ago

Actually, you don't have to pass cat_features if you don't have any; the library can be used without categorical features.

kmedved commented 4 years ago

I would likewise be interested in seeing CatBoost added to the default configuration. Even without categorical variables, I've found it performs comparably to XGBoost after tuning, and it frequently outperforms XGBoost when both are run with default settings. Especially for pipelines with shorter runtimes, it could add real value.

annaveronika commented 4 years ago

https://github.com/catboost/catboost/issues/696#issuecomment-627258634 - catboost performs well in other auto ml packages