alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CatBoost Library is complaining about unhashable class #37

Closed ragrawal closed 4 years ago

ragrawal commented 4 years ago

What is the bug?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-83c5a63cefff> in <module>
     19 xgbStep = make_step(CatBoostClassifier)()(x, y)
     20 model = Model(x, xgbStep, y)
---> 21 model.fit(dataset[:,0:8], dataset[:,8])

/usr/local/anaconda3/envs/interview/lib/python3.6/site-packages/baikal/_core/model.py in fit(self, X, y, **fit_params)
    412 
    413             ys = [results_cache[t] for t in node.targets]
--> 414             fit_params = fit_params_steps.get(node.step, {})
    415 
    416             if node.fit_compute_func is not None:

TypeError: unhashable type: 'CatBoostClassifier'
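
The failing statement at line 414 of model.py uses the step object itself as a dictionary key, so the step must be hashable. In Python, a class that defines __eq__ without also defining __hash__ has its __hash__ set to None, which makes its instances unusable as dictionary keys. The sketch below reproduces that general behavior with a hypothetical Estimator class. It assumes, without confirmation from this thread, that CatBoostClassifier is unhashable for this reason.

```python
class Estimator:
    """Hypothetical stand-in: defines __eq__ but not __hash__ (not CatBoost itself)."""

    def __init__(self, depth):
        self.depth = depth

    def __eq__(self, other):
        return isinstance(other, Estimator) and self.depth == other.depth


def lookup_fit_params(fit_params_by_step, step):
    """Mimic the failing dict lookup; return the error text if the key is unhashable."""
    try:
        return fit_params_by_step.get(step, {})
    except TypeError as exc:
        return str(exc)


# Defining __eq__ alone makes Python set __hash__ to None on the class.
result = lookup_fit_params({}, Estimator(depth=3))
print(Estimator.__hash__ is None)
print(result)
```

Note that the lookup fails even on an empty dictionary, because the key is hashed before any comparison takes place.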

How to reproduce it?

import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from baikal import Input, Model, make_step, Step
from baikal.plot import plot_model
from baikal.steps import Stack
from catboost import CatBoostClassifier

# load data
df = pd.read_csv(
    'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', 
    header=None)
dataset = df.values

x = Input()
y = Input()

xgbStep = make_step(CatBoostClassifier)()(x, y)
model = Model(x, xgbStep, y)
model.fit(dataset[:,0:8], dataset[:,8])

ragrawal commented 4 years ago

I was able to work around the issue with the following code. However, I'm not sure whether this is the right approach.

import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from baikal import Input, Model, make_step, Step
from baikal.plot import plot_model
from baikal.steps import Stack
from catboost import CatBoostClassifier

# load data
df = pd.read_csv(
    'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', 
    header=None)
dataset = df.values

class CatBoostClassifierStep(Step, CatBoostClassifier):
    def __init__(self, *args, name=None, n_outputs=1, **kwargs):
        super().__init__(*args, name=name, n_outputs=n_outputs, **kwargs)

    def __hash__(self):
        return hash(super().name)

x = Input()
y = Input()

xgbStep = CatBoostClassifierStep()(x, y)
model = Model(x, xgbStep, y)
model.fit(dataset[:,0:8], dataset[:,8])

alegonz commented 4 years ago

@ragrawal Thank you for the bug report!

Indeed, that's a bug in Model.fit. I'll see what I can do about it; I think I can release a fix for it in 0.4.2. In the meantime, please use the workaround you pasted, which, though a bit cumbersome, is valid and seems to be the most sensible approach.
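
To illustrate why a name-based __hash__ makes the dictionary lookup work, here is a minimal sketch using a hypothetical NamedStep class (not baikal's Step). It assumes the name is unique per step and does not change while the object is used as a key. Hashing on the same field that __eq__ compares keeps the two methods consistent, which Python requires of hashable objects.

```python
class NamedStep:
    """Hypothetical step-like class identified by its name; not baikal's Step."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, NamedStep) and self.name == other.name

    def __hash__(self):
        # Hash on the field that __eq__ compares, so equal objects hash equally.
        return hash(self.name)


step = NamedStep("catboost_0")
fit_params_by_step = {step: {"verbose": False}}

# A step with the same name finds the stored parameters; another name falls back to {}.
found = fit_params_by_step.get(NamedStep("catboost_0"), {})
missing = fit_params_by_step.get(NamedStep("other"), {})
print(found, missing)
```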

ragrawal commented 4 years ago

Hi Alegonz, I just found out that the above solution doesn't work well with serialization. If I serialize my trained model and then try to read it back, I get the following error: 'CatBoostClassifierStep' object has no attribute '_nodes'
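
One plausible cause, which this thread does not confirm, is that one of the parent classes defines its own __getstate__ and __setstate__ and pickles only the fields it knows about. Attributes set by the other parent are then dropped on the round trip. The sketch below shows that general Python behavior with hypothetical classes (Booster, StepMixin, BoosterStep); none of them are baikal or CatBoost classes.

```python
import pickle


class Booster:
    """Hypothetical base class that pickles only its own fields."""

    def __init__(self, depth=6):
        self.depth = depth

    def __getstate__(self):
        return {"depth": self.depth}

    def __setstate__(self, state):
        self.depth = state["depth"]


class StepMixin:
    """Hypothetical mixin holding graph bookkeeping attributes."""

    def __init__(self, *args, name=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._name = name
        self._nodes = []


class BoosterStep(StepMixin, Booster):
    pass


original = BoosterStep(name="booster_0")
# pickle uses Booster.__getstate__ here, so _name and _nodes are never saved.
restored = pickle.loads(pickle.dumps(original))

has_nodes_before = hasattr(original, "_nodes")
has_nodes_after = hasattr(restored, "_nodes")
print(has_nodes_before, has_nodes_after)
```

If this is what happens here, the combined class would need to extend __getstate__ and __setstate__ so that the step's own attributes are saved and restored as well.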