Closed IncubatorShokuhou closed 3 years ago
Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?
@xuyxu Actually I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports all ML algorithms with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, explainable boosting machine, etc., can be integrated easily.
Here is the example code:
```python
from pycaret.datasets import get_data
from pycaret.regression import *
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor

boston = get_data('boston')

# setup, data preprocessing
exp_name = setup(data=boston, target='medv', silent=True)

# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)
cfr = CascadeForestRegressor()
casforest = create_model(cfr)

# compare models
best_model = compare_models(include=[ngboost, casforest, "xgboost", "lightgbm"])

# save model
save_model(best_model, 'best_model')
```
During the integration, I ran into two errors: 1. Deep-Forest only accepts np.ndarray and cannot take a pd.DataFrame as input, which is easily fixed by https://github.com/LAMDA-NJU/Deep-Forest/pull/86 . 2. At line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is placed into a pd.DataFrame, and the error described above occurs, which is quite puzzling to me.
I guess there might be something wrong with the initialization. I would appreciate any suggestions.
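For what it's worth, the first error can also be worked around on the caller's side by unwrapping DataFrame-likes before they reach the estimator. A minimal sketch of that idea (the helper `coerce_X` and the `FakeFrame` stand-in are my own illustration, not Deep-Forest or PyCaret API; a real DataFrame would be unwrapped the same way via its `to_numpy` method):

```python
class FakeFrame:
    """Minimal stand-in for a pandas DataFrame (only the method used here)."""
    def __init__(self, rows):
        self.rows = rows

    def to_numpy(self):
        return [list(r) for r in self.rows]


def coerce_X(X):
    """Hypothetical caller-side coercion: unwrap DataFrame-likes into a plain
    array-like before passing them to an estimator that rejects DataFrames."""
    return X.to_numpy() if hasattr(X, "to_numpy") else X


print(coerce_X(FakeFrame([(1, 2), (3, 4)])))  # [[1, 2], [3, 4]]
print(coerce_X([1, 2, 3]))                    # [1, 2, 3] (passed through)
```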
Thanks for your kind explanations! I will take a look at your PR first ;-)
BTW, could you please tell me why a local implementation of `RandomForestClassifier` is used instead of `sklearn.ensemble.RandomForestClassifier` in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? And in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L91 , is `lgb = __import__("lightgbm.sklearn")` simply equivalent to `import lightgbm.sklearn as lgb`?
> why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used

`sklearn.ensemble.RandomForestClassifier` is too slow when fitted on large datasets with millions of samples.

> lgb = __import__("lightgbm.sklearn")

We prefer to treat lightgbm as a soft dependency. If we used `import lightgbm.sklearn as lgb` at the top of the module, the program would raise an ImportError whenever lightgbm is not installed, which is not what we want.
I see. So maybe I can write a simple GPU version for the three models using `cuML.ensemble.RandomForest` and `gpu_hist`?
The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).
OK, I see.
@xuyxu I think I have figured out the cause of this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L112 , pandas first checks whether the object has a `__len__` method. If it does, pandas tries to convert this list-like object (i.e. `CascadeForestRegressor()`) into a 1-dimensional numpy array of object dtype via `construct_1d_object_array_from_listlike` in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 . So the error actually occurs in

```python
result = np.empty(len(values), dtype="object")
result[:] = values
```

When numpy assigns `CascadeForestRegressor()` into the empty object array, `__getitem__` in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L540 is called with an out-of-range index, and the error occurs.
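This conversion path can be imitated without Deep-Forest or pandas: an object that defines `__len__` and `__getitem__` but no `__iter__` is consumed through Python's legacy iteration protocol, which probes ascending indexes and stops only when `__getitem__` raises IndexError. A minimal sketch (the `BuggyLayers` class is my own stand-in, not Deep-Forest code):

```python
class BuggyLayers:
    """Mimics CascadeForestRegressor's sequence behavior: __len__ and
    __getitem__, but out-of-range indexes raise ValueError, not IndexError."""

    def __len__(self):
        return 2

    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise ValueError(
                "The layer index should be in the range [0, 1], "
                "but got {} instead.".format(idx)
            )
        return "layer_{}".format(idx)


# The legacy iteration protocol calls __getitem__(0), (1), (2), ... and only
# an IndexError terminates it cleanly, so the ValueError escapes the loop:
try:
    list(BuggyLayers())
except ValueError as exc:
    print("iteration failed:", exc)
```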
Actually, the error can be reproduced more directly in another way:
```python
# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))

# now the model has 2 layers. Iterate over it.
for i, j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")
```
and here is the error:
```
i =
0
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=0, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
i =
1
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=1, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
      2     print("i = ")
      3     print(i)
      4     print("j = ")
      5     print(j)

.../site-packages/deepforest/cascade.py in __getitem__(self, index)
    518
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521
    522     def _get_n_output(self, y):

.../site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561         logger.debug("self.n_layers_ = " + str(self.n_layers_))
    562         logger.debug("layer_idx = " + str(layer_idx))
--> 563         raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.
```
Then I noticed that https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__setitem__ notes:

> Note: for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.

That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. Once I changed it, everything worked!
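The fix can be demonstrated with a minimal stand-in class (my own illustration, not the actual Deep-Forest patch): once the out-of-range branch raises IndexError, both a for loop and `list()` stop cleanly at the end of the layers.

```python
class FixedLayers:
    """Stand-in for a 2-layer cascade whose __getitem__ raises IndexError
    for out-of-range indexes, as the Python data model expects."""

    def __len__(self):
        return 2

    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise IndexError(idx)  # IndexError ends legacy iteration cleanly
        return "layer_{}".format(idx)


# Iteration now terminates at the end of the sequence instead of blowing up:
for i, layer in enumerate(FixedLayers()):
    print(i, layer)
# 0 layer_0
# 1 layer_1
```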
I am going to create a PR and fix this error ASAP.
Describe the bug
`CascadeForestRegressor` somehow cannot be inserted into a DataFrame.

To Reproduce

Expected behavior
No error.

Additional context
This bug can be simply fixed if we change

```python
if not 0 <= layer_idx < self.n_layers_:
```

to

```python
if not 0 <= layer_idx <= self.n_layers_:
```

but I still don't know the cause of this error or whether this fix is correct.