Closed IncubatorShokuhou closed 3 years ago
Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?
@xuyxu Actually I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports all ML algorithms with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, explainable boosting machine, etc., can be integrated easily.
Here is the example code:
```python
from pycaret.datasets import get_data
from pycaret.regression import *
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor

boston = get_data('boston')

# setup, data preprocessing
exp_name = setup(data=boston, target='medv', silent=True)

# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)
cfr = CascadeForestRegressor()
casforest = create_model(cfr)

# compare models
best_model = compare_models(include=[ngboost, casforest, "xgboost", "lightgbm"])

# save model
save_model(best_model, 'best_model')
```
During the integration, I ran into two errors: 1. Deep-Forest only accepts np.ndarray and cannot take a pd.DataFrame as input, which is easily fixed by https://github.com/LAMDA-NJU/Deep-Forest/pull/86 . 2. At line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is placed into a pd.DataFrame, and the error described above occurs, which is quite puzzling to me.
I guess there might be something wrong with the initialization. I would appreciate any suggestions.
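For what it's worth, the first error can also be worked around on the caller's side by unwrapping DataFrame-likes before they reach the estimator. A minimal sketch of that idea (the helper `coerce_X` and the `FakeFrame` stand-in are my own illustration, not Deep-Forest or PyCaret API; a real DataFrame would be unwrapped the same way via its `to_numpy` method):

```python
class FakeFrame:
    """Minimal stand-in for a pandas DataFrame (only the method used here)."""
    def __init__(self, rows):
        self.rows = rows

    def to_numpy(self):
        return [list(r) for r in self.rows]


def coerce_X(X):
    """Hypothetical caller-side coercion: unwrap DataFrame-likes into a plain
    array-like before passing them to an estimator that rejects DataFrames."""
    return X.to_numpy() if hasattr(X, "to_numpy") else X


print(coerce_X(FakeFrame([(1, 2), (3, 4)])))  # [[1, 2], [3, 4]]
print(coerce_X([1, 2, 3]))                    # [1, 2, 3] (passed through)
```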
Thanks for your kind explanations! I will take a look at your PR first ;-)
BTW, could you please tell me why a local implementation of `RandomForestClassifier` is used instead of `sklearn.ensemble.RandomForestClassifier` in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? And in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L91 , is `lgb = __import__("lightgbm.sklearn")` simply equivalent to `import lightgbm.sklearn as lgb`?
> why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used

`sklearn.ensemble.RandomForestClassifier` is too slow when fitted on large datasets with millions of samples.

> lgb = __import__("lightgbm.sklearn")

We prefer to treat lightgbm as a soft dependency. If we used `import lightgbm.sklearn as lgb` at the top of the module, the program would raise an ImportError whenever lightgbm is not installed, which is not what we want.
I see. So maybe I can write a simple GPU version for the three models using `cuML.ensemble.RandomForest` and `gpu_hist`?
The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).
OK, I see.
@xuyxu I think I have figured out the cause of this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L112 , pandas first checks whether the object has a `__len__` method. If it does, pandas tries to convert this list-like object (i.e. `CascadeForestRegressor()`) into a 1-dimensional numpy array of object dtype via `construct_1d_object_array_from_listlike` in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 . So the error actually occurs in

```python
result = np.empty(len(values), dtype="object")
result[:] = values
```

When numpy assigns `CascadeForestRegressor()` into the empty object array, `__getitem__` in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L540 is called with an out-of-range index, and the error occurs.
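This conversion path can be imitated without Deep-Forest or pandas: an object that defines `__len__` and `__getitem__` but no `__iter__` is consumed through Python's legacy iteration protocol, which probes ascending indexes and stops only when `__getitem__` raises IndexError. A minimal sketch (the `BuggyLayers` class is my own stand-in, not Deep-Forest code):

```python
class BuggyLayers:
    """Mimics CascadeForestRegressor's sequence behavior: __len__ and
    __getitem__, but out-of-range indexes raise ValueError, not IndexError."""

    def __len__(self):
        return 2

    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise ValueError(
                "The layer index should be in the range [0, 1], "
                "but got {} instead.".format(idx)
            )
        return "layer_{}".format(idx)


# The legacy iteration protocol calls __getitem__(0), (1), (2), ... and only
# an IndexError terminates it cleanly, so the ValueError escapes the loop:
try:
    list(BuggyLayers())
except ValueError as exc:
    print("iteration failed:", exc)
```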
Actually, the error can be reproduced more directly in another way:
```python
# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))

# now the model has 2 layers. Iterate over it.
for i, j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")
```
and here is the error:
```
i =
0
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=0, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
i =
1
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=1, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
      2     print("i = ")
      3     print(i)
      4     print("j = ")
      5     print(j)

.../site-packages/deepforest/cascade.py in __getitem__(self, index)
    518
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521
    522     def _get_n_output(self, y):

.../site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561         logger.debug("self.n_layers_ = " + str(self.n_layers_))
    562         logger.debug("layer_idx = " + str(layer_idx))
--> 563         raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.
```
Then I noticed that https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__setitem__ notes:

> Note: for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.

That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. Once I changed it, everything worked!
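The fix can be demonstrated with a minimal stand-in class (my own illustration, not the actual Deep-Forest patch): once the out-of-range branch raises IndexError, both a for loop and `list()` stop cleanly at the end of the layers.

```python
class FixedLayers:
    """Stand-in for a 2-layer cascade whose __getitem__ raises IndexError
    for out-of-range indexes, as the Python data model expects."""

    def __len__(self):
        return 2

    def __getitem__(self, idx):
        if not 0 <= idx < 2:
            raise IndexError(idx)  # IndexError ends legacy iteration cleanly
        return "layer_{}".format(idx)


# Iteration now terminates at the end of the sequence instead of blowing up:
for i, layer in enumerate(FixedLayers()):
    print(i, layer)
# 0 layer_0
# 1 layer_1
```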
I am going to create a PR and fix this error ASAP.
Describe the bug
`CascadeForestRegressor` somehow cannot be inserted into a DataFrame.

To Reproduce

Expected behavior
No error.

Additional context
This bug can be simply fixed if we change

```python
if not 0 <= layer_idx < self.n_layers_:
```

to

```python
if not 0 <= layer_idx <= self.n_layers_:
```

but I still don't know the cause of this error or whether this fix is correct.