ClimbsRocks / auto_ml

[UNMAINTAINED] Automated machine learning for analytics & production
http://auto-ml.readthedocs.io
MIT License

model_names - some remarks #232

Open mglowacki100 opened 7 years ago

mglowacki100 commented 7 years ago
1. In the documentation (http://auto-ml.readthedocs.io/en/latest/api_docs_for_geeks.html) I'd list the models for regression and classification separately, e.g. models for regression:
    
    model_names=[
                           'ARDRegression', #slow
                           'AdaBoostRegressor', 
                           'BayesianRidge', 
                           'ElasticNet', 
                           'ExtraTreesRegressor',  
                           'GradientBoostingRegressor', 
                           'Lasso', 
                           'LassoLars', 
                           'LinearRegression',   
                           'OrthogonalMatchingPursuit', 
                           'PassiveAggressiveRegressor', 
                           'RANSACRegressor', 
                           'RandomForestRegressor', 
                           'Ridge', 
                           'SGDRegressor', 
                            #non-scikit models:
                           'DeepLearningRegressor', #gpu support
                           #'LGBMRegressor', #!!!problem key
                           'XGBRegressor' #!!!problem
                           ]
2. I've noticed that 'XGBRegressor' requires the following settings (or I'm doing something wrong):
**`optimize_final_model=False`**, otherwise I get this error:

About to run GridSearchCV on the pipeline for the model XGBRegressor to predict y
Fitting 2 folds for each of 8 candidates, totalling 16 fits
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
  File "/home/mglowacki/Desktop/Mercedes/automl_merc.py", line 62, in <module>
    'XGBRegressor' #!!!problem
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 469, in train
    self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 704, in train_ml_estimator
    gscv_results = self.fit_grid_search(X_df, y, grid_search_params, feature_learning=feature_learning)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 630, in fit_grid_search
    gs.fit(X_df, y)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 564, in _fit
    for parameters in parameter_iterable
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
    self.retrieve()
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
    raise exception
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 682, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
    put(task)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/pool.py", line 371, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
_pickle.PicklingError: Can't pickle <class 'xgboost.sklearn.XGBRegressor'>: it's not the same object as xgboost.sklearn.XGBRegressor


**`ml_for_analytics=False`**, otherwise I get this error:

Here are the results from our XGBRegressor predicting y
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 944, in _print_ml_analytics_results_random_forest
    trained_feature_importances = final_model_obj.model.feature_importances_
AttributeError: 'XGBRegressor' object has no attribute 'feature_importances_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
  File "/home/mglowacki/Desktop/Mercedes/automl_merc.py", line 62, in <module>
    'XGBRegressor' #!!!problem
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 469, in train
    self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 671, in train_ml_estimator
    trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 553, in fit_single_pipeline
    self.print_results(model_name)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 577, in print_results
    self._print_ml_analytics_results_random_forest()
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 947, in _print_ml_analytics_results_random_forest
    trained_feature_importances = final_model_obj.model.feature_importance_
AttributeError: 'XGBRegressor' object has no attribute 'feature_importance_'
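
The `PicklingError` in the first traceback ("it's not the same object as ...") is what `pickle` reports when the class it finds by looking up the module path is a different object from the class of the instance being pickled. One way this can happen is a module being re-executed or reloaded while older instances are still alive. Whether that is what happened in this session is an assumption; the traceback does not confirm it. A minimal standard-library sketch, where `fake_xgb` and `Model` are hypothetical stand-ins and nothing touches the real xgboost package:

```python
import pickle
import sys
import types

# Hypothetical stand-in module; this is not the real xgboost package.
fake_xgb = types.ModuleType("fake_xgb")
sys.modules["fake_xgb"] = fake_xgb

CLASS_SOURCE = "class Model:\n    pass\n"

# First "import": define the class and create an instance of it.
exec(CLASS_SOURCE, fake_xgb.__dict__)
old_instance = fake_xgb.Model()

# Simulated reload: the same source runs again and rebinds fake_xgb.Model
# to a brand-new class object, while old_instance keeps the old class.
exec(CLASS_SOURCE, fake_xgb.__dict__)


def can_pickle(obj):
    """Return True if obj pickles cleanly, False on PicklingError."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False


stale_ok = can_pickle(old_instance)      # class identity no longer matches
fresh_ok = can_pickle(fake_xgb.Model())  # instance of the current class
print(stale_ok, fresh_ok)
```

If a reload of this kind were the cause, restarting the interpreter (or running the script in a fresh process) before training would be a reasonable thing to try.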

ClimbsRocks commented 7 years ago

Thanks for filing the issue! I'll look into it in more depth later, but as a quick fix, try upgrading XGBoost. `feature_importances_` is a relatively new attribute.

If that doesn't do it, it's likely that they changed their API without a deprecation notice. The build I have on continuous integration passes all the tests. But that relies on a Docker image where we built xgboost a few weeks back.

Let me know if updating to their latest (and auto_ml's latest) version works. If not, I'll look into this more.

Thanks for the detailed issue! Would love to hear any other feedback you have.
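
Since the attribute only exists in newer XGBoost builds, code that reports feature importances can look it up defensively instead of assuming it is there. The sketch below is an illustration only, not auto_ml's actual code; `get_feature_importances`, `OldStyleModel`, and `NewStyleModel` are hypothetical names. It tries the plural attribute first and then the singular spelling that appears in the second traceback.

```python
def get_feature_importances(model):
    """Return the model's feature importances, or None if it exposes none."""
    for attr_name in ("feature_importances_", "feature_importance_"):
        # getattr with a default avoids the AttributeError seen above
        importances = getattr(model, attr_name, None)
        if importances is not None:
            return importances
    return None


class OldStyleModel:
    """Hypothetical model that exposes no importances at all."""


class NewStyleModel:
    """Hypothetical model with the scikit-learn-style attribute."""
    feature_importances_ = [0.7, 0.2, 0.1]


for candidate in (OldStyleModel(), NewStyleModel()):
    importances = get_feature_importances(candidate)
    if importances is None:
        print(type(candidate).__name__, "exposes no feature importances")
    else:
        print(type(candidate).__name__, importances)
```

Returning `None` lets the caller skip the analytics printout rather than abort training.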


mglowacki100 commented 7 years ago

Thanks for the help! It was necessary to build xgboost from source (the pip version is not fresh enough). I have auto_ml 2.1.9. The xgboost update solves the problem with feature_importances_ for ml_for_analytics=True, but with optimize_final_model=True I get a different error than before:

********************************************************************************************
About to run GridSearchCV on the pipeline for the model XGBRegressor to predict y
Fitting 2 folds for each of 8 candidates, totalling 16 fits
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
  File "/home/mglowacki/Desktop/Mercedes/automl_merc.py", line 63, in <module>
    'XGBRegressor' #!!!problem
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 469, in train
    self.trained_final_model = self.train_ml_estimator(estimator_names, self._scorer, X_df, y)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 704, in train_ml_estimator
    gscv_results = self.fit_grid_search(X_df, y, grid_search_params, feature_learning=feature_learning)
  File "/usr/local/lib/python3.5/dist-packages/auto_ml/predictor.py", line 630, in fit_grid_search
    gs.fit(X_df, y)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 550, in _fit
    base_estimator = clone(self.estimator)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/base.py", line 69, in clone
    new_object_params[name] = clone(param, safe=False)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/base.py", line 126, in clone
    (estimator, name))
RuntimeError: Cannot clone object XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=200,
       n_jobs=1, nthread=-1, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
       subsample=1), as the constructor does not seem to set parameter n_jobs

There is a similar issue, but without a solution: https://stackoverflow.com/questions/37646034/scikit-learn-cannot-clone-object-as-the-constructor-does-not-seem-to-set-par
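
The "Cannot clone object" error comes from a round-trip check: the estimator is rebuilt from its own parameters, and the rebuilt object must report the same parameters back. The sketch below is a simplified pure-Python model of that kind of check, not scikit-learn's actual implementation. `Misbehaving` is a hypothetical estimator whose constructor lets an alias parameter override `n_jobs`; it is not a claim about what XGBoost's constructor does.

```python
import inspect


def get_params(estimator):
    """Read each constructor parameter back from the instance attributes."""
    signature = inspect.signature(type(estimator).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(estimator, name, None) for name in names}


def simple_clone(estimator):
    """Rebuild an estimator from its params and verify the round trip."""
    params = get_params(estimator)
    new_estimator = type(estimator)(**params)
    new_params = get_params(new_estimator)
    for name, value in params.items():
        if new_params[name] is not value:
            raise RuntimeError(
                "Cannot clone object %r, as the constructor does not "
                "seem to set parameter %s" % (estimator, name))
    return new_estimator


class WellBehaved:
    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs  # stored exactly as passed


class Misbehaving:
    # Hypothetical: when nthread is given, it silently replaces n_jobs.
    def __init__(self, n_jobs=1, nthread=None):
        self.nthread = nthread
        self.n_jobs = nthread if nthread is not None else n_jobs


print(simple_clone(WellBehaved(n_jobs=4)).n_jobs)

estimator = Misbehaving()
estimator.nthread = -1  # set after construction, as a set_params-style call would
try:
    simple_clone(estimator)
except RuntimeError as error:
    print(error)
```

Here the mismatch only shows up once an attribute is changed after construction: rebuilding through the constructor then yields a different `n_jobs` than the one read from the original object.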

ClimbsRocks commented 7 years ago

Looks like XGBoost made a change that sklearn's GSCV isn't liking. I'm hoping @gaw89 might have an easy fix in XGBoost for this. In the meantime, you can try checking out the code before this commit and using that.

Thanks for letting me know about this @mglowacki100 !

mglowacki100 commented 7 years ago

I've attached my code and dataset: my_dataset.tar.gz

import numpy as np
import pandas as pd
from auto_ml import Predictor

df_train = pd.read_csv('train_A.csv')
df_test = pd.read_csv('test_A.csv')
#y_train = train['y']
#X_train = train.drop('y', axis=1)
#X_train = X_train
#X_test = test

column_descriptions = {
    'y': 'output',
    'ID': 'ignore',
    'X0': 'categorical',
    'X1': 'categorical',
    'X2': 'categorical',
    'X3': 'categorical',
    'X4': 'categorical',
    'X5': 'categorical',
    'X6': 'categorical',
    'X7': 'categorical',
    'X8': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)

#ml_predictor.train(df_train)
ml_predictor.train(df_train, optimize_final_model=True, perform_feature_selection=True, 
                   take_log_of_y=False,
                   cv=2,
                   ml_for_analytics=False,
#                  #regressors
                   model_names=[
#                           'ARDRegression', #- !!!slow,
#                           'AdaBoostRegressor', 
#                           'BayesianRidge', 
#                           'ElasticNet', 
#                           'ExtraTreesRegressor',  
#                           'GradientBoostingRegressor', 
#                           'Lasso', 
#                           'LassoLars', 
#                           'LinearRegression',   
#                           'OrthogonalMatchingPursuit', 
#                           'PassiveAggressiveRegressor', 
#                           'RANSACRegressor', 
#                           'RandomForestRegressor', 
#                           'Ridge', 
#                           'SGDRegressor', 
#                            #non-scikit models:
                           #'DeepLearningRegressor', #gpu support
                           #'LGBMRegressor', 
                             'XGBRegressor' #!!!problem
                           ]

#                   #classifiers       
#                   models=[ 'AdaBoostClassifier', 
#                           'ExtraTreesClassifier', 
#                           'GradientBoostingClassifier', 
#                           'LogisticRegression', 'MiniBatchKMeans', 
#                           'OrthogonalMatchingPursuit', 'PassiveAggressiveClassifier', 
#                           'Perceptron', 
#                           'RandomForestClassifier', 
#                           'RidgeClassifier', 'SGDClassifier',  #more_models
#                           'DeepLearningClassifier',  
#                           'LGBMClassifier','XGBClassifier']
                   )

predictions = ml_predictor.predict(df_test)
np.savetxt("file_name.csv", predictions, delimiter=",", fmt='%s', header='')