This is the script run that produced an unusually low R². I expected auto-sklearn to score higher than ordinary multiple regression (R² = .55), but it returned R² = .12. Does this indicate a poor use case for auto-sklearn, a need for parameter or hyperparameter adjustments, or some other error in how I'm using it?
15:11:48 PRIVATE python3 eluellen-sklearn.py
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.metrics.classification module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
warnings.warn(message, FutureWarning)
Samples = 2619, Features = 40
X_train = [[4.96200e+04 1.71090e+04 3.44800e-01 ... 1.70000e+04 3.26200e+04
6.57400e-01]
[5.95000e+04 0.00000e+00 0.00000e+00 ... 5.95000e+04 0.00000e+00
0.00000e+00]
[4.65400e+04 4.65400e+04 1.00000e+00 ... 1.15400e+04 3.50000e+04
7.52000e-01]
...
[5.25800e+04 3.14100e+04 5.97400e-01 ... 2.25800e+04 3.00000e+04
5.70600e-01]
[6.46150e+04 6.27120e+04 9.70500e-01 ... 9.61500e+03 5.50000e+04
8.51200e-01]
[5.25800e+04 2.90390e+04 5.52300e-01 ... 2.22230e+04 3.03575e+04
5.77400e-01]], y_train = [1. 0. 1. ... 1. 1. 1.]
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
FutureWarning)
[WARNING] [2020-02-18 15:11:58,185:AutoMLSMBO(1)::cb28bbd020a0a08a3c17168f19c8aaae] Could not find meta-data directory /usr/local/lib/python3.6/dist-packages/autosklearn/metalearning/files/r2_regression_dense
[WARNING] [2020-02-18 15:11:58,212:EnsembleBuilder(1):cb28bbd020a0a08a3c17168f19c8aaae] No models better than random - using Dummy Score!
[WARNING] [2020-02-18 15:11:58,224:EnsembleBuilder(1):cb28bbd020a0a08a3c17168f19c8aaae] No models better than random - using Dummy Score!
[WARNING] [2020-02-18 15:12:00,228:EnsembleBuilder(1):cb28bbd020a0a08a3c17168f19c8aaae] No models better than random - using Dummy Score!
[(0.340000, SimpleRegressionPipeline({'categorical_encoding:choice': 'one_hot_encoding', 'imputation:strategy': 'median', 'preprocessor:choice': 'extra_trees_preproc_for_regression', 'regressor:choice': 'ridge_regression', 'rescaling:choice': 'quantile_transformer', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:extra_trees_preproc_for_regression:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_regression:criterion': 'mae', 'preprocessor:extra_trees_preproc_for_regression:max_depth': 'None', 'preprocessor:extra_trees_preproc_for_regression:max_features': 0.8215479502881777, 'preprocessor:extra_trees_preproc_for_regression:max_leaf_nodes': 'None', 'preprocessor:extra_trees_preproc_for_regression:min_samples_leaf': 11, 'preprocessor:extra_trees_preproc_for_regression:min_samples_split': 9, 'preprocessor:extra_trees_preproc_for_regression:min_weight_fraction_leaf': 0.0, 'preprocessor:extra_trees_preproc_for_regression:n_estimators': 100, 'regressor:ridge_regression:alpha': 4.563743442447699, 'regressor:ridge_regression:fit_intercept': 'True', 'regressor:ridge_regression:tol': 4.8339309027613326e-05, 'rescaling:quantile_transformer:n_quantiles': 572, 'rescaling:quantile_transformer:output_distribution': 'uniform', 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.022216999044307732},
dataset_properties={
'task': 4,
'sparse': False,
'multilabel': False,
'multiclass': False,
'target_type': 'regression',
'signed': False})),
(0.340000, SimpleRegressionPipeline({'categorical_encoding:choice': 'one_hot_encoding', 'imputation:strategy': 'most_frequent', 'preprocessor:choice': 'fast_ica', 'regressor:choice': 'extra_trees', 'rescaling:choice': 'minmax', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'False', 'preprocessor:fast_ica:algorithm': 'parallel', 'preprocessor:fast_ica:fun': 'logcosh', 'preprocessor:fast_ica:whiten': 'False', 'regressor:extra_trees:bootstrap': 'False', 'regressor:extra_trees:criterion': 'friedman_mse', 'regressor:extra_trees:max_depth': 'None', 'regressor:extra_trees:max_features': 0.343851332296278, 'regressor:extra_trees:max_leaf_nodes': 'None', 'regressor:extra_trees:min_impurity_decrease': 0.0, 'regressor:extra_trees:min_samples_leaf': 14, 'regressor:extra_trees:min_samples_split': 5, 'regressor:extra_trees:n_estimators': 100},
dataset_properties={
'task': 4,
'sparse': False,
'multilabel': False,
'multiclass': False,
'target_type': 'regression',
'signed': False})),
(0.260000, SimpleRegressionPipeline({'categorical_encoding:choice': 'one_hot_encoding', 'imputation:strategy': 'mean', 'preprocessor:choice': 'no_preprocessing', 'regressor:choice': 'random_forest', 'rescaling:choice': 'standardize', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'regressor:random_forest:bootstrap': 'True', 'regressor:random_forest:criterion': 'mse', 'regressor:random_forest:max_depth': 'None', 'regressor:random_forest:max_features': 1.0, 'regressor:random_forest:max_leaf_nodes': 'None', 'regressor:random_forest:min_impurity_decrease': 0.0, 'regressor:random_forest:min_samples_leaf': 1, 'regressor:random_forest:min_samples_split': 2, 'regressor:random_forest:min_weight_fraction_leaf': 0.0, 'regressor:random_forest:n_estimators': 100, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.01},
dataset_properties={
'task': 4,
'sparse': False,
'multilabel': False,
'multiclass': False,
'target_type': 'regression',
'signed': False})),
(0.040000, SimpleRegressionPipeline({'categorical_encoding:choice': 'one_hot_encoding', 'imputation:strategy': 'most_frequent', 'preprocessor:choice': 'fast_ica', 'regressor:choice': 'ridge_regression', 'rescaling:choice': 'standardize', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:fast_ica:algorithm': 'deflation', 'preprocessor:fast_ica:fun': 'exp', 'preprocessor:fast_ica:whiten': 'True', 'regressor:ridge_regression:alpha': 1.3608642297867532e-05, 'regressor:ridge_regression:fit_intercept': 'True', 'regressor:ridge_regression:tol': 0.002596874543719601, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.00017348437847697216, 'preprocessor:fast_ica:n_components': 1058},
dataset_properties={
'task': 4,
'sparse': False,
'multilabel': False,
'multiclass': False,
'target_type': 'regression',
'signed': False})),
(0.020000, SimpleRegressionPipeline({'categorical_encoding:choice': 'no_encoding', 'imputation:strategy': 'median', 'preprocessor:choice': 'select_percentile_regression', 'regressor:choice': 'ridge_regression', 'rescaling:choice': 'quantile_transformer', 'preprocessor:select_percentile_regression:percentile': 82.56436225708288, 'preprocessor:select_percentile_regression:score_func': 'mutual_info', 'regressor:ridge_regression:alpha': 1.6259354959848533, 'regressor:ridge_regression:fit_intercept': 'True', 'regressor:ridge_regression:tol': 0.005858793476627702, 'rescaling:quantile_transformer:n_quantiles': 431, 'rescaling:quantile_transformer:output_distribution': 'normal'},
dataset_properties={
'task': 4,
'sparse': False,
'multilabel': False,
'multiclass': False,
'target_type': 'regression',
'signed': False})),
]
R2 score: 0.12086525801756198
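For context on the score being compared: the R² reported above is the coefficient of determination, where 1.0 is a perfect fit and 0.0 is no better than always predicting the mean of the targets. The snippet below is a simplified, illustrative 1-D version of that metric in pure Python, not scikit-learn's actual implementation:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Simplified 1-D sketch of the metric sklearn reports;
    assumes y_true is not constant (SS_tot != 0)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Perfect predictions score 1.0; always predicting the mean scores 0.0.
print(r2_score([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0]))  # 1.0
print(r2_score([1.0, 0.0, 1.0, 1.0], [0.75] * 4))            # 0.0
```

So an R² of .12 means the ensemble explains only about 12% of the variance in the targets, versus the 55% explained by the multiple regression baseline.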
real	1m58.008s
user	2m17.253s
sys	0m12.919s