bmurauer / pipelinehelper

scikit-helper to hot-swap pipeline elements
GNU General Public License v3.0

unable to call on grid.best_estimator_.get_params() parameters #11

Closed browshanravan closed 4 years ago

browshanravan commented 4 years ago

Apologies, but I wasn't sure if this qualified as a new issue or not.

When I take a parameter name from the output of grid.best_estimator_.get_params() and pass it as param_name to validation_curve(), I get one of the following errors, whichever combination I try.

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve

# var_smoothing is a float parameter; casting this range to int would collapse it to zeros
var_smoothing_range = np.linspace(1e-09, 1e-08, 10)

train_scores, test_scores = validation_curve(
    grid.best_estimator_,
    X, y,
    param_name='clf__selected_model__var_smoothing',
    param_range=var_smoothing_range,
    cv=5,
    scoring='roc_auc'
)

Error output for param_name = 'clf__selected_model__var_smoothing',

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-5dbe40e5195c> in <module>
     14     param_range = n_estimators_range,
     15     cv=5,
---> 16     scoring='roc_auc'
     17 )

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in validation_curve(estimator, X, y, param_name, param_range, groups, cv, scoring, n_jobs, pre_dispatch, verbose, error_score)
   1500         error_score=error_score)
   1501         # NOTE do not change order of iteration to allow one time cv splitters
-> 1502         for train, test in cv.split(X, y, groups) for v in param_range)
   1503     out = np.asarray(out)
   1504     n_params = len(param_range)

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1027             # remaining jobs.
   1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
   1030                 self._iterating = self._original_iterator is not None
   1031 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    845                 return False
    846             else:
--> 847                 self._dispatch(tasks)
    848                 return True
    849 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    763         with self._lock:
    764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
    766             # A job can complete so quickly than its callback is
    767             # called before we get here, causing self._jobs to

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    204     def apply_async(self, func, callback=None):
    205         """Schedule a func to be run"""
--> 206         result = ImmediateResult(func)
    207         if callback:
    208             callback(result)

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    568         # Don't delay the application, to avoid keeping the input
    569         # arguments in memory
--> 570         self.results = batch()
    571 
    572     def get(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
    254 
    255     def __reduce__(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
    254 
    255     def __reduce__(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    518             cloned_parameters[k] = clone(v, safe=False)
    519 
--> 520         estimator = estimator.set_params(**cloned_parameters)
    521 
    522     start_time = time.time()

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/pipeline.py in set_params(self, **kwargs)
    139         self
    140         """
--> 141         self._set_params('steps', **kwargs)
    142         return self
    143 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in _set_params(self, attr, **params)
     51                 self._replace_estimator(attr, name, params.pop(name))
     52         # 3. Step parameters and other initialisation arguments
---> 53         super().set_params(**params)
     54         return self
     55 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/base.py in set_params(self, **params)
    259 
    260         for key, sub_params in nested_params.items():
--> 261             valid_params[key].set_params(**sub_params)
    262 
    263         return self

TypeError: set_params() got an unexpected keyword argument 'selected_model__var_smoothing'

Error output for param_name = 'selected_model__var_smoothing',

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-4cb6f09232cb> in <module>
     14     param_range = n_estimators_range,
     15     cv=5,
---> 16     scoring='roc_auc'
     17 )

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in validation_curve(estimator, X, y, param_name, param_range, groups, cv, scoring, n_jobs, pre_dispatch, verbose, error_score)
   1500         error_score=error_score)
   1501         # NOTE do not change order of iteration to allow one time cv splitters
-> 1502         for train, test in cv.split(X, y, groups) for v in param_range)
   1503     out = np.asarray(out)
   1504     n_params = len(param_range)

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1027             # remaining jobs.
   1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
   1030                 self._iterating = self._original_iterator is not None
   1031 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    845                 return False
    846             else:
--> 847                 self._dispatch(tasks)
    848                 return True
    849 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    763         with self._lock:
    764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
    766             # A job can complete so quickly than its callback is
    767             # called before we get here, causing self._jobs to

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    204     def apply_async(self, func, callback=None):
    205         """Schedule a func to be run"""
--> 206         result = ImmediateResult(func)
    207         if callback:
    208             callback(result)

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    568         # Don't delay the application, to avoid keeping the input
    569         # arguments in memory
--> 570         self.results = batch()
    571 
    572     def get(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
    254 
    255     def __reduce__(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
    254 
    255     def __reduce__(self):

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    518             cloned_parameters[k] = clone(v, safe=False)
    519 
--> 520         estimator = estimator.set_params(**cloned_parameters)
    521 
    522     start_time = time.time()

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/pipeline.py in set_params(self, **kwargs)
    139         self
    140         """
--> 141         self._set_params('steps', **kwargs)
    142         return self
    143 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in _set_params(self, attr, **params)
     51                 self._replace_estimator(attr, name, params.pop(name))
     52         # 3. Step parameters and other initialisation arguments
---> 53         super().set_params(**params)
     54         return self
     55 

~/PycharmProjects/Data School/DS_Pandas_tut/venv/lib/python3.7/site-packages/sklearn/base.py in set_params(self, **params)
    250                                  'Check the list of available parameters '
    251                                  'with `estimator.get_params().keys()`.' %
--> 252                                  (key, self))
    253 
    254             if delim:

ValueError: Invalid parameter selected_model for estimator Pipeline(steps=[('preprosessor',
                 ColumnTransformer(transformers=[('N_Fimp',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer())]),
                                                  ['Age']),
                                                 ('C_Fimp',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder())]),
                                                  ['Sex', 'Embarked'])])),
                ('clf',
                 PipelineHelper(available_models={'ExtraTreesClassifier': ExtraTreesClassifier(n_jobs=-1,
                                                                                               random_state=42),
                                                  'GaussianNB': GaussianNB(),
                                                  'RandomForestClassifier': RandomForestClassifier(n_jobs=-1,
                                                                                                   random_state=42)},
                                selected_model=GaussianNB()))]). Check the list of available parameters with `estimator.get_params().keys()`.
bmurauer commented 4 years ago

Unfortunately, implementing the desired behavior in the current PipelineHelper would be non-trivial. I suggest that you add the parameter var_smoothing to the grid-search parameters and use the cv_results_ field of the grid object for the detailed numbers. The validation_curve method is essentially a scaled-down grid search, and I'm not sure why you would want to run it separately afterwards. Otherwise, to use validation_curve the way you describe, you must already know which model ended up in the best_estimator_ field (otherwise you can't know which parameter to test). But then you no longer need PipelineHelper at all.
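To illustrate the cv_results_ route, here is a minimal sketch. It uses synthetic data and a plain GaussianNB pipeline rather than the original PipelineHelper setup, so the step name clf and the parameter grid are illustrative, not taken from the issue's actual code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the original data
X, y = make_classification(n_samples=200, random_state=42)

# var_smoothing goes directly into the grid-search parameters
pipe = Pipeline([('clf', GaussianNB())])
param_grid = {'clf__var_smoothing': np.logspace(-11, -5, 7)}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

# cv_results_ already contains the per-value scores a validation
# curve would give you, without a second fitting pass
smoothing_values = grid.cv_results_['param_clf__var_smoothing']
mean_scores = grid.cv_results_['mean_test_score']
for v, s in zip(smoothing_values, mean_scores):
    print(f'var_smoothing={v:g}: mean ROC AUC = {s:.3f}')
```

Plotting mean_scores against smoothing_values gives the same kind of picture a validation curve would, from the search that was run anyway.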

browshanravan commented 4 years ago

That's fair enough. I was thinking more along the lines of plotting how a change in one particular parameter influences the validation curve, given that all the other parameters have been optimised, but it is not critical to my pipeline. Many thanks.
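For that kind of plot, once the winning model is known, validation_curve can be called on the bare estimator directly, bypassing PipelineHelper entirely. A minimal sketch on synthetic data, assuming GaussianNB was the selected model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the original data
X, y = make_classification(n_samples=200, random_state=42)

# Sweep var_smoothing on the bare estimator, so the plain
# parameter name works without any pipeline prefix
param_range = np.logspace(-11, -5, 7)
train_scores, test_scores = validation_curve(
    GaussianNB(), X, y,
    param_name='var_smoothing',
    param_range=param_range,
    cv=5,
    scoring='roc_auc',
)

# One row per parameter value, one column per CV fold
print(train_scores.shape, test_scores.shape)
```

The mean of each row of test_scores against param_range is the curve to plot; with a preprocessing step in front, the same call works on a Pipeline with the prefixed parameter name instead.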

bmurauer commented 4 years ago

I think the key point is that all the other parameters might no longer be optimal once you change one of them. If they were independent, one wouldn't need a grid search at all and could simply optimize the parameters one after another.

browshanravan commented 4 years ago

Agreed! great point :) 👍