ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Better repr / str for elm.pipeline.steps wrapped classes? #224

Open PeterDSteinberg opened 6 years ago

PeterDSteinberg commented 6 years ago

What can be done to improve the repr / str of the wrapped elm.pipeline.steps classes?

Currently this is a repr of a Pipeline from PR #221 (run from the elm/tests directory):

$ ipython -i test_xarray_cross_validation.py
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: pca_regress
Out[1]:
Pipeline(memory=None,
     steps=[('get_y', GetY(layer='y')), ('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('estimator', Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

This looks like sklearn's Pipeline.__repr__. See pca_regress._cls, which here is sklearn.pipeline.Pipeline; some methods, such as __repr__, are delegated by calling the method on pca_regress._cls with self as an argument. Should we add a note to the repr about elm.pipeline.steps / MLDataset?
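One possible shape for such a note, as a minimal sketch only. The Ridge class here is a plain-Python stand-in rather than the real scikit-learn estimator, and the Wrapped class and the wording of the note are hypothetical, not what elm.pipeline.steps actually does. It assumes the wrapper keeps the wrapped class in _cls and delegates __repr__ to it with self as the argument, as described above:

```python
class Ridge:
    """Stand-in for an estimator class (not the real sklearn Ridge)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def __repr__(self):
        return "Ridge(alpha=%r)" % (self.alpha,)


class Wrapped(Ridge):
    """Hypothetical wrapper that remembers the wrapped class as _cls."""

    _cls = Ridge

    def __repr__(self):
        # Delegate to the wrapped class's __repr__, passing self explicitly,
        # then label the result so the wrapping is visible to the reader.
        inner = self._cls.__repr__(self)
        return "<elm.pipeline.steps wrapper: %s>" % inner


step = Wrapped(alpha=2.0)
text = repr(step)
print(text)
```

The inner repr is left untouched, so anything that already reads the parameters from it keeps working; only the outer label is new.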

PeterDSteinberg commented 6 years ago

Also note that the delegation pattern mentioned above makes some scikit-learn exception strings less informative than they could be, because they use self.__class__.__name__ rather than self._cls.__name__. The exception below shows the problem: when a Pipeline raises an error that names Wrapped (one of its steps), it may be hard to tell which transformer/estimator is at fault:

self = Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
params = {'copy_x': False, 'fit_intercept': False, 'normalize': False}
valid_params = {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, ...}, key = 'copy_x', value = False, split = ['copy_x']

    def set_params(self, **params):
        """Set the parameters of this estimator.

            The method works on simple estimators as well as on nested objects
            (such as pipelines). The latter have parameters of the form
            ``<component>__<parameter>`` so that it's possible to update each
            component of a nested object.

            Returns
            -------
            self
            """
        if not params:
            # Simple optimization to gain speed (inspect is slow)
            return self
        valid_params = self.get_params(deep=True)
        for key, value in six.iteritems(params):
            split = key.split('__', 1)
            if len(split) > 1:
                # nested objects case
                name, sub_name = split
                if name not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
                                     (name, self))
                sub_object = valid_params[name]
                sub_object.set_params(**{sub_name: value})
            else:
                # simple objects case
                if key not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
>                                    (key, self.__class__.__name__))
E                   ValueError: Invalid parameter copy_x for estimator Wrapped. Check the list of available parameters with `estimator.get_params().keys()`.
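One way the class name in that message could be made informative, sketched under assumptions: if the wrapper classes are created dynamically, they can be given the wrapped class's own name, so code that reads self.__class__.__name__ reports Ridge instead of Wrapped. The wrap factory below is hypothetical, and Ridge is again a plain-Python stand-in whose set_params only imitates the simple-objects branch of the traceback above:

```python
class Ridge:
    """Stand-in estimator whose errors use self.__class__.__name__."""

    def __init__(self, alpha=1.0, copy_X=True):
        self.alpha = alpha
        self.copy_X = copy_X

    def get_params(self):
        return {"alpha": self.alpha, "copy_X": self.copy_X}

    def set_params(self, **params):
        valid_params = self.get_params()
        for key, value in params.items():
            if key not in valid_params:
                raise ValueError(
                    "Invalid parameter %s for estimator %s."
                    % (key, self.__class__.__name__)
                )
            setattr(self, key, value)
        return self


def wrap(cls):
    """Hypothetical factory: subclass cls, reuse its name, keep it as _cls."""
    return type(cls.__name__, (cls,), {"_cls": cls})


WrappedRidge = wrap(Ridge)

try:
    # Same mistake as in the traceback: copy_x instead of copy_X.
    WrappedRidge().set_params(copy_x=False)
except ValueError as exc:
    message = str(exc)

print(message)
```

The trade-off is that the wrapper and the wrapped class then share a name, so the repr or the module path would have to carry the distinction between the two.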