flennerhag / mlens

ML-Ensemble – high performance ensemble learning
http://ml-ensemble.com
MIT License

Bug? ``mlens.utils.check_instances`` returns wrong pipeline orders #122

Open lunluen opened 5 years ago

lunluen commented 5 years ago

If I understand correctly, check_instances groups transformers of the same type together, regardless of their order in the input list. As a result, the output of make_group and BaseEnsemble is also wrong.

Reproduce - Case 1

Code

from mlens.utils import check_instances
from sklearn.preprocessing import (
    StandardScaler as SS, FunctionTransformer as FT)
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=10)
def f1(): return 1
def f2(): return 2

estimators = [clf]
preprocessing = [FT(f1), SS(), FT(f2)]

check_instances(estimators, preprocessing)[0][0][1]

Result

[('functiontransformer-1',
  FunctionTransformer(accept_sparse=False, check_inverse=True,
                      func=<function f1 at 0x000002532FDB0C80>, inv_kw_args=None,
                      inverse_func=None, kw_args=None, pass_y='deprecated',
                      validate=None)),
 ('functiontransformer-2',
  FunctionTransformer(accept_sparse=False, check_inverse=True,
                      func=<function f2 at 0x000002532FDB0D90>, inv_kw_args=None,
                      inverse_func=None, kw_args=None, pass_y='deprecated',
                      validate=None)),
 ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]

Expected Behavior

The order should be [FT(f1), SS(), FT(f2)] rather than [FT(f1), FT(f2), SS()].


Reproduce - Case 2

Code

from mlens.utils import check_instances
from sklearn.base import clone
from sklearn.preprocessing import (
    StandardScaler as SS, FunctionTransformer as FT)
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=10)
def f1(): return 1
def f2(): return 2

estimators = {
    'key-1': [clone(clf)],
    'key-2': [clone(clf)],
}
preprocessing = {
    'key-1': [FT(f2), SS(), FT(f1)],
    'key-2': [SS(), FT(f2), FT(f1)],
}

check_instances(estimators, preprocessing)[0]

Result

[('key-1',
  [('functiontransformer-1',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f2 at 0x0000025336E1C840>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('functiontransformer-2',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f1 at 0x0000025336E1C9D8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('standardscaler-1',
    StandardScaler(copy=True, with_mean=True, with_std=True))]),
 ('key-2',
  [('functiontransformer-3',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f2 at 0x0000025336E1C840>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('functiontransformer-4',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f1 at 0x0000025336E1C9D8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('standardscaler-2',
    StandardScaler(copy=True, with_mean=True, with_std=True))])]

Expected Behavior

key-1 should be [FT(f2), SS(), FT(f1)] rather than [FT(f2), FT(f1), SS()]
key-2 should be [FT(f2), SS(), FT(f1)] rather than [FT(f2), FT(f1), SS()]

lunluen commented 5 years ago

Sorry, the last line should read: key-2 should be [SS(), FT(f2), FT(f1)] rather than [FT(f2), FT(f1), SS()].

lunluen commented 5 years ago

>>> import mlens
>>> mlens.__version__
'0.2.3'

flennerhag commented 5 years ago

Hey! Thanks for flagging this, it definitely looks like a bug. The pipeline is being reordered by name: the offending line seems to be the sorted() call applied to the output of _format_instances, on L99, which sorts the list of named preprocessors by their names.
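
A minimal sketch of the effect described above, assuming a naming scheme like the one visible in the reported output (lowercase class name, numbered when a class repeats). The name_instances helper and the two stand-in classes are hypothetical and are not mlens or scikit-learn code. The sketch only shows that sorting (name, instance) pairs by name pulls same-named steps together.

```python
from collections import Counter


def name_instances(instances):
    """Label each instance with its lowercase class name.

    Hypothetical helper: it only imitates the names seen in the reported
    output and is not the mlens implementation.
    """
    totals = Counter(type(inst).__name__.lower() for inst in instances)
    seen = Counter()
    named = []
    for inst in instances:
        base = type(inst).__name__.lower()
        seen[base] += 1
        # Number the name only when the class occurs more than once.
        name = "%s-%d" % (base, seen[base]) if totals[base] > 1 else base
        named.append((name, inst))
    return named


class FunctionTransformer:  # stand-in, not the scikit-learn class
    pass


class StandardScaler:  # stand-in, not the scikit-learn class
    pass


pipeline = [FunctionTransformer(), StandardScaler(), FunctionTransformer()]
named = name_instances(pipeline)

# Order as given by the user.
in_order = [name for name, _ in named]
# Order after sorting the named steps by name.
after_sort = [name for name, _ in sorted(named, key=lambda pair: pair[0])]

print(in_order)
print(after_sort)
```

In this sketch the first list keeps the scaler between the two function transformers, while the sorted list moves it to the end, which matches the ordering reported in Case 1.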

I'll try to fix this some time this week, but feel free to make a PR.

lunluen commented 5 years ago

Hi @flennerhag, thanks for the reply. And... I found another bug here. :astonished:

Code

from mlens.utils import check_instances
from sklearn.base import clone
from sklearn.preprocessing import (
    StandardScaler as SS, FunctionTransformer as FT)
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=10)
def f1(): return 1
def f2(): return 2

s = SS()

estimators = {
    'key-1': [clone(clf)],
    'key-2': [clone(clf)],
}
preprocessing = {
    'key-1': [FT(f2), s, FT(f1)],
    'key-2': [s, FT(f2), FT(f1)],
}

check_instances(estimators, preprocessing)[0]

Result

[('key-1',
  [('functiontransformer-1',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f2 at 0x000001C9B3C389D8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('functiontransformer-2',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f1 at 0x000001C9B3C38BF8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None))]),
 ('key-2',
  [('functiontransformer-3',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f2 at 0x000001C9B3C389D8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('functiontransformer-4',
    FunctionTransformer(accept_sparse=False, check_inverse=True,
                        func=<function f1 at 0x000001C9B3C38BF8>, inv_kw_args=None,
                        inverse_func=None, kw_args=None, pass_y='deprecated',
                        validate=None)),
   ('standardscaler-1',
    StandardScaler(copy=True, with_mean=True, with_std=True)),
   ('standardscaler-2',
    StandardScaler(copy=True, with_mean=True, with_std=True))])]

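The two references to the same StandardScaler instance (s) are grouped together: key-1 loses its scaler, and key-2 ends up with both standardscaler-1 and standardscaler-2.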
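
A small diagnostic sketch for spotting this situation before calling check_instances. It assumes preprocessing is a dict of lists, as in the code above. The shared_instances helper and the Scaler stand-in class are hypothetical names used for illustration and are not part of mlens.

```python
from copy import deepcopy


def shared_instances(preprocessing):
    """Map object id -> keys, for objects referenced more than once."""
    owners = {}
    for key, steps in preprocessing.items():
        for step in steps:
            owners.setdefault(id(step), []).append(key)
    return {oid: keys for oid, keys in owners.items() if len(keys) > 1}


class Scaler:  # stand-in for any transformer
    pass


s = Scaler()

# The same object is reused under both keys, as in the report above.
shared = {"key-1": [s], "key-2": [s]}

# Give every pipeline its own copy of each step.
separate = {
    key: [deepcopy(step) for step in steps]
    for key, steps in shared.items()
}

print(list(shared_instances(shared).values()))
print(shared_instances(separate))
```

In Case 2, where each pipeline received its own SS() object, every key kept one scaler. Giving each pipeline a separate copy may therefore sidestep the grouping until it is fixed, although that is an inference from the outputs above and not a confirmed workaround.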