DataCanvasIO / HyperGBM

A full pipeline AutoML tool for tabular data
https://hypergbm.readthedocs.io/
Apache License 2.0

How to get the parameters of the best model after ensembling #106

Open kilig000123 opened 4 months ago

kilig000123 commented 4 months ago

After using HyperGBM, my dataset improved substantially on several metrics. However, once the experiment finishes, I would like to know which specific models make up the aggregated best result, and what their detailed parameters are. So far I have only found the ensemble weights and the score. If there is a way to get this, please let me know; it is very important for my understanding of the model.

lixfz commented 4 months ago

If you can already find the weights and the score, then getting the details of each individual model is straightforward.

You can refer to the following example:

# run the experiment and get the final pipeline estimator
estimator = experiment.run()

# the last step of the pipeline is the ensemble model
ensembled = estimator.steps[-1][-1]
weights = ensembled.weights_
models = ensembled.estimators

# print each ensemble member together with its weight
for i, (w, m) in enumerate(zip(weights, models)):
    if m is not None:
        print('-' * 30)
        print(i, w, m)

The output looks like this:

------------------------------
0 0.55 HyperGBMEstimator(task=binary, reward_metric=precision, cv=True,
data_pipeline: DataFrameMapper(df_out=True,
                df_out_dtype_transforms=[(ColumnSelector(include:['object', 'string']),
                                          'int')],
                features=[(ColumnSelector(include:['object', 'string', 'category', 'bool']),
                           Pipeline(steps=[('categorical_imputer_0',
                                            SafeSimpleImputer(strategy='constant')),
                                           ('categorical_label_encoder_0',
                                            MultiLabelEncoder())])),
                          (ColumnSelector(include:number, exclude:timedelta),
                           Pipeline(steps=[('numeric_imputer_0',
                                            FloatOutputImputer(strategy='median')),
                                           ('numeric_log_standard_scaler_0',
                                            LogStandardScaler())]))],
                input_df=True)
gbm_model: CatBoostClassifierWrapper(learning_rate=0.5, depth=10, l2_leaf_reg=20, silent=True, n_estimators=200, random_state=55954, eval_metric='Precision')
)
------------------------------
4 0.4 HyperGBMEstimator(task=binary, reward_metric=precision, cv=True,
data_pipeline: DataFrameMapper(df_out=True,
                df_out_dtype_transforms=[(ColumnSelector(include:['object', 'string']),
                                          'int')],
                features=[(ColumnSelector(include:['object', 'string', 'category', 'bool']),
                           Pipeline(steps=[('categorical_imputer_0',
                                            SafeSimpleImputer(strategy='constant')),
                                           ('categorical_label_encoder_0',
                                            MultiLabelEncoder())])),
                          (ColumnSelector(include:number, exclude:timedelta),
                           Pipeline(steps=[('numeric_imputer_0',
                                            FloatOutputImputer(strategy='median')),
                                           ('numeric_robust_scaler_0',
                                            RobustScaler())]))],
                input_df=True)
gbm_model: LGBMClassifierWrapper(boosting_type='goss', early_stopping_rounds=10,
                      learning_rate=0.5, max_depth=5, n_estimators=200,
                      num_leaves=440, random_state=58258, reg_alpha=10,
                      reg_lambda=0.5, verbosity=-1)
)
------------------------------
...
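If you also want the raw hyperparameters of a single member, the repr printed above already lists them; the sketch below is one way to pull them out programmatically. It assumes the member exposes the underlying GBM model under an attribute matching the gbm_model label in the repr, and that the wrapped model follows the scikit-learn get_params() convention; both are assumptions rather than a documented API.

# Hedged sketch: inspect one ensemble member's hyperparameters.
# Attribute names are inferred from the printed repr and may differ by version.
member = models[0]
if member is not None:
    gbm = getattr(member, 'gbm_model', None)   # assumption: attribute matches the repr label
    if gbm is not None and hasattr(gbm, 'get_params'):
        print(type(gbm).__name__)
        print(gbm.get_params())                # scikit-learn style hyperparameters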
kilig000123 commented 4 months ago

In this information I see categorical_label_encoder_0. What exactly is this categorical encoding method?

lixfz commented 4 months ago

HyperGBM optimizes the full pipeline, from preprocessing through model training. categorical_label_encoder_0 is the preprocessing step applied to the categorical data. Taking the output of my example above:

(screenshot of the pipeline output shown above)

Here the categorical columns are encoded with MultiLabelEncoder, which comes from Hypernets and wraps LabelEncoder.
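Conceptually, that means fitting one LabelEncoder per categorical column and replacing the string values with integer codes. Below is a simplified illustration using scikit-learn directly; it is not the actual Hypernets MultiLabelEncoder implementation.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simplified illustration of per-column label encoding;
# not the actual Hypernets MultiLabelEncoder implementation.
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'M', 'L']})
encoders = {}
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le           # keep one fitted encoder per column
print(df)                        # string values replaced by integer codes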

For the detailed definition of HyperGBM's default search space, see the source files search_space.py and sklearn_ops.py.
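If you prefer to reuse the default space explicitly from Python rather than reading the source, something like the following may work; the module path hypergbm.search_space, the name search_space_general, and the search_space parameter are assumptions based on those file names and may differ across versions.

# Hedged sketch: pass HyperGBM's default search space to an experiment explicitly.
# Assumption: module/attribute names below match your installed version.
import pandas as pd
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general

train_data = pd.read_csv('train.csv')   # hypothetical dataset with a 'y' target column
experiment = make_experiment(train_data, target='y',
                             search_space=search_space_general)  # assumed parameter name
estimator = experiment.run()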