DataCanvasIO / HyperGBM

A full pipeline AutoML tool for tabular data
https://hypergbm.readthedocs.io/
Apache License 2.0

How to get the parameters of the best model after ensembling #106

Open kilig000123 opened 4 months ago

kilig000123 commented 4 months ago

After using HyperGBM, my dataset improved substantially on several metrics. However, once the experiment finishes, I would like to know which specific models make up the aggregated best result, and what their detailed parameters are. So far I have only found the ensemble weights and the score. If there is a way to get this, please let me know; it is very important for my understanding of the model.

lixfz commented 4 months ago

If you can already find the weights and the score, then getting the details of each individual model is straightforward.

You can refer to the following example:

# run the experiment and get the final pipeline estimator
estimator = experiment.run()

# the last step of the pipeline is the ensemble model
ensembled = estimator.steps[-1][-1]
weights = ensembled.weights_
models = ensembled.estimators

# print each ensemble member together with its weight
for i, (w, m) in enumerate(zip(weights, models)):
    if m is not None:
        print('-' * 30)
        print(i, w, m)

The output looks like this:

------------------------------
0 0.55 HyperGBMEstimator(task=binary, reward_metric=precision, cv=True,
data_pipeline: DataFrameMapper(df_out=True,
                df_out_dtype_transforms=[(ColumnSelector(include:['object', 'string']),
                                          'int')],
                features=[(ColumnSelector(include:['object', 'string', 'category', 'bool']),
                           Pipeline(steps=[('categorical_imputer_0',
                                            SafeSimpleImputer(strategy='constant')),
                                           ('categorical_label_encoder_0',
                                            MultiLabelEncoder())])),
                          (ColumnSelector(include:number, exclude:timedelta),
                           Pipeline(steps=[('numeric_imputer_0',
                                            FloatOutputImputer(strategy='median')),
                                           ('numeric_log_standard_scaler_0',
                                            LogStandardScaler())]))],
                input_df=True)
gbm_model: CatBoostClassifierWrapper(learning_rate=0.5, depth=10, l2_leaf_reg=20, silent=True, n_estimators=200, random_state=55954, eval_metric='Precision')
)
------------------------------
4 0.4 HyperGBMEstimator(task=binary, reward_metric=precision, cv=True,
data_pipeline: DataFrameMapper(df_out=True,
                df_out_dtype_transforms=[(ColumnSelector(include:['object', 'string']),
                                          'int')],
                features=[(ColumnSelector(include:['object', 'string', 'category', 'bool']),
                           Pipeline(steps=[('categorical_imputer_0',
                                            SafeSimpleImputer(strategy='constant')),
                                           ('categorical_label_encoder_0',
                                            MultiLabelEncoder())])),
                          (ColumnSelector(include:number, exclude:timedelta),
                           Pipeline(steps=[('numeric_imputer_0',
                                            FloatOutputImputer(strategy='median')),
                                           ('numeric_robust_scaler_0',
                                            RobustScaler())]))],
                input_df=True)
gbm_model: LGBMClassifierWrapper(boosting_type='goss', early_stopping_rounds=10,
                      learning_rate=0.5, max_depth=5, n_estimators=200,
                      num_leaves=440, random_state=58258, reg_alpha=10,
                      reg_lambda=0.5, verbosity=-1)
)
------------------------------
...
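If you also want the raw hyperparameters of a single member, the repr printed above already lists them; the sketch below is one way to pull them out programmatically. It assumes the member exposes the underlying GBM model under an attribute matching the gbm_model label in the repr, and that the wrapped model follows the scikit-learn get_params() convention; both are assumptions rather than a documented API.

# Hedged sketch: inspect one ensemble member's hyperparameters.
# Attribute names are inferred from the printed repr and may differ by version.
member = models[0]
if member is not None:
    gbm = getattr(member, 'gbm_model', None)   # assumption: attribute matches the repr label
    if gbm is not None and hasattr(gbm, 'get_params'):
        print(type(gbm).__name__)
        print(gbm.get_params())                # scikit-learn style hyperparameters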
kilig000123 commented 4 months ago

In this information I see categorical_label_encoder_0. What exactly is this categorical encoding method?

lixfz commented 4 months ago

HyperGBM optimizes the full pipeline, from preprocessing through model training. categorical_label_encoder_0 is the preprocessing step applied to the categorical data. Taking the output of my example above:

(screenshot of the pipeline output shown above)

Here the categorical columns are encoded with MultiLabelEncoder, which comes from Hypernets and wraps LabelEncoder.
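Conceptually, that means fitting one LabelEncoder per categorical column and replacing the string values with integer codes. Below is a simplified illustration using scikit-learn directly; it is not the actual Hypernets MultiLabelEncoder implementation.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simplified illustration of per-column label encoding;
# not the actual Hypernets MultiLabelEncoder implementation.
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'M', 'L']})
encoders = {}
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le           # keep one fitted encoder per column
print(df)                        # string values replaced by integer codes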

For the detailed definition of HyperGBM's default search space, see the source files search_space.py and sklearn_ops.py.
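If you prefer to reuse the default space explicitly from Python rather than reading the source, something like the following may work; the module path hypergbm.search_space, the name search_space_general, and the search_space parameter are assumptions based on those file names and may differ across versions.

# Hedged sketch: pass HyperGBM's default search space to an experiment explicitly.
# Assumption: module/attribute names below match your installed version.
import pandas as pd
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general

train_data = pd.read_csv('train.csv')   # hypothetical dataset with a 'y' target column
experiment = make_experiment(train_data, target='y',
                             search_space=search_space_general)  # assumed parameter name
estimator = experiment.run()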