fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Get optimal parameters of trained gpboost.basic.Booster #27

Closed · poroc300 closed this issue 3 years ago

poroc300 commented 3 years ago

First of all, thank you for all the nice improvements added in the last package update.

I performed a grid search to determine optimal parameters for my analysis. After finding them, I trained a model with those parameters and saved it to a JSON file. When I load the model, however, I cannot retrieve the parameters that were used to train it. I am not sure whether I am using the wrong attributes, and I know this is a minor issue: I can save the parameters separately in another file, but it would be handier to access them through the loaded model itself.

Please find below a snippet of code to replicate my problem. I am using Spyder 5.0.0 with Python 3.8 on Windows 10. Thank you very much.


import os
import numpy as np
import gpboost as gpb
from sklearn.model_selection import KFold

np.random.seed(42)

#--------------------------- Simulated data ---------------------------------------------------
#same simulated dataset used in the tutorials of this package
def f1d(x):
    """Non-linear function for simulation"""
    return (1.7 * (1 / (1 + np.exp(-(x - 0.5) * 20)) + 0.75 * x))

n = 5000  # number of samples
m = 500  # number of groups
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
b1 = np.random.normal(size=m)  # simulated random effects
eps = b1[group]
X = np.random.rand(n, 2)
f = f1d(X[:, 0])
xi = np.sqrt(0.01) * np.random.normal(size=n)  # error term
y = f + eps + xi  # observed data
#----------------------------------------------------------------------------------------------

#learning parameters to be tested
learn_params = {'learning_rate': 0.05,
                'max_depth': 6,
                'min_data_in_leaf': [5, 10, 15],
                'max_bin': [50, 100]} 

#core parameters
core_params = {'objective': 'regression_l2', 'num_leaves': 50} 

#input data
kfold = KFold(n_splits=5, random_state=42, shuffle=True)
gpb_data = gpb.Dataset(X, y)
gpb_model = gpb.GPModel(group_data=group).set_optim_params(params={"optimizer_cov": "gradient_descent"})

#perform grid search
opt_params = gpb.grid_search_tune_parameters(param_grid=learn_params,
                                             params=core_params,
                                             num_try_random=None,
                                             folds=kfold,
                                             gp_model=gpb_model,
                                             use_gp_model_for_validation=True,
                                             train_set=gpb_data,
                                             num_boost_round=1000, 
                                             metrics='root_mean_squared_error')

#opt_params results
# {'best_params': {'learning_rate': 0.05,
#                  'max_depth': 6,
#                  'min_data_in_leaf': 15,
#                  'max_bin': 50},
#  'best_iter': 59,
#  'best_score': 1.0046612116531306}

#merge the optimal and core params to train a model
gpb_params = {**opt_params["best_params"], **core_params}

#train model
gpb_trained = gpb.train(params=gpb_params, train_set=gpb_data, gp_model=gpb_model, 
                        num_boost_round=opt_params['best_iter'])

#using the params attribute we can get the parameters of gpb_trained
#gpb_trained.params

#save trained model to a file
path = os.path.join(os.getcwd(), "model.json")
gpb_trained.save_model(path)

#load model
loaded_model = gpb.Booster(model_file=path)

#when trying to use the attribute params from loaded_model, an empty dictionary is printed
loaded_model.params #returns {}
fabsig commented 3 years ago

Many thanks for your helpful feedback!

It is true that when loading a model from a file, the parameters are not loaded into Python again. They are loaded into the corresponding C++ object, but they are not visible in Python. This behavior is inherited from LightGBM, and in general I try not to deviate too much from the LightGBM implementation concerning the tree-boosting part. Since LightGBM does not support this and it is not a very important feature, I do not plan to add it. Note that the parameters are saved in the model file itself. For easier use, however, it is probably better to save the parameters separately, for instance as sketched below.
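A minimal sketch of that, assuming the gpb_params and path variables (and the os and gpb imports) from your snippet above; the file name params.json is just an illustration:

import json

# save the training parameters next to the model file (illustrative)
params_path = os.path.join(os.getcwd(), "params.json")
with open(params_path, "w") as f:
    json.dump(gpb_params, f)

# later, restore them together with the loaded model
loaded_model = gpb.Booster(model_file=path)
with open(params_path) as f:
    loaded_params = json.load(f)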

Another thing: your code currently ignores the gp_model, and I guess this is not your intention. The reason is that the line

gpb_model = gpb.GPModel(group_data=group).set_optim_params(params={"optimizer_cov": "gradient_descent"})

returns None (since set_optim_params returns nothing), so gpb_model is None and the random effects model is silently dropped from both the grid search and the training. I will change this so that it works correctly in future releases of GPBoost. For now, you need to use two lines of code:

gpb_model = gpb.GPModel(group_data=group)
gpb_model.set_optim_params(params={"optimizer_cov": "gradient_descent"})
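As a quick sanity check (purely illustrative), you can verify that gpb_model is now an actual GPModel object rather than None:

print(isinstance(gpb_model, gpb.GPModel))  # True with the two-line version; the one-liner yields None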
poroc300 commented 3 years ago

Many thanks for the thorough response and the advice on setting up the model parameters correctly.