Thank you for using GPBoost!
You are assuming a different model for prediction than for training. If you provide `gp_coords` and `cluster_ids` in `GPModel()` for training, you also need to provide them for prediction, i.e., use
```python
pred = gpbst.predict(data=data_test[pred_vars],
                     gp_coords_pred=data_test['t'],
                     cluster_ids_pred=data_test['idcode'],
                     predict_var=False,
                     pred_latent=False)
```
instead of
```python
pred = gpbst.predict(data=data_test[pred_vars],
                     group_data_pred=data_test['idcode'],
                     group_rand_coef_data_pred=data_test[["t","t_sq"]],
                     predict_var=False,
                     pred_latent=False)
```
Note that you need gpboost version 1.2.7.1 or later to run this code correctly, as I just discovered a bug in the handling of the argument `cluster_ids_pred` when it is of integer type.
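A quick way to check the installed version (a minimal check, assuming the standard `__version__` attribute is available):

```python
import gpboost as gpb

# Upgrade first if necessary, e.g. with: pip install -U gpboost
print(gpb.__version__)  # should print 1.2.7.1 or later
```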
Hi Fabio, thank you for your answer.
After updating to version 1.2.7.1 and changing the code, the prediction now works.
I have a couple more questions if I may. Reading your article on Longitudinal Data, in approach 2.5 (Subject-Specific AR(1) / Gaussian process models) you mention that with this modeling approach the temporal evolution of every unit's random effect is modeled separately, using a GP with an exponential kernel, which is equivalent to an AR(1) model. Correct me if I'm wrong.
Is it then possible to use the trained model to predict the subject-specific response variable in an autoregressive manner, e.g. predict the next 4 steps of the response variable autoregressively? If yes, is this already implemented, or does it need to be done with an outer loop or something similar?
Thank you again for your amazing work.
Yes, that's correct.
And yes, that's already implemented. You just need to provide the corresponding prediction times (in the future) in gp_coords_pred when calling the predict function.
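For illustration, a minimal sketch of a 4-step-ahead prediction for a single subject, reusing the variable names from the script further below (`data`, `pred_vars`, `gpbst`); the chosen subject and the placeholder fixed-effects features for the future time points are assumptions for this example only:

```python
import numpy as np
import pandas as pd

# Pick one subject and its last observed time point (illustrative choice)
subject = int(data['idcode'].iloc[0])
subject_rows = data[data['idcode'] == subject]
last_t = subject_rows['t'].max()

# The next 4 (future) time points for this subject
future_t = np.arange(last_t + 1, last_t + 5)

# Fixed-effects features for the future time points; here the subject's
# last observed feature row is simply repeated as a placeholder
X_future = pd.concat([subject_rows[pred_vars].iloc[[-1]]] * len(future_t),
                     ignore_index=True)

# Provide the future times in gp_coords_pred and the matching subject ids
# in cluster_ids_pred
pred_future = gpbst.predict(data=X_future,
                            gp_coords_pred=future_t.reshape(-1, 1),
                            cluster_ids_pred=np.repeat(subject, len(future_t)),
                            predict_var=False,
                            pred_latent=False)
print(pred_future['response_mean'])  # predicted response for the next 4 steps
```

No outer loop is needed: the predict call returns the predictive means for all requested future time points at once.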
Hello,
thanks for the great work. I'm diving into your package to implement a Gaussian process model for longitudinal data. I started from your Medium article, but when predicting on the test data I get the following error:
ValueError: The argument 'gp_coords_pred' is missing
I looked at the docs and it says:
gp_coords_pred (numpy array or pandas DataFrame with numeric data or None, optional (default=None)) – Prediction coordinates (=features) for Gaussian process (if there is a GP in the model)
I don't quite understand what I should pass to this parameter. I thought it should take the time variable, but that gives me another error. Thank you very much again.
Code I'm using:
```python
import gpboost as gpb
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz")
data = data.assign(t_sq = data['t']**2)  # Add t^2

# Partition into training and test data
n = data.shape[0]
np.random.seed(n)
permute_aux = np.random.permutation(n)
train_idx = permute_aux[0:int(0.8 * n)]
test_idx = permute_aux[int(0.8 * n):n]
data_train = data.iloc[train_idx]
data_test = data.iloc[test_idx]

# Define fixed effects predictor variables
pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't', 't_sq']]

gp_model = gpb.GPModel(gp_coords=data_train['t'], cov_function='exponential',
                       cluster_ids=data_train['idcode'], likelihood='gaussian')
data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])
params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10,
          'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0}
nrounds = 379
gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model,
                  num_boost_round=nrounds)
gp_model.summary()  # Estimated random effects model

cov_pars = gp_model.get_cov_pars()
phi_hat = np.exp(-1 / cov_pars['GP_range'][0])
sigma2_hat = cov_pars['GP_var'][0] * (1. - phi_hat ** 2)
print("Estimated innovation variance and AR(1) coefficient of year effect:")
print([sigma2_hat, phi_hat])

# Predict
pred = gpbst.predict(data=data_test[pred_vars],
                     group_data_pred=data_test['idcode'],
                     group_rand_coef_data_pred=data_test[["t","t_sq"]],
                     predict_var=False,
                     pred_latent=False)
y_pred = pred['response_mean']
np.mean((data_test['ln_wage'] - y_pred)**2)
```