fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models
Other
530 stars 42 forks source link

Medium Article on GPBoost for Longitudinal & Panel Data #118

Closed theScarro closed 7 months ago

theScarro commented 7 months ago

Hello,

thank for the great work. I'm diving into you package to implement a Gaussian Process Model for longitudinal data. I started following your Medium article, but at the moment of prediction on test data it gives me the following error:

ValueError: The argument 'gp_coords_pred' is missing

I looked at the docs and it says:

gp_coords_pred (numpy array or pandas DataFrame with numeric data or None, optional (default=None)) – Prediction coordinates (=features) for Gaussian process (if there is a GP in the model)

I don't get very well what should I pass to this parameter. I taught it should take the time, but gives me another error. Thank you very much again.

Code I'm using:

`import gpboost as gpb import pandas as pd import numpy as np

Load data

data = pd.read_csv("https://raw.githubusercontent.com/fabsig/Compare_ML_HighCardinality_Categorical_Variables/master/data/wages.csv.gz") data = data.assign(t_sq = data['t']**2)# Add t^2

Partition into training and test data

n = data.shape[0] np.random.seed(n) permute_aux = np.random.permutation(n) train_idx = permute_aux[0:int(0.8 n)] test_idx = permute_aux[int(0.8 n):n] data_train = data.iloc[train_idx] data_test = data.iloc[test_idx]

Define fixed effects predictor variables

pred_vars = [col for col in data.columns if col not in ['ln_wage', 'idcode', 't', 't_sq']]

gp_model = gpb.GPModel( gp_coords=data_train['t'], cov_function='exponential', cluster_ids=data_train['idcode'], likelihood='gaussian' ) data_bst = gpb.Dataset(data=data_train[pred_vars], label=data_train['ln_wage'])

params = {'learning_rate': 0.01, 'max_depth': 2, 'min_data_in_leaf': 10, 'lambda_l2': 10, 'num_leaves': 2**10, 'verbose': 0} nrounds = 379

gpbst = gpb.train(params=params, train_set=data_bst, gp_model=gp_model, num_boost_round=nrounds) gp_model.summary() # Estimated random effects model

cov_pars = gp_model.get_cov_pars() phi_hat = np.exp(-1 / cov_pars['GP_range'][0]) sigma2_hat = cov_pars['GP_var'][0] * (1. - phi_hat ** 2) print("Estimated innovation variance and AR(1) coefficient of year effect:") print([sigma2_hat, phi_hat])

predict

pred = gpbst.predict( data=data_test[pred_vars], group_data_pred=data_test['idcode'], group_rand_coef_data_pred=data_test[["t","t_sq"]], predict_var=False, pred_latent=False ) y_pred = pred['response_mean'] np.mean((data_test['ln_wage'] - y_pred)**2) `

fabsig commented 7 months ago

Thank you for using GPBoost!

You are assuming a different model for prediction compared to training. If you provide gp_coords and cluster_ids in GPModel() for training, you also need to provide this for predictions. I.e. use

pred = gpbst.predict(data=data_test[pred_vars],
    gp_coords_pred=data_test['t'],
    cluster_ids_pred=data_test['idcode'],
    predict_var=False,
    pred_latent=False)

instead of

pred = gpbst.predict(data=data_test[pred_vars],
    group_data_pred=data_test['idcode'],
    group_rand_coef_data_pred=data_test[["t","t_sq"]],
    predict_var=False,
    pred_latent=False)

Note that you need gpboost version 1.2.7.1 or latter for running this code correctly as I just discovered a bug in the handling of the argument cluster_ids_pred when it is of integer type.

surferzskan commented 6 months ago

Hi Fabio, thank you for your answer.

After updating the version to 1.2.7.1 and changing the code, now the prediction works.

I have a couple of more question if I may. Reading your article on Longitudinal Data, in the approach 2.5 (Subject-Specific AR(1)/Gaussian process models), you mention that using this modeling approach every unit has the the temporal evolution of its random effect modeled in a separate way, using the GP with Exponential Kernel that is equivalent to an AR(1) model. Correct me if I'm wrong.

Is it possible then to use the trained model to predict the subject-specific response variable in an autoregressive manner? e.g. predict the next 4 step of the responsive variable autoregressively? If yes, is it already implemented or it must be done with an outer loop or something similar?

Thank you again for you amazing work

fabsig commented 6 months ago

Yes, that's correct.

And yes, that's already implemented. You just need to provide the corresponding prediction times (in the future) in gp_coords_pred when calling the predict function.