fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Fatal errors during random effects model fitting #122

Closed NickStrEng closed 11 months ago

NickStrEng commented 11 months ago

While experimenting with the library to tackle a longitudinal data problem, I am encountering what appear to be numerical errors in the likelihood estimation of the GPModel.

[GPBoost] [Warning] The covariate data contains no column of ones, i.e., no intercept is included.
[GPBoost] [Fatal] NaN occurred in initial approximate negative marginal log-likelihood. Possible solutions: try other initial values ('init_cov_pars' and 'init_coef') or other tuning parameters in case you apply the GPBoost algorithm (e.g., learning_rate)

I am using a Bernoulli logit likelihood to model a binary response variable. To keep things simple, I am not coupling it with a LightGBM model for the fixed effects at the moment. I have trimmed my dataset down to 9,000 rows × 17 columns, with 7 features included in a random-slopes matrix. I have made sure there are no missing values or zero-variance features, and I have also tried different initial values for the covariance parameters (all ones).
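Not from the thread, but a quick way to rule out the usual causes of a NaN loss before fitting (a sketch; `X` and the column values stand in for the real frame):

```python
import numpy as np
import pandas as pd

def sanity_check(X: pd.DataFrame) -> dict:
    """Flag common causes of NaN losses: missing values, infinities,
    zero-variance columns, and extreme magnitudes."""
    num = X.select_dtypes(include=[np.number])
    return {
        "n_missing": int(X.isna().sum().sum()),
        "n_inf": int(np.isinf(num.to_numpy()).sum()),
        "zero_var_cols": [c for c in num.columns if num[c].std() == 0],
        "max_abs": float(num.abs().to_numpy().max()),
    }

# toy frame with one constant column and one extreme value
X = pd.DataFrame({"a": [0.0, 1.0, 2.0],
                  "b": [1.0, 1.0, 1.0],
                  "c": [0.1, -0.2, 1e6]})
report = sanity_check(X)
```

A very large `max_abs` after standardization, or a surviving zero-variance column, would both be worth investigating before blaming the optimizer.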

After playing around with the dataset configuration, I've managed to fit the model in a few cases, but then I face trouble making probability predictions on the training data themselves; I am getting the following error:

[GPBoost] [Fatal] NA or Inf occurred in the mode finding algorithm for the Laplace approximation

Any help would be greatly appreciated as I cannot decipher these logs from the repo.

fabsig commented 11 months ago

Thanks a lot for using GPBoost!

Could you please provide a complete reproducible example (code + data, e.g., simulated data) in which the described behavior can be observed?

NickStrEng commented 11 months ago

Thanks for the prompt response @fabsig.

The data is sensitive; besides, the number of rows makes it impossible to simulate or post anything in an anonymized way. I will, however, post the code I used in case that helps:

# groups
random_eff_groups = X_train[groups_col].copy()
X_train.drop(groups_col, axis=1, inplace=True)

# random-coefficient feature matrix
X_rand_coeffs = X_train[rand_coeff_cols].copy()

# initial covariance parameter values
init_cov_pars = np.ones(X_rand_coeffs.shape[1] + 1)

# model specification
gp_model = gpb.GPModel(
    group_data=random_eff_groups,
    group_rand_coef_data=X_rand_coeffs,
    ind_effect_group_rand_coef=[1] * X_rand_coeffs.shape[1],
    likelihood="bernoulli_logit",
)

# model fitting
gp_model.fit(
    y=y_train.to_numpy().T.squeeze(),
    X=X_train,
    params=dict(init_cov_pars=init_cov_pars),
)

# predict probabilities on the training data
pred_train = gp_model.predict(
    predict_response=True,
    group_data_pred=random_eff_groups,
    group_rand_coef_data_pred=X_rand_coeffs,
    X_pred=X_train,
)

A few more notes that might be useful:

fabsig commented 11 months ago

Without a reproducible example, I cannot say anything. Could it be that there is an issue with the data, such as the presence of NAs, extreme outliers, etc.? Can you provide simulated data that reproduces the problem?
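For reference, grouped binary-response data with the same structure (random intercepts, rare positives) can be simulated without exposing any real values; a sketch using only NumPy, with all sizes and parameters chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group, p = 200, 45, 7          # ~9,000 rows, 7 features
n = n_groups * n_per_group

group = np.repeat(np.arange(n_groups), n_per_group)
X = rng.standard_normal((n, p))                # standardized-looking features
beta = rng.normal(0.0, 0.5, size=p)            # fixed effects
b = rng.normal(0.0, 1.0, size=n_groups)        # random intercepts, variance 1

logit = X @ beta + b[group] - 3.0              # negative offset -> rare positives
prob = 1.0 / (1.0 + np.exp(-logit))
y = rng.binomial(1, prob)                      # binary response
```

`group`, `X`, and `y` could then be passed to `gpb.GPModel(...)` / `fit(...)` in place of the sensitive data; if the same fatal error appears, the simulated example can be shared.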

NickStrEng commented 11 months ago

Unfortunately, this is real-world data; I don't see a way to simulate it.

There are no missing values, but outliers exist. The underlying feature distributions are far from Gaussian; they generally have positive skew. Two features are categorical, but I do not encode them for simplicity. I have standardized the numerical features. Summary statistics are shown in the attached screenshots.
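As an aside, positively skewed features often behave better in likelihood-based fitting after a log transform applied before standardizing; a minimal sketch, assuming non-negative values:

```python
import numpy as np

def log_standardize(x: np.ndarray) -> np.ndarray:
    """log1p to reduce positive skew, then z-score."""
    z = np.log1p(x)                      # requires x >= 0
    return (z - z.mean()) / z.std()

# toy positively skewed sample (lognormal)
x = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=1000)
xt = log_standardize(x)
```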

When I include a bias (intercept) column in the fixed-effects covariate matrix, model fitting fails with the error quoted above. When I omit it, fitting may finish, but convergence is not always achieved and the predict method fails on the training data. Note that I have tried all available optimisation algorithms.

I've done another run after dropping all data points beyond 1.5 standard deviations from the mean per feature, to see whether outliers are the culprit; the model-fitting error persists.
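Not part of the original post: clipping (winsorizing) is a gentler alternative to dropping rows, since it keeps every observation while bounding the extremes; a sketch:

```python
import numpy as np

def winsorize(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Clip values beyond k standard deviations from the mean."""
    mu, sd = x.mean(), x.std()
    return np.clip(x, mu - k * sd, mu + k * sd)

x = np.array([0.0, 0.1, -0.2, 0.3, 100.0])   # one extreme outlier
xc = winsorize(x, k=1.5)                      # outlier pulled toward the bulk
```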

Can you please at least point me in the right direction here? Is the source of the error message a numerical instability, or could it be something else? I need to narrow it down to the actual cause; otherwise I'll have to look at alternative solutions.

fabsig commented 11 months ago

Yes, it could be a numerical instability. But again, without a reproducible example, I cannot say anything meaningful. Apologies! What's your ratio of 1 / 0 in the response label? Does the problem also occur if you drop some or all predictor variables?

NickStrEng commented 11 months ago

> What's your ratio of 1 / 0 in the response label? Does the problem also occur if you drop some or all predictor variables?

3% positive / 97% negative.
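With only 3% positives, some groups may contain no positive cases at all, which can push their random-effect estimates toward extremes and destabilize the Laplace mode-finding. A quick per-group check (a sketch; `group` and `y` are stand-ins for the real arrays):

```python
import numpy as np

rng = np.random.default_rng(2)
group = rng.integers(0, 50, size=2000)          # stand-in group labels
y = (rng.random(2000) < 0.03).astype(int)       # ~3% positive class

# positive rate within each group
counts = np.bincount(group, minlength=50)
pos_rate = np.bincount(group, weights=y, minlength=50) / counts

# groups with no positive case at all
n_all_negative = int((pos_rate == 0).sum())
```

A large `n_all_negative` would suggest the imbalance interacts badly with the grouping, which matches the observation that dropping rows or features changes the behavior erratically.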

The error does not occur when I leave out certain features or reduce the number of rows considerably (my full dataset is >1M rows), but the behavior appears erratic. Does the model make any distributional assumptions or prefer certain data types?

fabsig commented 11 months ago

The model is just a generalized linear mixed effects model (the same as, e.g., in lme4).

Again, without a reproducible synthetic example, I cannot say anything...