fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Fatal errors during random effects model fitting #122

Closed NickStrEng closed 11 months ago

NickStrEng commented 11 months ago

While experimenting with the library to tackle a longitudinal data problem, I am encountering what appear to be numerical errors in the likelihood estimation of the GPModel.

[GPBoost] [Warning] The covariate data contains no column of ones, i.e., no intercept is included.
[GPBoost] [Fatal] NaN occurred in initial approximate negative marginal log-likelihood. Possible solutions: try other initial values ('init_cov_pars' and 'init_coef') or other tuning parameters in case you apply the GPBoost algorithm (e.g., learning_rate)

I am using a Bernoulli logit likelihood to model a binary response variable. To keep things simple, I am not coupling it with a LightGBM model for the fixed effects at the moment. I have trimmed my dataset down to 9,000 rows × 17 columns, with 7 features included in a random-slopes matrix. I have made sure there are no missing values or zero-variance features, and I have also tried different initial values for the covariance parameters (all ones).
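Not from the thread, but a quick way to rule out the usual causes of a NaN loss before fitting (a sketch; `X` and the column values stand in for the real frame):

```python
import numpy as np
import pandas as pd

def sanity_check(X: pd.DataFrame) -> dict:
    """Flag common causes of NaN losses: missing values, infinities,
    zero-variance columns, and extreme magnitudes."""
    num = X.select_dtypes(include=[np.number])
    return {
        "n_missing": int(X.isna().sum().sum()),
        "n_inf": int(np.isinf(num.to_numpy()).sum()),
        "zero_var_cols": [c for c in num.columns if num[c].std() == 0],
        "max_abs": float(num.abs().to_numpy().max()),
    }

# toy frame with one constant column and one extreme value
X = pd.DataFrame({"a": [0.0, 1.0, 2.0],
                  "b": [1.0, 1.0, 1.0],
                  "c": [0.1, -0.2, 1e6]})
report = sanity_check(X)
```

A very large `max_abs` after standardization, or a surviving zero-variance column, would both be worth investigating before blaming the optimizer.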

After playing around with the dataset configuration, I've managed to fit the model in a few cases, but then I face trouble making probability predictions on the training data themselves; I am getting the following error:

[GPBoost] [Fatal] NA or Inf occurred in the mode finding algorithm for the Laplace approximation

Any help would be greatly appreciated as I cannot decipher these logs from the repo.

fabsig commented 11 months ago

Thanks a lot for using GPBoost!

Could you please provide a complete reproducible example (code + data, e.g., simulated data) in which the described behavior can be observed?

NickStrEng commented 11 months ago

Thanks for the prompt response @fabsig.

The data is sensitive; besides, the number of rows makes it impossible to simulate or post anything in an anonymized way. I will, however, post the code I used in case that helps:

# groups
random_eff_groups = X_train[groups_col].copy()
X_train.drop(groups_col, axis=1, inplace=True)

# random-coefficient feature matrix
X_rand_coeffs = X_train[rand_coeff_cols].copy()

# initial covariance parameter values
init_cov_pars = np.ones(X_rand_coeffs.shape[1] + 1)

# model specification
gp_model = gpb.GPModel(
    group_data=random_eff_groups,
    group_rand_coef_data=X_rand_coeffs,
    ind_effect_group_rand_coef=[1] * X_rand_coeffs.shape[1],
    likelihood="bernoulli_logit",
)

# model fitting
gp_model.fit(
    y=y_train.to_numpy().T.squeeze(),
    X=X_train,
    params=dict(init_cov_pars=init_cov_pars),
)

# predict probabilities on the training data
pred_train = gp_model.predict(
    predict_response=True,
    group_data_pred=random_eff_groups,
    group_rand_coef_data_pred=X_rand_coeffs,
    X_pred=X_train,
)

A few more notes that might be useful:

fabsig commented 11 months ago

Without a reproducible example, I cannot say anything. Could it be that there is an issue with the data, such as the presence of NAs, extreme outliers, etc.? Can you provide simulated data that reproduces the problem?
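For reference, grouped binary-response data with the same structure (random intercepts, rare positives) can be simulated without exposing any real values; a sketch using only NumPy, with all sizes and parameters chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group, p = 200, 45, 7          # ~9,000 rows, 7 features
n = n_groups * n_per_group

group = np.repeat(np.arange(n_groups), n_per_group)
X = rng.standard_normal((n, p))                # standardized-looking features
beta = rng.normal(0.0, 0.5, size=p)            # fixed effects
b = rng.normal(0.0, 1.0, size=n_groups)        # random intercepts, variance 1

logit = X @ beta + b[group] - 3.0              # negative offset -> rare positives
prob = 1.0 / (1.0 + np.exp(-logit))
y = rng.binomial(1, prob)                      # binary response
```

`group`, `X`, and `y` could then be passed to `gpb.GPModel(...)` / `fit(...)` in place of the sensitive data; if the same fatal error appears, the simulated example can be shared.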

NickStrEng commented 11 months ago

Unfortunately, this is real-world data; I don't see a way to simulate it.

There are no missing values, but outliers exist. The underlying feature distributions are far from Gaussian; they generally have positive skew. Two features are categorical, but I do not encode them for simplicity. I have standardized the numerical features. Summary statistics are shown in the attached screenshots.
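As an aside, positively skewed features often behave better in likelihood-based fitting after a log transform applied before standardizing; a minimal sketch, assuming non-negative values:

```python
import numpy as np

def log_standardize(x: np.ndarray) -> np.ndarray:
    """log1p to reduce positive skew, then z-score."""
    z = np.log1p(x)                      # requires x >= 0
    return (z - z.mean()) / z.std()

# toy positively skewed sample (lognormal)
x = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=1000)
xt = log_standardize(x)
```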

When I include a bias (intercept) column in the fixed-effects covariate matrix, model fitting fails with the error quoted above. When I omit it, fitting may finish, but convergence is not always achieved and the predict method fails on the training data. Note that I have tried all available optimisation algorithms.

I've done another run after dropping all data points beyond 1.5 standard deviations from the mean per feature, to see whether outliers are the culprit; the model-fitting error persists.
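Not part of the original post: clipping (winsorizing) is a gentler alternative to dropping rows, since it keeps every observation while bounding the extremes; a sketch:

```python
import numpy as np

def winsorize(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Clip values beyond k standard deviations from the mean."""
    mu, sd = x.mean(), x.std()
    return np.clip(x, mu - k * sd, mu + k * sd)

x = np.array([0.0, 0.1, -0.2, 0.3, 100.0])   # one extreme outlier
xc = winsorize(x, k=1.5)                      # outlier pulled toward the bulk
```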

Can you please at least point me in the right direction here? Is the source of the error message a numerical instability, or could it be something else? I need to narrow it down to the actual cause; otherwise I'll have to look at alternative solutions.

fabsig commented 11 months ago

Yes, it could be a numerical instability. But again, without a reproducible example, I cannot say anything meaningful. Apologies! What's your ratio of 1 / 0 in the response label? Does the problem also occur if you drop some or all predictor variables?

NickStrEng commented 11 months ago

> What's your ratio of 1 / 0 in the response label? Does the problem also occur if you drop some or all predictor variables?

3% positive / 97% negative.
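With only 3% positives, some groups may contain no positive cases at all, which can push their random-effect estimates toward extremes and destabilize the Laplace mode-finding. A quick per-group check (a sketch; `group` and `y` are stand-ins for the real arrays):

```python
import numpy as np

rng = np.random.default_rng(2)
group = rng.integers(0, 50, size=2000)          # stand-in group labels
y = (rng.random(2000) < 0.03).astype(int)       # ~3% positive class

# positive rate within each group
counts = np.bincount(group, minlength=50)
pos_rate = np.bincount(group, weights=y, minlength=50) / counts

# groups with no positive case at all
n_all_negative = int((pos_rate == 0).sum())
```

A large `n_all_negative` would suggest the imbalance interacts badly with the grouping, which matches the observation that dropping rows or features changes the behavior erratically.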

The error does not occur when I leave out certain features or reduce the number of rows considerably (my full dataset is >1M rows), but the behavior appears erratic. Does the model make any distributional assumptions or prefer certain data types?

fabsig commented 11 months ago

The model is just a generalized linear mixed effects model (the same as, e.g., in lme4).

Again, without a reproducible synthetic example, I cannot say anything...