Thanks a lot for using GPBoost!
Could you please provide a complete reproducible example (code + data, e.g., simulated data) in which the behavior you describe can be observed?
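For reference, a minimal sketch of the kind of self-contained, simulated example being asked for; all sizes, effect scales, and variable names below are invented for illustration:

import numpy as np
import gpboost as gpb

rng = np.random.default_rng(0)
n, n_groups, n_slopes = 9000, 300, 7

group = rng.integers(0, n_groups, size=n)              # grouping variable
X = rng.normal(size=(n, 10))                           # fixed-effects covariates
Z = X[:, :n_slopes]                                    # features with random slopes

b0 = rng.normal(scale=0.5, size=n_groups)              # random intercepts per group
B = rng.normal(scale=0.3, size=(n_groups, n_slopes))   # random slopes per group
eta = X @ rng.normal(size=10) + b0[group] + np.sum(Z * B[group], axis=1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))        # Bernoulli-logit response

gp_model = gpb.GPModel(group_data=group, group_rand_coef_data=Z,
                       ind_effect_group_rand_coef=[1] * n_slopes,
                       likelihood="bernoulli_logit")
gp_model.fit(y=y, X=X)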
Thanks for the prompt response @fabsig.
The data is sensitive; besides, the number of rows makes it impossible to simulate or post anything in an anonymized way. I will, however, post the code I used, if that helps.
import gpboost as gpb
import numpy as np

# grouping variable for the random effects
random_eff_groups = X_train[groups_col].copy()
X_train.drop(groups_col, axis=1, inplace=True)
# random coefficient (random slopes) feature matrix
X_rand_coeffs = X_train[rand_coeff_cols].copy()
# initial covariance parameter values: one for the random intercept plus one per random slope
init_cov_pars = np.ones(X_rand_coeffs.shape[1] + 1)
# model specification
gp_model = gpb.GPModel(
    group_data=random_eff_groups,
    group_rand_coef_data=X_rand_coeffs,
    ind_effect_group_rand_coef=[1] * X_rand_coeffs.shape[1],  # all random slopes belong to the single grouping variable
    likelihood="bernoulli_logit",
)
# model fitting
gp_model.fit(y=y_train.to_numpy().T.squeeze(), X=X_train, params=dict(init_cov_pars=init_cov_pars))
# predict probabilities on the training data
pred_train = gp_model.predict(predict_response=True, group_data_pred=random_eff_groups,
                              group_rand_coef_data_pred=X_rand_coeffs, X_pred=X_train)
A few more notes that might be useful:
Note: no convergence after the maximal number of iterations
Could this be a reason for the predict_proba errors?

Without a reproducible example, I cannot say anything. Could it be that there's an issue with the data, such as the presence of NAs, extreme outliers, etc.? Can you provide simulated data to reproduce the problem?
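As an aside on the non-convergence note above: GPModel.fit accepts optimization parameters (e.g. maxit, trace, optimizer_cov, as documented for GPBoost) that can raise the iteration cap, print the optimizer's progress, and switch the optimizer; a sketch with illustrative values, not code from this thread:

# Sketch: refit with a higher iteration limit, a printed optimizer trace, and a
# different covariance-parameter optimizer (values are illustrative only).
gp_model = gpb.GPModel(
    group_data=random_eff_groups,
    group_rand_coef_data=X_rand_coeffs,
    ind_effect_group_rand_coef=[1] * X_rand_coeffs.shape[1],
    likelihood="bernoulli_logit",
)
gp_model.fit(y=y_train.to_numpy().T.squeeze(), X=X_train,
             params={"maxit": 5000, "trace": True, "optimizer_cov": "nelder_mead"})
gp_model.summary()  # inspect the estimated covariance parameters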
Unfortunately, this is real-world data; I don't see a way to simulate it.
There are no missing values, but outliers exist. The underlying feature distributions are far from Gaussian; they generally have positive skew. Two features are categorical but I do not encode them for simplicity. I have standardized the numerical features. Summary statistics are shown below:
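For reference, per-feature summary statistics including skewness can be computed with pandas along these lines (a sketch, not the exact code used):

# Sketch: per-feature summary statistics and skewness with pandas.
num_cols = X_train.select_dtypes(include="number").columns
stats = X_train[num_cols].describe().T    # count, mean, std, min, quartiles, max
stats["skew"] = X_train[num_cols].skew()  # positive values indicate right skew
print(stats)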
When I include a bias term in the fixed-effects covariate matrix, model fitting fails with the error quoted in my original post. When I omit it, fitting may finish, but convergence is not always achieved and the predict method fails on the training data. Note that I have tried all available optimisation algorithms.
I've had another run after dropping all data points that lie more than 1.5 standard deviations from the mean in any feature, to see if outliers are the culprit; the model-fitting error persists.
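A sketch of this kind of filtering (threshold as described above; not the exact code used):

# Sketch: drop rows where any numerical feature lies more than 1.5 standard
# deviations from its mean.
num_cols = X_train.select_dtypes(include="number").columns
z = (X_train[num_cols] - X_train[num_cols].mean()) / X_train[num_cols].std()
keep = (z.abs() <= 1.5).all(axis=1)
X_train_filtered, y_train_filtered = X_train[keep], y_train[keep]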
Can you please at least point me in the right direction here? Is the source of the error message a numerical instability, or could it be something different? I need to narrow it down to the actual cause; otherwise I'll have to look at alternative solutions.
Yes, it could be a numerical instability. But again, without a reproducible example, I cannot say anything meaningful. Apologies! What's your ratio of 1 / 0 in the response label? Does the problem also occur if you drop some or all predictor variables?
3% positive / 97% negative.
The error does not occur when I leave out certain features or reduce the number of rows considerably (my full dataset has >1m rows), but the behaviour appears to be erratic. Does the model make any distributional assumptions or prefer certain data types?
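One way to make this less erratic would be a systematic leave-one-feature-out refit to see which columns trigger the failure; a sketch reusing the variable names from the snippet above, not code that was actually run in this thread:

# Sketch: refit with one feature left out at a time and record which
# configurations fail with the numerical error.
for col in X_train.columns:
    X_sub = X_train.drop(columns=[col])
    try:
        m = gpb.GPModel(group_data=random_eff_groups,
                        group_rand_coef_data=X_rand_coeffs,
                        ind_effect_group_rand_coef=[1] * X_rand_coeffs.shape[1],
                        likelihood="bernoulli_logit")
        m.fit(y=y_train.to_numpy().T.squeeze(), X=X_sub)
        print(f"OK without {col}")
    except Exception as e:
        print(f"FAILED without {col}: {e}")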
The model is just a generalized linear mixed effects model (same as e.g. in lme4).
Again, without a reproducible synthetic example, I cannot say anything...
While experimenting with the library in an effort to tackle a longitudinal data problem, I am encountering what appear to be numerical errors in the likelihood estimation of the GPModel.
I am using a Bernoulli logit likelihood to model a binary response variable. I am not coupling it with a LightGBM model for the fixed effects at the moment, to keep things simple. I have trimmed my dataset down to 9,000 rows by 17 columns, with 7 features included in a random slopes matrix. I have made sure there are no missing values or zero-variance features. I've also tried different initial values for the covariance parameters (e.g. a vector of ones).
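The checks mentioned above (no missing values, no zero-variance features) could look roughly like this; a sketch, not the exact code used:

# Sketch: verify there are no missing values and no zero-variance features.
assert not X_train.isna().to_numpy().any(), "missing values in the feature matrix"
zero_var = X_train.std(numeric_only=True) == 0
assert not zero_var.any(), f"zero-variance features: {list(zero_var[zero_var].index)}"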
After playing around with the dataset configuration, I've managed to fit the model in a few cases, but I then run into trouble making probability predictions on the training data themselves; I get the following error:
[GPBoost] [Fatal] NA or Inf occurred in the mode finding algorithm for the Laplace approximation
Any help would be greatly appreciated, as I cannot decipher these logs from the repo.