Thanks a lot for using GPBoost!
Could you please provide a complete reproducible example (code + data, e.g., simulated data) in which the behavior you describe is observed?
Thanks for getting back.
It may not be possible to share the data, but I'll check again. If not, I'll see if I can replicate the behavior with a similar dataset in the same environment.
I performed another test on a different panel dataset, in the same environment. `GPBoost` and `Booster` worked as expected (and likely better than fixed effects `LightGBM`), with different hyperparameter configurations training for different numbers of rounds and strongly impacting performance. I think this shows there is no software bug, just a peculiar modeling outcome with the delivery duration dataset.
I'll share my code & the delivery duration data source below if you'd like to review my experiments. Please feel free to close the issue if you wish, as the package seems to be working as expected.
The "troubleshooting" branch of this repository has all the code & experiments I mentioned as Jupyter notebooks.
requirements.txt
.1_DataPrep
will repeat my data prep & feature engineering steps and save the .csv file for the modeling notebooks. You can just run the entire notebook once.GPBoost
models can take a while, but there's no need to do more than a few dozen trials to understand if the models are learning & improving. LightGBM
tuning takes a few minutes at most with the GPU option.The dataset is from an example take-home project on StrataScratch, available for download if you create a free account. I'd rather not share it directly as I'm not sure StrataScratch allows that.
Thanks again for your time.
OK, thanks for letting me know. I currently don't have time to reproduce / review this example myself. But good to know that everything is running as expected.
Hi, thanks for developing this interesting package & modeling approach.
I've recently been performing a regression modeling exercise in Python, where I compare the performance of standard tree boosting algorithms with various `GPBoost` configurations, and with mixed linear models fitted using `GPModel`.

I observed strange behavior in the performance & training of the `Booster` + random intercept models. To summarize, the fixed effect component always predicts the response variable mean, and the `Booster` doesn't train past a few rounds in any of the hyperparameter configurations I tried (and they all yield practically the same validation scores). I am wondering if the `Booster` fails to learn from the data, either due to a bug or a quirk in the modeling. The issue does not seem to exist with just a `Booster` model by itself, without random effects, which trains for 90+ rounds and outputs considerably better & more varied predictions.

I'll summarize my dataset & experiments below. I have checked my code extensively against the examples in the documentation, and I don't think I am making a mistake with the `GPBoost` syntax. I can still share the full code if necessary, which is in Jupyter notebook format along with the outputs, but I don't think I'm allowed to share the dataset, so it may not be possible to run & reproduce it.

**Dataset**
The data consists of 100k+ rows. Each row represents an order delivery, along with various information about it. The goal is to predict the delivery durations from various order attributes. After some feature engineering I have close to 30 predictors.

`store_id` is the grouping variable: it records the unique ID of the store that fulfilled the order. There are 5000+ stores in the dataset, and each one has anywhere from a single delivery to hundreds (one delivery = one observation). The main goal of my experiment is to model `store_id` first as a fixed effect predictor with target encoding, then with a random intercept for each store, and compare the performances of the various models.
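For concreteness, a minimal smoothed target-encoding sketch is below; the column name, target name, and smoothing constant are placeholders rather than the exact notebook code:

```python
import pandas as pd

def target_encode(train, test, col="store_id", target="duration", m=20.0):
    """Encode `col` by the smoothed per-level mean of `target`."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink levels with few observations toward the global mean
    enc = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    # Levels unseen in training fall back to the global mean
    return train[col].map(enc), test[col].map(enc).fillna(global_mean)
```

In practice the encoding should be fitted out-of-fold (or on the training split only) to avoid target leakage.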
**Models / experiments**

Below are the model configurations I tried, with some notes about their outputs.
- `LGBM`: Standard fixed effects LightGBM, with `store_id` as a target-encoded predictor. Trained directly with the `lightgbm` package. RMSE 909, MAPE 22.5%. `store_id` is the top predictor.
- `LM`: Fixed effects linear regression, with `store_id` as a target-encoded predictor. Trained with `scikit-learn`'s `LinearRegression`. RMSE 950, MAPE 24.5%.
- `GPB1`: `GPBoost` model with `Booster` + a random intercept for `store_id` (see the sketch after this list). RMSE 1068, MAPE 27.6%. A random-intercept-only model for `store_id` (using `GPModel`) yields virtually the same predictions & testing scores.
- `LMM`: Mixed effects linear regression with a random intercept for `store_id`, trained with `GPModel`. RMSE 933, MAPE 24%.
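For reference, here is a minimal sketch of what the `GPB1` setup looks like; the variable names and hyperparameters are placeholders, not the exact notebook code, and the prediction dict keys assume a recent `gpboost` version:

```python
import gpboost as gpb

# group_train / group_test: the store_id columns of the train / test sets (placeholders)
gp_model = gpb.GPModel(group_data=group_train, likelihood="gaussian")
train_set = gpb.Dataset(X_train, label=y_train)

params = {"learning_rate": 0.05, "max_depth": 6, "verbose": 0}
bst = gpb.train(params=params, train_set=train_set,
                gp_model=gp_model, num_boost_round=1000)

# Response-scale predictions combine the tree ensemble and the random intercepts
pred = bst.predict(data=X_test, group_data_pred=group_test)
y_pred = pred["response_mean"]
```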
I also tried the following experiments for troubleshooting:
- `LGBM` with `store_id` completely dropped from the model. The performance suffers slightly (RMSE 926, MAPE 23%), but the model still trains for 90+ rounds, outputs varying predictions, and the predictors make significant contributions according to SHAP values. This confirms there is still considerable signal to be captured without `store_id`.
- A `Booster`-only fixed effects model with GPBoost, with `store_id` completely dropped from the model. Very similar results to the previous experiment. This shows `Booster` works as expected by itself.
- A `Booster` + random effect model, but with a randomly generated grouping variable with 100 levels, and with `store_id` as a fixed effect predictor. The random effect predictions are close to zero as expected, but the fixed effect predictions are still constant at the response mean. All hyperparameter configurations train for the maximum of 5000 rounds with virtually no difference in validation scores (see the decomposition sketch after this list).
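One way to verify where the constant predictions come from is to look at the latent components separately; a minimal sketch, assuming the `pred_latent=True` return keys of recent `gpboost` versions and the placeholder names from the sketch above:

```python
import numpy as np
import gpboost as gpb

# Randomly generated grouping variable with 100 levels, as in the third experiment
rng = np.random.default_rng(42)
fake_group_train = rng.integers(0, 100, size=len(X_train))
fake_group_test = rng.integers(0, 100, size=len(X_test))

gp_model = gpb.GPModel(group_data=fake_group_train, likelihood="gaussian")
bst = gpb.train(params={"learning_rate": 0.05, "verbose": 0},
                train_set=gpb.Dataset(X_train, label=y_train),
                gp_model=gp_model, num_boost_round=100)

# pred_latent=True returns the tree-ensemble (fixed effect) and random effect parts separately
pred = bst.predict(data=X_test, group_data_pred=fake_group_test, pred_latent=True)
print(np.std(pred["fixed_effect"]))        # ~0 means a constant fixed effect prediction
print(np.std(pred["random_effect_mean"]))  # ~0 expected for a meaningless grouping
```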
All of this suggests to me that the `Booster` component of `GPB1` is somehow not working properly, or that the presence of a random effect component is somehow preventing the `Booster` from learning the fixed effects from the data. The random effect for `store_id` captures a lot of the variance, and the remaining fixed predictors are not very predictive, but they still make considerable contributions in other models, which all perform considerably better than `GPB1` even without `store_id`.

I am curious whether this is expected behavior, and whether there's a modeling-related reason for it. If it sounds like a genuine software bug, please let me know and I can provide technical details. Thanks in advance for your time.