fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Possible training issue with Booster + Random Effect models #117

Closed: AhmetZamanis closed this issue 11 months ago

AhmetZamanis commented 11 months ago

Hi, thanks for developing this interesting package & modeling approach.

I've been recently performing a regression modeling exercise in Python, where I compare the performance of standard tree boosting algorithms with various GPBoost configurations, and with mixed linear models fitted using GPModel.

I observed strange behavior in the performance & training of the Booster + random intercept models. To summarize, the fixed effect component always predicts the response variable mean, and the Booster doesn't train past a few rounds in any of the hyperparameter configurations I tried (they all yield practically the same validation scores). I am wondering if the Booster fails to learn from the data, either due to a bug or a quirk in the modeling. The issue does not seem to exist with a Booster model by itself, without random effects, which trains for 90+ rounds and produces considerably better and more varied predictions.

I'll summarize my dataset & experiments below. I have checked my code extensively using the examples in the documentation, and I don't think I am making a mistake with the GPBoost syntax. I can still share the full code if necessary, which is in Jupyter notebook format along with the outputs, but I don't think I'm allowed to share the dataset, so it may not be possible to run & reproduce it.

Dataset

The data consists of 100k+ rows. Each row represents an order delivery and various information about it. The goal is to predict the delivery durations using various order attributes. After some feature engineering I have close to 30 predictors.

Models / experiments

Below are the model configurations I tried, and some notes about their outputs.

LGBM: Standard fixed effects LightGBM, with store_id as a target encoded predictor. Trained directly with the lightgbm package.

LM: Fixed effects linear regression, with store_id as a target encoded predictor. Trained with scikit-learn's LinearRegression.

GPB1: GPBoost model with Booster + a random intercept for store_id.

LMM: Mixed effects linear regression, random intercept for store_id, trained with GPModel. (A rough code sketch of the GPB1 and LMM setups follows below.)
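
Roughly, the GPB1 and LMM configurations follow the pattern below. This is a minimal sketch with toy placeholder data and illustrative parameter values, not my actual code:

```python
import numpy as np
import gpboost as gpb

# Toy placeholder data standing in for the delivery dataset (not the real data)
rng = np.random.default_rng(0)
n = 1000
X_train = rng.uniform(size=(n, 5))
store_id_train = rng.integers(0, 50, size=n)
y_train = X_train[:, 0] + rng.normal(size=n)

# GPB1: Booster for the fixed effects + random intercept for store_id
gp_model = gpb.GPModel(group_data=store_id_train, likelihood="gaussian")
data_train = gpb.Dataset(data=X_train, label=y_train)
params = {"objective": "regression_l2", "learning_rate": 0.05, "verbose": 0}
gpb1 = gpb.train(params=params, train_set=data_train,
                 gp_model=gp_model, num_boost_round=500)

# LMM: linear fixed effects + the same random intercept, fitted with GPModel alone
lmm = gpb.GPModel(group_data=store_id_train, likelihood="gaussian")
lmm.fit(y=y_train, X=np.column_stack([np.ones(n), X_train]))  # intercept column added
lmm.summary()
```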

I also tried the following experiments for troubleshooting:

All of this suggests to me that the Booster component of GPB1 is not working properly, or that the presence of a random effect component is somehow preventing the Booster from learning the fixed effects from the data. The random effect for store_id captures a lot of the variance, and the remaining fixed predictors are not very predictive, but they still contribute considerably in the other models, all of which perform noticeably better than GPB1 even without store_id.

I am curious if this is expected behavior, and if there's a modeling-related reason for it. If it sounds like a genuine software bug, please let me know and I can provide technical details. Thanks in advance for your time.

fabsig commented 11 months ago

Thanks a lot for using GPBoost!

Could you please provide a complete reproducible example (code + data, e.g., simulated data) in which the behavior you describe is observed?
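
For reference, such a simulated-data example might look something like the sketch below. This is illustrative only: the data-generating process and parameter values are assumptions, and the exact predict() argument and key names may differ between GPBoost versions.

```python
import numpy as np
import gpboost as gpb

# Simulate grouped data: known nonlinear fixed effect + random intercept per group
rng = np.random.default_rng(1)
n, n_groups = 5000, 100
group = rng.integers(0, n_groups, size=n)        # stands in for store_id
X = rng.uniform(size=(n, 5))
f = np.sin(6 * X[:, 0]) + X[:, 1] ** 2           # true fixed-effect function
b = rng.normal(scale=0.5, size=n_groups)         # true random intercepts
y = f + b[group] + rng.normal(scale=0.1, size=n)

# Fit Booster + random intercept, as in the reported GPB1 configuration
gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")
data_train = gpb.Dataset(data=X, label=y)
params = {"objective": "regression_l2", "learning_rate": 0.05,
          "max_depth": 3, "verbose": 0}
bst = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=200)

# If the Booster learns the fixed effect, its latent predictions should vary with X
# instead of collapsing to (roughly) the mean of y.
# Argument/key names follow the GPBoost docs but may differ across versions.
pred = bst.predict(data=X, group_data_pred=group, pred_latent=True)
print(np.std(pred["fixed_effect"]), np.std(pred["random_effect_mean"]))
```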

AhmetZamanis commented 11 months ago

Thanks for getting back.

It may not be possible to share the data, but I'll check again. If not, I'll see if I can replicate the behavior with a similar dataset in the same environment.

AhmetZamanis commented 11 months ago

I performed another test on a different panel dataset, with the same environment. GPBoost and the Booster worked as expected (and likely better than fixed effects LightGBM), with different hyperparameter configurations training for different numbers of rounds and strongly impacting performance. I think this shows there is no software bug, just a peculiar modeling outcome with the delivery duration dataset.

I'll share my code & the delivery duration data source below if you'd like to review my experiments. Please feel free to close the issue if you wish, as the package seems to be working as expected.

Code

The "troubleshooting" branch of this repository has all the code & experiments I mentioned as Jupyter notebooks.

Dataset

The dataset is from an example take-home project on StrataScratch, available for download if you create a free account. I'd rather not share it directly as I'm not sure StrataScratch allows that.

Thanks again for your time.

fabsig commented 11 months ago

OK, thanks for letting me know. I currently don't have time to reproduce / review this example myself. But good to know that everything is running as expected.