fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Question: temporal grouping #136

Closed mglowacki100 closed 7 months ago

mglowacki100 commented 7 months ago

Very interesting package. I have a use case for it, but I'm not sure whether GPBoost is a good fit. Here is a short description of my problem:

GPBoost takes a group_data parameter and my initial results are close to XGBoost, but my question is: does it actually take the grouping structure into account? Say the train set contains weeks 1, 2, 3, 4, ..., 1000 and the test set contains weeks 1005, 1006, ..., 1060 (a small gap between train and test to avoid target leakage). If 'week' is treated as an ordinary categorical variable, then the values 1005, 1006, ..., 1060 are unknown at prediction time, just as in an ordinary GBDT. Is there a way to give GPBoost the 'inductive bias' that weeks 4 and 5 are much closer than, e.g., weeks 4 and 500?

fabsig commented 7 months ago

Thanks a lot for your interest in GPBoost!

If you use grouped random effects (with the group_data argument), predictions for new groups / categories (weeks in your case) ignore the distance to existing ones. In this case, you can use a Gaussian process (GP) instead of grouped random effects. This will take the distance into account, like time series models do (in fact, a GP with an exponential covariance function corresponds to an AR(1) model). You can do this by passing the week variable to the gp_coords argument instead of group_data. Note that if the number of data points is large, you need to use an approximation for faster computations, such as gp_approx = "vecchia".
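To make the AR(1) remark concrete, here is a small standalone NumPy sketch (not GPBoost code; the range parameter `rho` and the grid of 8 weekly points are arbitrary illustrations). It shows that an exponential covariance on a regular weekly grid has exactly the correlation structure of an AR(1) process with coefficient `phi = exp(-1/rho)`, and that the implied precision matrix is tridiagonal, i.e. the process is Markov:

```python
import numpy as np

# Illustration of why a GP with an exponential covariance on a 1-D
# time index behaves like an AR(1) model.
rho = 10.0                      # hypothetical range parameter of the kernel
t = np.arange(8, dtype=float)   # weekly time points 0, 1, ..., 7
dist = np.abs(t[:, None] - t[None, :])
K = np.exp(-dist / rho)         # exponential covariance, unit variance

# An AR(1) with coefficient phi = exp(-1/rho) has correlation phi ** |s - t|.
phi = np.exp(-1.0 / rho)
K_ar1 = phi ** dist
assert np.allclose(K, K_ar1)    # identical correlation matrices

# The Markov property shows up as a tridiagonal precision matrix:
# given x_{t-1}, the value x_t is independent of the earlier past.
P = np.linalg.inv(K)
off = np.triu(np.abs(P), k=2)   # entries more than one step off the diagonal
print(np.max(off))              # essentially zero, up to floating point error
```

In GPBoost itself the switch is just in the model definition: something along the lines of `gpb.GPModel(gp_coords=week, cov_function="exponential", gp_approx="vecchia")` in place of `gpb.GPModel(group_data=week)` (see the GPBoost documentation for the exact argument names and defaults).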

This blog post might also be helpful for you.

mglowacki100 commented 7 months ago

Thanks a lot :)