fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Question: temporal grouping #136

Closed mglowacki100 closed 7 months ago

mglowacki100 commented 7 months ago

Very interesting package. I have a use case for it, but I'm not sure whether GPBoost is a good fit. Here is a short description of my problem:

GPBoost takes a group_data parameter and my initial results are close to XGBoost, but my question is: does it actually take the grouping structure into account? Say the train set contains weeks 1, 2, 3, 4, ..., 1000 and the test set contains weeks 1005, 1006, ..., 1060 (a small gap between train and test to avoid target leakage). If 'week' is treated as an ordinary categorical variable, then the values 1005, 1006, ..., 1060 are unknown at prediction time, just as in an ordinary GBDT. Is there a way to give GPBoost the 'inductive bias' that weeks 4 and 5 are much closer than, e.g., weeks 4 and 500?

fabsig commented 7 months ago

Thanks a lot for your interest in GPBoost!

If you use grouped random effects (with the group_data argument), predictions for new groups / categories (weeks in your case) ignore the distance to existing ones. In this case, you can use a Gaussian process (GP) instead of grouped random effects. This will take the distance into account, like time series models do (in fact, a GP with an exponential covariance function corresponds to an AR(1) model). You can do this by passing the week variable to the gp_coords argument instead of group_data. Note that if the number of data points is large, you need to use an approximation for faster computations, such as gp_approx = "vecchia".
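To make the AR(1) remark concrete, here is a small standalone NumPy sketch (not GPBoost code; the range parameter `rho` and the grid of 8 weekly points are arbitrary illustrations). It shows that an exponential covariance on a regular weekly grid has exactly the correlation structure of an AR(1) process with coefficient `phi = exp(-1/rho)`, and that the implied precision matrix is tridiagonal, i.e. the process is Markov:

```python
import numpy as np

# Illustration of why a GP with an exponential covariance on a 1-D
# time index behaves like an AR(1) model.
rho = 10.0                      # hypothetical range parameter of the kernel
t = np.arange(8, dtype=float)   # weekly time points 0, 1, ..., 7
dist = np.abs(t[:, None] - t[None, :])
K = np.exp(-dist / rho)         # exponential covariance, unit variance

# An AR(1) with coefficient phi = exp(-1/rho) has correlation phi ** |s - t|.
phi = np.exp(-1.0 / rho)
K_ar1 = phi ** dist
assert np.allclose(K, K_ar1)    # identical correlation matrices

# The Markov property shows up as a tridiagonal precision matrix:
# given x_{t-1}, the value x_t is independent of the earlier past.
P = np.linalg.inv(K)
off = np.triu(np.abs(P), k=2)   # entries more than one step off the diagonal
print(np.max(off))              # essentially zero, up to floating point error
```

In GPBoost itself the switch is just in the model definition: something along the lines of `gpb.GPModel(gp_coords=week, cov_function="exponential", gp_approx="vecchia")` in place of `gpb.GPModel(group_data=week)` (see the GPBoost documentation for the exact argument names and defaults).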

This blog post might also be helpful for you.

mglowacki100 commented 7 months ago

Thanks a lot :)