fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Longitudinal Data #86

Closed simonprovost closed 1 year ago

simonprovost commented 1 year ago

Dear Authors,

Thank you so much for the GPBoost approach. I would like to know whether I can use GPBoost with longitudinal data represented as follows. Several websites suggest that this is possible, but I could not find any examples:

The following is a representation of the data (very simplified):

The sole non-longitudinal characteristic is the patient's name; the rest are represented longitudinally, using the suffixes _1, _2, and _3 to designate waves (time points) one, two, and three (with, e.g., a 1-year gap between waves). The death column is the binary class variable for predicting mortality.

patient_name, age_1, biomarkerX_1, smoke_1, age_2, biomarkerX_2, smoke_2, age_3, biomarkerX_3, smoke_3, death

Cheers,

fabsig commented 1 year ago

Thank you for your interest in GPBoost!

Yes, this can be done. I assume that, e.g., age_1, age_2, etc. represent the same variable measured at different time points. What you need to do is restructure the data from this "wide" format into a "long" format, so that you have only one age variable / column (and likewise one column per other repeated variable), and add a time variable. This gives three times as many data points but "fewer variables".
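The restructuring described above could be sketched with pandas as follows. This is only an illustration: the toy values and the names `wide`, `long_df`, and `time` are assumptions made for the example, and only the column names come from the question. Note that `death` is measured once per patient, so it is simply repeated on each of that patient's rows.

```python
import pandas as pd

# Hypothetical toy data in the wide layout from the question (values are made up).
wide = pd.DataFrame({
    "patient_name": ["A", "B"],
    "age_1": [60, 71], "biomarkerX_1": [1.2, 0.8], "smoke_1": [1, 0],
    "age_2": [61, 72], "biomarkerX_2": [1.4, 0.9], "smoke_2": [1, 0],
    "age_3": [62, 73], "biomarkerX_3": [1.5, 0.7], "smoke_3": [0, 0],
    "death": [1, 0],
})

# Collapse age_1..age_3 etc. into single columns; the numeric suffix
# becomes a new integer column called "time".
long_df = (
    pd.wide_to_long(
        wide,
        stubnames=["age", "biomarkerX", "smoke"],
        i="patient_name",
        j="time",
        sep="_",
    )
    .reset_index()
    .sort_values(["patient_name", "time"])
    .reset_index(drop=True)
)

# One row per patient and wave; "death" is repeated within each patient.
print(long_df)
```

After reshaping, `patient_name` can serve as the grouping variable for a random effect, and `time` can be used as a feature or in the random effects part of the model.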

Then you need to decide which type of longitudinal (random effects) model you want to use. Here are some longitudinal / panel data examples: https://github.com/fabsig/GPBoost/blob/master/examples/python-guide/panel_data_example.py
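As one illustration of what such a model choice could look like, a patient-level random intercept combined with a probit link might be written as below. This is a sketch under assumptions, not necessarily the model used in the linked examples. The notation is introduced here: $i$ indexes patients, $t$ indexes waves, $y_{it}$ is the death label on the row for patient $i$ at wave $t$, $x_{it}$ are the features on that row, $F$ is the tree ensemble, and $b_i$ is the random intercept of patient $i$.

```latex
\begin{align}
P(y_{it} = 1 \mid x_{it}, b_i) &= \Phi\bigl(F(x_{it}) + b_i\bigr), \\
b_i &\sim \mathcal{N}(0, \sigma^2).
\end{align}
```

Other choices are possible, for example adding a patient-specific random slope for time, or replacing the random intercept with a Gaussian process over time within each patient.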