fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

Article on Panel data management? #88

Closed simonprovost closed 1 year ago

simonprovost commented 1 year ago

Hi Fabsig,

Could you point us to a Medium article or scientific publication by you or your team that describes how GPBoost handles panel data using mixed-effects (ME) and ML estimators? For example, one of the questions this morning at the health lab was whether you discard the time indices, or whether the actual time indices are kept and used. Our answer was that they are kept as long as you use mixed effects. However, having an article or publication to refer to would make things easier, if you do not mind :) In case you do not have one, we will include our questions below.

Very excited by GPBoost anyway!

Best wishes,

fabsig commented 1 year ago

No, this does not exist yet. I will put it on my to-do list and let you know once I have something. Thank you for the suggestion!

Let me know if you have particular questions that could be addressed. Also, in case you happen to have or know of a data set that could be used, I would be glad to hear about it.

simonprovost commented 1 year ago

Thanks for your prompt reply. Here is a series of questions to start:

Speaking of data: we are unable to provide a complete dataset because our health team is still working on it, but we drew inspiration from the following work, which adapts a random forest to handle the longitudinal aspect of the data and discusses its data in Section 3:

The English Longitudinal Study of Ageing (ELSA) is currently one of the most prominent populational studies of ageing [Abell et al., 2018, Banks et al., 2019]. The ELSA has, in each of its waves (timepoints), thousands of respondents from inhabitants of United Kingdom households, which take part in a core interview every two years (the time interval between two consecutive waves), answering questions about various aspects of their lives, including demographic, health, wellbeing and economics. 

Data from this core questionnaire is used to create the class labels for all our datasets, and to create the ELSA-core datasets. For this project, we used data from the core waves 1-7 (2002-2014) to create the features of the ELSA-core datasets, and data from the core wave 8 to create the class label for all our datasets.

Ribeiro, C.E., 2022. New Longitudinal Classification Approaches and Applications to Age-Related Disease Data. University of Kent (United Kingdom).  


simonprovost commented 1 year ago

Any chance you have seen our questions? :)

Cheers

fabsig commented 1 year ago

Thank you. Yes, I took notice :-). As you might imagine, I have other things on my agenda with higher priority. No guarantees that this will be done fast...

simonprovost commented 1 year ago

Certainly, that makes perfect sense. Good luck with your professorial responsibilities. We look forward to hearing from you at your earliest convenience about these questions, rather than the article, which would surely take significantly longer and for which you might not have time right now; we totally understand 🙏

We appreciate your time and kindness, professor. Have a great day!

simonprovost commented 1 year ago

@fabsig I am sorry for the spamming; if you do not have time, feel free to ignore this.

Would GPBoost be able to classify longitudinal binary data? I am reading a paper, in addition to the one you wrote for GPBoost (which is not focused on panel data, as you indicated), in which regression is the most discussed topic (e.g. the RMSE metric) and classification-based concepts (F1, recall, AUROC, etc.) are barely used. However, a quick search of your package reveals GPBoostClassifier, which I assume is what I am looking for, correct?

Furthermore, I am happy that my PR was of help.

Cheers,

fabsig commented 1 year ago

Yes. In brief, with GPBoost you can do (almost) everything for classification in the same way as for regression. Have a look at the demo code: https://github.com/fabsig/GPBoost/blob/master/examples/python-guide/generalized_linear_Gaussian_process_mixed_effects_models.py

Just set likelihood = "bernoulli_logit" (=classification) instead of likelihood = "gaussian" (=regression).
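The likelihood switch described above can be sketched on simulated binary panel data as follows. This is a minimal illustration, not code from the linked demo: the helper names `simulate_binary_panel` and `fit_and_predict`, the simulated data, and the tuning parameters are assumptions made up for this example, and the `gpboost` calls (`GPModel`, `Dataset`, `train`, `predict` with `pred_latent`) follow the Python package as I understand it, so argument names may differ between versions. The subject ID is used as the grouping variable for a random intercept, and the wave (time index) is kept as an ordinary feature.

```python
import numpy as np


def simulate_binary_panel(n_subjects=50, n_waves=4, seed=0):
    """Simulate long-format binary panel data: one row per (subject, wave)."""
    rng = np.random.default_rng(seed)
    subject = np.repeat(np.arange(n_subjects), n_waves)  # grouping variable
    wave = np.tile(np.arange(n_waves), n_subjects)       # time index, kept as a feature
    x = rng.normal(size=n_subjects * n_waves)
    b = rng.normal(size=n_subjects)                      # subject-level random intercepts
    latent = np.sin(x) + 0.2 * wave + b[subject]
    prob = 1.0 / (1.0 + np.exp(-latent))                 # inverse logit link
    y = rng.binomial(1, prob)
    X = np.column_stack([x, wave])
    return X, y, subject


def fit_and_predict(X, y, subject, X_new, subject_new, num_boost_round=50):
    """Fit tree-boosting plus a grouped random intercept; return P(y = 1)."""
    # Imported lazily so the simulation helper above runs without gpboost.
    import gpboost as gpb

    # likelihood="gaussian" would give the regression variant instead.
    gp_model = gpb.GPModel(group_data=subject, likelihood="bernoulli_logit")
    train_set = gpb.Dataset(X, y)
    params = {"learning_rate": 0.1, "max_depth": 3, "verbose": 0}  # illustrative values
    bst = gpb.train(params=params, train_set=train_set, gp_model=gp_model,
                    num_boost_round=num_boost_round)
    pred = bst.predict(data=X_new, group_data_pred=subject_new, pred_latent=False)
    return pred["response_mean"]
```

With `gpboost` installed, `X, y, subject = simulate_binary_panel()` followed by `fit_and_predict(X, y, subject, X, subject)` should return one predicted probability per row, which can then be scored with classification metrics such as AUROC or F1.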

simonprovost commented 1 year ago

Hi @fabsig ,

I hope all is well with you. I wanted to update you on my progress in understanding GPBoost, as I had time today to look into it.

Initially, I reviewed mixed effects models (MEMs) and generalised linear mixed effects models (GLMMs) (including their tree-based variants), with GLMMs providing more flexibility than MEMs, e.g. through the use of link functions. Then, thanks to your Medium blog post, I was able to understand how GPBoost may operate on panel data; more on this below.

First and foremost, could you kindly let us know whether your package fundamentally implements the GLMM tree concept or an improvement of it? By GLMM tree concept, I mean this one. If so, my understanding of how GLMM trees and GPBoost broadly function is as follows:

Note, however, that I have not gone into great detail, because we lack the time and space to explain everything in depth as we would in a paper:

Generalized linear mixed effects models (GLMMs) are a flexible extension of mixed effects models (MEMs) that enable the examination of a broader range of research questions (classification, regression, etc.), for instance through the elegant addition of a link function (e.g. log-likelihood, etc.). As a result, GLMMs have been combined with tree-based machine learning algorithms, where trees offer a number of benefits that I assume the reader of this comment knows. Thus, if GPBoost employs the GLMM paradigm or an improvement of it, a decision tree would function as follows (for GLMMs at the very least):

When fitting a decision tree using generalized linear mixed models (GLMMs), the fixed and random effects coefficients are determined at the start of the splitting procedure of each i-th node by fitting a mixed-effects model (specifically, a GLMM) to the data within that node. The splitting process then uses the GLMM's coefficients, and the variable exhibiting the strongest association with the outcome variable is selected for splitting (for instance, by using the GLMM's log-likelihood link function instead of information gain or Gini impurity). As a result, GLMM trees serve as adaptable statistical instruments capable of identifying relationships between predictor and outcome variables in multilevel and longitudinal datasets. This makes them valuable for addressing various (in our case, clinical) decision-making questions.

GPBoost, on the other hand, combines the strengths of gradient boosting (through ensemble tree learners as in LightGBM) and generalized linear mixed effects models (GLMMs). For a given i-th node, GPBoost computes the fixed effect using a GLMM and then determines the optimal split based on the GLMM's link function outputs, using the given link function (e.g. log-likelihood) and the global estimate of the random effects for the entire model. This makes it more efficient and scalable than GLMM trees, which compute both fixed and random effects at each i-th node. It also enables GPBoost to handle a wide variety of data structures and complexities, making it a potent modelling tool for hierarchical data, panel data, and other complex data structures. Furthermore, the use of LightGBM allows for large-scale machine learning applications.

  1. If you agree in general with the preceding paragraphs, then GPBoost is the appropriate solution to use as of today, at least for our clinical use case! What is your opinion?
  2. However, I am curious as to why you would use global random effects as opposed to random effects per node. GLMMs are known to be expensive, so I am sure this is faster than computing them for each node, but was there another motivation?
  3. If you could pinpoint where in the C++ dependency code we can see the splitting procedure, we would also appreciate it, but it is fine if you do not have time for this.
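
For comparison with the description above, the model in the GPBoost paper (Sigrist, 2022, "Gaussian Process Boosting") is usually written roughly as below. This is a sketch of the published formulation as I recall it, not something stated in this thread, and the notation ($F$, $Z$, $b$, $\theta$, $\nu$) is chosen for illustration. In this formulation the tree ensemble $F$ is one function learned over all the data, and the random effects enter through a single covariance model whose parameters are re-estimated between boosting iterations, rather than through a GLMM fitted per node.

```latex
% Gaussian likelihood: tree ensemble F plus grouped random effects or GP b
y = F(X) + Z b + \varepsilon, \qquad
b \sim \mathcal{N}(0, \Sigma_\theta), \qquad
\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)

% Non-Gaussian likelihoods (e.g. bernoulli_logit): same latent predictor
y_i \mid \mu_i \sim p(y_i \mid \mu_i), \qquad \mu = F(X) + Z b

% Boosting iteration m: update covariance parameters, then add one tree f_m
\theta_m = \arg\min_{\theta} L\bigl(y, F_{m-1}, \theta\bigr), \qquad
F_m = F_{m-1} + \nu f_m
```

Here $L$ denotes the negative log marginal likelihood, $\nu$ the learning rate, and $f_m$ a regression tree fitted to a functional gradient (or Newton) step of $L$ with respect to $F$ at $\theta_m$.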

Thank you for your time @fabsig

simonprovost commented 1 year ago

Hi @fabsig ,

A paper was sent to our hospital lab yesterday.

[Jie et al., 2017] Biao Jie, Mingxia Liu, Jun Liu, Daoqiang Zhang, and Dinggang Shen. Temporally constrained group sparse learning for longitudinal data analysis in alzheimer’s disease. IEEE Transactions on Biomedical Engineering, 64(1):238–249, 2017.

The paper is [Jie et al., 2017], which categorises supervised ML methods that use data from multiple time-points into four categories, based on the number of input and output time-points used by the ML method: (1) Single-time-point Input and Single-time-point Output (SISO), (2) Single-time-point Input and Multiple-time-points Output (SIMO), (3) Multiple-time-points Input and Single-time-point Output (MISO), and (4) Multiple-time-points Input and Multiple-time-points Output (MIMO).

In a nutshell, a SISO dataset has a single wave with features and target variables. A SIMO dataset also has features from a single wave, but the target variables span multiple waves. A MISO dataset has features in multiple waves but target variables in a single wave (typically, the last wave). A MIMO dataset has both features and target variables available in multiple waves.
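
The four settings above can be sketched as array shapes. This is only an illustration on random data: the sizes, the variable names, and the choice of which wave counts as the "single" one (first wave for SIMO inputs, last wave otherwise) are assumptions made for this example, not taken from [Jie et al., 2017]. The last lines flatten the MIMO case into long format, with one row per (subject, wave), which is the layout mixed-effects software generally expects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_waves, n_features = 5, 4, 3

# Full panel: features and binary targets observed at every wave.
X_all = rng.normal(size=(n_subjects, n_waves, n_features))
y_all = rng.integers(0, 2, size=(n_subjects, n_waves))

settings = {
    "SISO": (X_all[:, -1, :], y_all[:, -1]),  # one wave of features, one wave of targets
    "SIMO": (X_all[:, 0, :], y_all),          # first-wave features, targets at all waves
    "MISO": (X_all, y_all[:, -1]),            # features at all waves, last-wave target
    "MIMO": (X_all, y_all),                   # features and targets at all waves
}
shapes = {name: (X.shape, y.shape) for name, (X, y) in settings.items()}
# SISO: (5, 3) and (5,)      SIMO: (5, 3) and (5, 4)
# MISO: (5, 4, 3) and (5,)   MIMO: (5, 4, 3) and (5, 4)

# Long format for the MIMO case: rows are ordered subject by subject, wave by wave.
X_long = X_all.reshape(n_subjects * n_waves, n_features)
y_long = y_all.reshape(-1)
subject = np.repeat(np.arange(n_subjects), n_waves)  # grouping variable
wave = np.tile(np.arange(n_waves), n_subjects)       # time index
```

In the long layout, `subject` can serve as the grouping variable for random effects and `wave` as a time covariate.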

Hence, I reckon that GPBoost handles only MIMO, correct? It cannot, for example, handle MISO, correct?

simonprovost commented 1 year ago

I think this has taken too much time. I'll close the issue; please reopen it, professor, when you are a little less busy and ready to answer. Cheers.

fabsig commented 1 year ago

Hi @simonprovost: It took a while, but I finally managed to write a blog post on handling longitudinal / panel data with GPBoost.