POskar / GenerativeMLForPrecisionMedicine

Master's thesis project on the topic of "Generative Machine Learning for Precision Medicine"
https://www.simula.no/education/masters-students/masters-projects/generative-machine-learning-precision-medicine/

Build statistical-based models #7

Closed POskar closed 5 months ago

POskar commented 8 months ago

Statistical-based models: kernel density estimation, Gaussian mixture model (GMM)

POskar commented 8 months ago

Hi, sounds like a good plan. Given an n x p data matrix (n observations of dimension p), estimating a multivariate Gaussian distribution involves one column at a time for the mean vector and two columns at a time for the covariances. A standard procedure for estimating a mean is to average the values that are not missing. Similarly, when estimating a covariance, use only the pairs with no missing values. In R, I think the function FitMVN handles missing values in this way.
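A minimal sketch of the pairwise-complete estimate described above, in Python with NumPy. The function name `pairwise_mean_cov`, the use of NaN to mark missing entries, and the toy matrix are illustrative assumptions; this is not code from this project and is not claimed to match what FitMVN does internally.

```python
import numpy as np


def pairwise_mean_cov(X):
    """Estimate mean and covariance when NaN marks a missing entry.

    Each mean uses every observed value in its column. Each covariance
    entry uses only the rows where both columns are observed.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = np.nanmean(X, axis=0)
    cov = np.full((p, p), np.nan)
    for j in range(p):
        for k in range(j, p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            n_pairs = int(both.sum())
            if n_pairs < 2:
                continue  # too few complete pairs; leave the entry as NaN
            xj, xk = X[both, j], X[both, k]
            # Centre on the means of the pairwise-complete subset.
            c = np.sum((xj - xj.mean()) * (xk - xk.mean())) / (n_pairs - 1)
            cov[j, k] = cov[k, j] = c
    return mean, cov


# Hypothetical toy data: 4 observations, 3 variables.
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, np.nan],
])
mean, cov = pairwise_mean_cov(X)
```

Because every entry is estimated from a different subset of rows, the resulting matrix is not guaranteed to be positive semidefinite, which matters if it is later used for conditioning or sampling.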

Otherwise, I think the Expectation-Maximization (EM) algorithm is a well-known and efficient algorithm for randomly missing values:

https://www.jstor.org/stable/pdf/2984875.pdf?casa_token=7GEMhOr6CMkAAAAA:Ytjk2UQe-fcQzU-FxxEwPCWlElpemvu3Mmc8j1TQMI-xo6ybZvaj1W-j-ovryTgMc9yBfK17HNWFaeFAI8PB8VPV92VUYwmyQH3_b5wrUC7lQ6J6cK3eFw

I think the function em.norm in the norm package is an implementation of this method: https://cran.r-project.org/web/packages/norm/norm.pdf
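For reference, a rough Python sketch of EM for a single multivariate Gaussian with missing entries. It is a textbook-style illustration, not a port of em.norm: the function name `em_mvn`, the NaN convention, the initialisation, the stopping rule, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def em_mvn(X, n_iter=100, tol=1e-6):
    """EM estimate of the mean and covariance of a Gaussian from data with NaNs."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Initialise from the observed values, with a diagonal covariance.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = X.copy()
        C_sum = np.zeros((p, p))
        # E-step: fill each missing block with its conditional mean and
        # accumulate the conditional covariance of the filled-in part.
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # nothing observed in this row
                X_hat[i] = mu
                C_sum += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            B = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
            X_hat[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ S_mo.T
        # M-step: update the parameters from the completed statistics.
        mu_new = X_hat.mean(axis=0)
        D = X_hat - mu_new
        sigma_new = (D.T @ D + C_sum) / n
        change = max(np.max(np.abs(mu_new - mu)),
                     np.max(np.abs(sigma_new - sigma)))
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma


# Simulated check: known parameters, 20% of the entries removed at random.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0, 0.5])
true_sigma = np.array([[2.0, 0.8, 0.3],
                       [0.8, 1.0, 0.2],
                       [0.3, 0.2, 1.5]])
X_sim = rng.multivariate_normal(true_mu, true_sigma, size=1000)
X_sim[rng.random(X_sim.shape) < 0.2] = np.nan
mu_hat, sigma_hat = em_mvn(X_sim)
```

Unlike the pairwise approach, every row contributes to every parameter here, and the covariance update is a sum of positive semidefinite terms.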

Best regards / beste hilsen

Hugo

From: Luis M. Lopez-Ramos [luis@simula.no](mailto:luis@simula.no)
Sent: Friday, January 26, 2024 4:49 PM
To: Hugo Lewi Hammer [hugoh@oslomet.no](mailto:hugoh@oslomet.no)
Cc: Oskar Pieniak [s377117@oslomet.no](mailto:s377117@oslomet.no)
Subject: Estimating Gaussian model parameters with missing entries in training data

Hi Hugo, One of the first tests we plan to run with the stroke data (we received a small sample from the Validate project that we can start working with) is to estimate the parameters of a Gaussian distribution, compute the conditional distributions for incomplete test inputs, and sample from the conditional distribution. This would be the first, rudimentary "generative model" and we will compare later developments against it.
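The conditioning and sampling step described in this paragraph can be sketched as follows, assuming the Gaussian parameters are already available and that NaN marks the missing entries of a test input. The function names `conditional_gaussian` and `sample_completions` and the two-variable example are hypothetical and only illustrate the standard conditional Gaussian formulas.

```python
import numpy as np


def conditional_gaussian(mu, sigma, x):
    """Return (missing mask, conditional mean, conditional covariance).

    The distribution is that of the missing entries of x given the
    observed ones, under a Gaussian with mean mu and covariance sigma.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)
    o = ~m
    S_oo = sigma[np.ix_(o, o)]
    S_mo = sigma[np.ix_(m, o)]
    B = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
    cond_mean = mu[m] + B @ (x[o] - mu[o])
    cond_cov = sigma[np.ix_(m, m)] - B @ S_mo.T
    return m, cond_mean, cond_cov


def sample_completions(mu, sigma, x, n_samples, rng):
    """Draw completed copies of x with the missing entries sampled."""
    m, cond_mean, cond_cov = conditional_gaussian(mu, sigma, x)
    out = np.tile(np.asarray(x, dtype=float), (n_samples, 1))
    out[:, m] = rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)
    return out


# Hypothetical two-variable example: the second variable is missing.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([1.0, np.nan])
mask, cond_mean, cond_cov = conditional_gaussian(mu, sigma, x)
samples = sample_completions(mu, sigma, x, n_samples=5,
                             rng=np.random.default_rng(0))
```

In this example the conditional mean of the second variable is 0.5 and its conditional variance is 1.75; the observed entry is copied unchanged into every sample.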

What I am not clear about is how to estimate such parameters when most entries in the training data have one or more missing variables. Do you have any suggestions?

--

Best regards,

Luis Miguel Lopez-Ramos

Postdoctoral researcher, SimulaMet

POskar commented 7 months ago

Working on a GMM that can generate synthetic data by sampling.

POskar commented 5 months ago

Created both versions of the conditional GMM: one with a single component (a multivariate normal distribution) and one with the number of components chosen by BIC score.
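For readers unfamiliar with BIC-based selection, a minimal sketch using scikit-learn's `GaussianMixture` is shown here. It is not the implementation from this repository: the helper name `fit_gmm_by_bic`, the candidate range of 1 to 5 components, and the simulated two-cluster data are assumptions, and the sketch covers only model selection and unconditional sampling on complete data (`GaussianMixture` does not accept NaN inputs).

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(X, max_components=5, random_state=0):
    """Fit GMMs with 1..max_components components and keep the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        model = GaussianMixture(n_components=k, covariance_type="full",
                                random_state=random_state).fit(X)
        bic = model.bic(X)  # lower is better
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model, best_bic


# Simulated data: two well-separated clusters in two dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-4.0, 1.0, size=(200, 2)),
    rng.normal(4.0, 1.0, size=(200, 2)),
])
gmm, bic = fit_gmm_by_bic(X_demo)
synthetic, labels = gmm.sample(100)  # synthetic rows and their component labels
```

With `max_components=1` the same helper reduces to the single-component case, that is, one multivariate normal distribution.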