Closed by POskar 5 months ago
Hi. Sounds like a good plan. Given an n x p data matrix (n observations of dimension p), estimating a multivariate Gaussian distribution involves one column at a time for the mean vector and two columns at a time for the covariances. A standard procedure for estimating a mean is to average the values that are not missing. Similarly, a covariance is estimated over the pairs in which neither value is missing. In R, I think the function FitMVN handles missing values in this way.
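A minimal sketch of the available-case estimate described above, in Python with NumPy. The function name `pairwise_complete_moments` is hypothetical, missing values are assumed to be encoded as NaN, and this is not a port of FitMVN. Each covariance entry uses the means of the rows where both columns are observed, which is one common convention.

```python
import numpy as np

def pairwise_complete_moments(X):
    """Mean vector and covariance matrix from the non-missing entries.

    X is an (n, p) array in which NaN marks a missing value (assumption).
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    p = X.shape[1]

    # Mean of each column over its observed entries only.
    mean = np.nanmean(X, axis=0)

    cov = np.full((p, p), np.nan)
    for j in range(p):
        for k in range(j, p):
            both = observed[:, j] & observed[:, k]
            m = both.sum()
            if m < 2:
                continue  # too few complete pairs; entry stays NaN
            xj = X[both, j]
            xk = X[both, k]
            # Sample covariance over rows where both columns are observed.
            c = np.sum((xj - xj.mean()) * (xk - xk.mean())) / (m - 1)
            cov[j, k] = cov[k, j] = c
    return mean, cov
```

Because every entry is computed from a different subset of rows, the resulting matrix is not guaranteed to be positive semi-definite, so it may need checking before it is used as a Gaussian covariance.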
Otherwise, I think the Expectation-Maximization algorithm is a well-known and efficient method when values are missing at random.
I think the function em.norm in the norm package is an implementation of this method: https://cran.r-project.org/web/packages/norm/norm.pdf
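For illustration, here is a rough Python/NumPy sketch of EM for a multivariate Gaussian with missing entries. It is not the em.norm implementation; the function name `em_mvn`, the NaN encoding, the starting values, and the stopping rule are all assumptions made for this example. The E-step fills in each row's missing coordinates with their conditional mean and adds the conditional covariance to the second moment; the M-step recomputes the mean and covariance from those expected statistics.

```python
import numpy as np

def em_mvn(X, n_iter=200, tol=1e-8):
    """EM estimate of (mu, sigma) from an (n, p) array with NaN for missing."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)

    # Starting values (assumption): column-wise available-case moments.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))

    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for i in range(n):
            m = miss[i]
            o = ~m
            x_hat = X[i].copy()
            c = np.zeros((p, p))
            if m.all():
                # Nothing observed: the expectation is the current model itself.
                x_hat = mu.copy()
                c = sigma.copy()
            elif m.any():
                s_oo = sigma[np.ix_(o, o)]
                s_mo = sigma[np.ix_(m, o)]
                # Regression of the missing block on the observed block.
                b = np.linalg.solve(s_oo, s_mo.T).T
                x_hat[m] = mu[m] + b @ (X[i, o] - mu[o])
                c[np.ix_(m, m)] = sigma[np.ix_(m, m)] - b @ s_mo.T
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + c

        # M-step: maximum-likelihood updates from the expected statistics.
        mu_new = sum_x / n
        sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma
```

With no missing values this reduces to the sample mean and the maximum-likelihood (divide by n) covariance. The sketch assumes no observed column is constant, since that would make the starting covariance singular.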
Best regards
Hugo
From: Luis M. Lopez-Ramos [luis@simula.no](mailto:luis@simula.no)
Sent: Friday, January 26, 2024 4:49 PM
To: Hugo Lewi Hammer [hugoh@oslomet.no](mailto:hugoh@oslomet.no)
Cc: Oskar Pieniak [s377117@oslomet.no](mailto:s377117@oslomet.no)
Subject: Estimating Gaussian model parameters with missing entries in training data
Hi Hugo, One of the first tests we plan to run with the stroke data (we received a small sample from the Validate project that we can start working with) is to estimate the parameters of a Gaussian distribution, compute the conditional distributions for incomplete test inputs, and sample from the conditional distribution. This would be the first, rudimentary "generative model" and we will compare later developments against it.
What I am not clear about is how to estimate such parameters when most entries in the training data have one or more missing variables. Do you have any suggestions?
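The conditioning and sampling step mentioned in this message can be sketched as follows, once a mean vector and covariance matrix are available. This is an illustrative Python/NumPy example, not the project's code: the function names are hypothetical, NaN is assumed to mark the missing entries of a test input, and the input is assumed to have at least one observed and at least one missing entry. It uses the standard Gaussian conditioning formulas, mean_m + S_mo S_oo^{-1} (x_o - mean_o) and S_mm - S_mo S_oo^{-1} S_om.

```python
import numpy as np

def conditional_gaussian(mu, sigma, x):
    """Mean and covariance of the missing entries of x given the observed ones."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)   # missing coordinates
    o = ~m            # observed coordinates
    s_oo = sigma[np.ix_(o, o)]
    s_mo = sigma[np.ix_(m, o)]
    b = np.linalg.solve(s_oo, s_mo.T).T          # S_mo @ inv(S_oo)
    cond_mean = mu[m] + b @ (x[o] - mu[o])
    cond_cov = sigma[np.ix_(m, m)] - b @ s_mo.T
    return cond_mean, cond_cov

def sample_completions(mu, sigma, x, n_samples=1, rng=None):
    """Draw completed copies of x with missing entries sampled conditionally."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)
    cond_mean, cond_cov = conditional_gaussian(mu, sigma, x)
    cond_cov = (cond_cov + cond_cov.T) / 2       # remove round-off asymmetry
    draws = rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)
    out = np.tile(x, (n_samples, 1))
    out[:, m] = draws                            # observed entries are kept as-is
    return out
```

For example, with zero means, unit variances, and correlation 0.5, observing the first coordinate at 2 gives a conditional mean of 1 and a conditional variance of 0.75 for the second.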
--
Best regards,
Luis Miguel Lopez-Ramos
Postdoctoral researcher, SimulaMet
Working on building a GMM that would allow generating synthetic data by sampling.
Created both versions of the conditional GMM: one with a single component (a multivariate normal distribution) and one whose number of components is chosen by BIC score.
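As a hedged illustration of what conditioning a GMM involves (not the code written for this issue), the Python/NumPy sketch below takes already-fitted mixture parameters and an input with NaN for missing entries. The names `log_gauss` and `conditional_gmm` are hypothetical, and fitting the mixture and choosing the number of components by BIC are assumed to have been done elsewhere. Each component is conditioned like a single Gaussian, and the mixture weights are re-weighted by how well each component explains the observed entries.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log density of a multivariate normal at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = d @ np.linalg.solve(sigma, d)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def conditional_gmm(weights, means, covs, x):
    """Conditional mixture over the missing entries of x given the observed ones.

    Assumes x has at least one observed and at least one missing entry.
    """
    x = np.asarray(x, dtype=float)
    m = np.isnan(x)
    o = ~m
    log_w, c_means, c_covs = [], [], []
    for w, mu, s in zip(weights, means, covs):
        mu = np.asarray(mu, dtype=float)
        s = np.asarray(s, dtype=float)
        s_oo = s[np.ix_(o, o)]
        s_mo = s[np.ix_(m, o)]
        b = np.linalg.solve(s_oo, s_mo.T).T      # S_mo @ inv(S_oo)
        c_means.append(mu[m] + b @ (x[o] - mu[o]))
        c_covs.append(s[np.ix_(m, m)] - b @ s_mo.T)
        # New weight is proportional to prior weight times marginal likelihood.
        log_w.append(np.log(w) + log_gauss(x[o], mu[o], s_oo))
    log_w = np.array(log_w)
    w_new = np.exp(log_w - log_w.max())          # subtract max for stability
    w_new /= w_new.sum()
    return w_new, np.array(c_means), np.array(c_covs)
```

With a single component the weights are trivially 1 and the result is the ordinary conditional of a multivariate normal, which corresponds to the single-component version mentioned above. Sampling would then pick a component according to the new weights and draw from its conditional Gaussian.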
Statistics-based models: kernel density estimation, Gaussian mixture model