amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Multivariate PMM #429

Closed prockenschaub closed 2 years ago

prockenschaub commented 3 years ago

Background

I work a lot with routinely collected hospital data. Among other things, this type of data contains laboratory measurements that are often taken as panels (i.e., they are present or absent together). A good example is the full blood count (platelets, white blood cells, red blood cells, haemoglobin, ...). If a full blood count was performed, these parameters are usually all measured. If no blood count was performed, none of those values are available.

Problem statement

If I want to impute a full blood count using predictive mean matching (PMM), I currently need to impute each parameter univariately. This works in principle but requires some tweaking of the predictorMatrix, because many components of the panel are strongly correlated, which can lead to non-convergence. Furthermore, imputing the values univariately may fail to preserve any (hypothetical) joint distribution of those values.

Potential solution

In section 4.7.2 of Van Buuren (2018), @stefvanbuuren suggests a multivariate generalisation of the PMM algorithm that may be used within blocks. This method isn't currently implemented in mice. As part of a project, I have implemented a prototype of multivariate PMM following the guidance in Little (1988).

Questions

  1. Is there an appetite to make this algorithm available within mice?
  2. If yes, does the approach I have taken seem sensible? Could the design of the function (or the handling of blocks in mice in general) be further improved? For example, it currently only works with formulas (for a reason similar to the one behind #379).

References

Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC. Boca Raton, FL.

Little, R. J. A. (1988). Missing-Data Adjustments in Large Surveys (with Discussion). Journal of Business & Economic Statistics, 6(3), 287–301.

gerkovink commented 3 years ago

I think this is useful. I'm not exactly sure yet how you do the matching, but @Mingyang-Cai has developed methodology for multivariate imputation by means of canonical regression analysis. That seems like a solution for your motivating example, too.

prockenschaub commented 3 years ago

My preliminary solution for matching the mean vectors is a k-nearest-neighbour approach via the RANN package. Little (1988) suggests scaling the predicted means by their standard deviation; I have made this the default, and it can be deactivated via scale=FALSE.
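
To make the matching step concrete, here is a minimal sketch of the idea. It is written in plain Python purely for illustration and is not the prototype's code: the function names `column_sds` and `match_donors` and the toy numbers are hypothetical, a brute-force distance search stands in for RANN, and taking the scaling factor to be the column-wise standard deviation of the donors' predicted means is an assumption on my side. The predicted mean vectors would come from whatever regression model is fitted within the block.

```python
import math
import random


def column_sds(rows):
    """Sample standard deviation of each column (1.0 if a column is constant)."""
    n = len(rows)
    sds = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        sds.append(math.sqrt(var) if var > 0 else 1.0)
    return sds


def match_donors(pred_obs, pred_mis, y_obs, k=5, scale=True, rng=None):
    """For each incomplete row, copy the whole observed vector of one of its k closest donors.

    pred_obs: predicted mean vectors for rows where the panel is observed (donors)
    pred_mis: predicted mean vectors for rows where the panel is missing
    y_obs:    observed panel values of the donors, in the same order as pred_obs
    """
    rng = rng or random.Random()
    # Assumption: scale by the SD of the donors' predicted means, column by column.
    sds = column_sds(pred_obs) if scale else [1.0] * len(pred_obs[0])
    scaled_obs = [[v / s for v, s in zip(row, sds)] for row in pred_obs]
    imputed = []
    for row in pred_mis:
        target = [v / s for v, s in zip(row, sds)]
        # Brute-force Euclidean distances to every donor (a k-d tree would be faster).
        dists = [math.dist(target, donor) for donor in scaled_obs]
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        donor = rng.choice(nearest)
        # The donor's values are taken jointly, so observed combinations are preserved.
        imputed.append(list(y_obs[donor]))
    return imputed


# Toy example: four donors, one incomplete row, two variables in the block.
pred_obs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]]
y_obs = [[1.1, 9.0], [2.2, 21.0], [2.9, 33.0], [9.5, 98.0]]
pred_mis = [[2.1, 19.0]]
imputed = match_donors(pred_obs, pred_mis, y_obs, k=1)  # [[2.2, 21.0]]
```

With k=1 the closest donor is always chosen; with a larger k the donor is drawn at random from the k closest candidates. Because the complete donor row is copied, every imputed vector is a combination of values that was actually observed together.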

One aspect I am currently struggling with is how to exhaustively evaluate my implementation to make sure it returns sensible results. If someone has suggestions on how to do this, I would be all ears!

I would also be very interested to see the canonical regression approach and to compare the results.

gerkovink commented 2 years ago

See #460

stefvanbuuren commented 2 years ago

Closing because there is now mice.impute.mpmm(). Feel free to reopen for other ideas on implementation.