Closed prockenschaub closed 2 years ago
I think this is useful. Not exactly sure how you do the matching yet, but @Mingyang-Cai has developed methodology to do multivariate imputation by means of canonical regression analysis. Seems like a solution for your motivating example, too
My preliminary solution to matching the mean vectors has been a k-nearest neighbour approach via the RANN package. Little (1988) suggests scaling the predicted means by their standard deviation, which I have chosen as the default but which can be deactivated via `scale=FALSE`.
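To make the matching step concrete, here is a minimal sketch in plain Python rather than R. It is not the prototype's code and does not use RANN (it does a brute-force neighbour search instead). The function name `match_donors`, its arguments, and the choice to compute one standard deviation per column over donors and recipients together are assumptions made purely for illustration.

```python
import math
import random
import statistics


def match_donors(pred_obs, pred_mis, y_obs, k=3, scale=True, rng=None):
    """For each recipient, pick one of its k nearest donors (by predicted
    mean vector) and copy that donor's complete observed row."""
    rng = rng or random.Random()
    n_cols = len(pred_obs[0])
    if scale:
        # Assumption: one SD per column, pooled over donors and recipients.
        sds = [statistics.stdev([row[j] for row in pred_obs + pred_mis])
               for j in range(n_cols)]
        sds = [s if s > 0 else 1.0 for s in sds]  # guard constant columns
    else:
        sds = [1.0] * n_cols

    def scaled(row):
        return [v / s for v, s in zip(row, sds)]

    donors = [scaled(row) for row in pred_obs]
    imputed = []
    for row in pred_mis:
        target = scaled(row)
        # Brute-force search: rank all donors by Euclidean distance.
        order = sorted(range(len(donors)),
                       key=lambda i: math.dist(donors[i], target))
        chosen = rng.choice(order[:k])
        # Copy the whole donor row so the panel values stay together.
        imputed.append(list(y_obs[chosen]))
    return imputed


# Toy data: predicted means for complete rows (donors) and incomplete rows.
pred_obs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]]
y_obs = [[1.1, 9.0], [2.2, 21.0], [2.9, 33.0], [10.5, 98.0]]
pred_mis = [[1.1, 11.0], [9.0, 95.0]]

imputed = match_donors(pred_obs, pred_mis, y_obs, k=1)
print(imputed)  # [[1.1, 9.0], [10.5, 98.0]]
```

Because each imputation is an entire observed row taken from a single donor, the imputed panel always consists of values that actually occurred together.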
One aspect I am currently struggling with is how to exhaustively evaluate my implementation to make sure it returns sensible results. If someone has suggestions on how to do this, I would be all ears!
Very interested also to see the canonical regression approach and compare the results.
See #460
Closing because there is now `mice.impute.mpmm()`. Feel free to reopen for other ideas on implementation.
Background
I am working a lot with routinely collected hospital data. Among other things, this type of data contains laboratory measurements that are often taken as panels (i.e., they are present or absent together). A good example is the full blood count (platelets, white blood cells, red blood cells, haemoglobin, ...). If a full blood count was performed, these parameters are usually all measured. If no blood count was performed, none of those values are available.
Problem statement
If I want to impute a full blood count using predictive mean matching (PMM), I currently need to do so univariately. This works in principle but needs some tweaking of the `predictorMatrix`, because many components of the blood count are strongly correlated, which can lead to non-convergence. Furthermore, imputing values univariately may fail to preserve any (hypothetical) joint distribution of those values.
Potential solution
In Section 4.7.2 of Van Buuren (2018), @stefvanbuuren suggests a multivariate generalisation of the PMM algorithm that may be used within blocks. This method isn't currently implemented in `mice`. As part of a project, I have implemented a prototype of multivariate PMM following the guidance in Little (1988).
Questions
[...] `mice`?
[...] `mice` in general) be further improved? For example, it currently only works with formulas (for a reason similar to the one behind #379).
References
Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC. Boca Raton, FL.
Little, R. J. A. (1988). Missing-Data Adjustments in Large Surveys (with Discussion). Journal of Business & Economic Statistics, 6(3), 287–301.