Imputing categorical data by predictive mean matching

stefvanbuuren commented 1 year ago

Predictive mean matching (PMM) is the default method of mice for imputing numerical variables, but it has long been possible to impute factors. This PR introduces better support to work with categorical variables in PMM.

The former system worked as follows: If we specify PMM for an unordered factor, then the similarity among potential donors is expressed on the linear predictor, and we take the observed category of a random draw among the five closest donor cases. As the linear predictor summarizes the available predictive information, matching should produce reasonable imputations. This method is fast and robust against empty cell and fitting problems. The downside is that it depends on category order. In particular, in mice.impute.pmm() we have the shortcut:

 ynum <- y
  if (is.factor(y)) {
    ynum <- as.integer(y)
  }

The order of integers in ynum may have no sensible interpretation for an unordered factor. The problem is less likely to surface for ordered factors, though there is still the assumption that the categories are equidistant.

The new system quantifies ynum and could yield better results because of higher $R^2$. The PR follows a similar strategy as Frank Harrell's function Hmisc::aregImpute(). The method calculates the canonical correlation between y (as dummy matrix) and a linear combination of imputation model predictors x. Similar methods are known as MORALS (Gifi, 1980) or ACE (Breiman and Friedman, 1985). The algorithm then replaces each category of y by a single number taken from the first canonical variate. After this step, the imputation model is fitted, and the predicted values from that model are extracted to function as the similarity measure for the matching step.

The method works for both ordered and unordered factors. No special precautions are taken to ensure monotonicity between the category numbers and the quantifications, so the method should be able to preserve quadratic and other non-monotone relations of the predicted metric. It may be beneficial to remove very sparsely filled categories, for which there is a new trim argument.

Potential advantages are:

Simpler and faster than fitting a generalised linear model, e.g., logistic regression or the proportional odds model;
Should be insensitive to the order of categories;
No need to solve problems with perfect prediction;
Should inherit the good statistical properties of predictive mean matching.

Note that we still lack solid evidence for these claims.

Here are some examples for the new functionality.

library(mice, warn.conflicts = FALSE)

# Factor quantification
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
ry <- !is.na(y)
table(y)
#> y
#>  6  8 10 12 13 15 16 20 25 
#>  1  2  1  1  1  4  1  4  7

# Impute factor by optimizing canonical correlation y, x
mice.impute.pmm(y, ry, x)
#>  [1] 15 25 25 8  20 8  25 8  25 25 20 8  25 8  25 15 20 15 15 25 20 8  16 25 15
#> [26] 25 20 25 20 8  20 25 20 20 15 8  20 15
#> Levels: 6 8 10 12 13 15 16 20 25

# Only categories with at least 2 cases can be donor
mice.impute.pmm(y, ry, x, trim = 2L)
#>  [1] 8  20 8  15 15 25 20 8  8  20 8  8  25 15 15 20 15 8  25 8  20 20 25 25 25
#> [26] 25 15 25 20 20 8  25 8  25 15 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

# In addition, eliminate categories 8 and 20
mice.impute.pmm(y, ry, x, trim = 2L, exclude = c(8, 20))
#>  [1] 15 15 25 25 15 25 25 25 15 15 25 25 25 25 25 15 25 25 25 25 25 25 25 25 15
#> [26] 25 15 25 15 25 25 25 15 25 25 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

# To get old behavior: as.integer(y))
mice.impute.pmm(y, ry, x, quantify = FALSE)
#>  [1] 8  15 12 8  8  12 8  15 8  12 12 10 6  12 12 6  15 8  6  6  15 15 6  10 15
#> [26] 15 10 25 20 15 10 25 8  16 25 25 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

^{Created on 2023-08-07 with reprex v2.0.2}

stefvanbuuren commented 1 year ago

Related #196, #268

stefvanbuuren commented 1 year ago

Some points to myself to consider:

stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;
stats::cancor() and mice:::estimice() do partly similar calculations, so we may potentially speed up by re-using parts of cancor();
Perhaps it is enough to quantify only at initialisation, and fix the quantifications during iteration;
It is not yet clear whether we need additional procedures to incorporate the uncertainty of quantifications;
Would similar strategies works for mice::quickpred() and mice:::remove_lindep()?

hanneoberman commented 1 year ago

Clear potential for improved support for categorical variables. Earlier comment below about recreating old behaviour may be disregarded!

However, I'm wondering when exactly the added value would show up. When reproducing the example, I do not see different results with the new implementation versus the 'legacy code' (specified with quantify = FALSE):

# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])

# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
all.equal(imp_can$imp$y, imp_old$imp$y)
#> [1] TRUE

^{Created on 2023-08-08 with reprex v2.0.2}

How high should $R^2$ be before this new implementation provides more efficient imputations? Or did I not specify the old behavior in the right way?

Update: reprex did not use the right package version.

# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])

# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
complete(imp_can)$y
#>  [1] 25 20 25 25 20 20 20 25 20 25 15 25 20 8  8  20 25 25 25 8  8  8  15 8  10
#> [26] 16 15 20 25 12 13 15 20 25 15 15 25 20 10 6  25 20 25 8  20 25 20 25 25 25
#> [51] 25 16 16 16 13 20 13 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
complete(imp_old)$y
#>  [1] 10 15 6  15 6  6  15 12 6  8  10 6  15 8  8  8  12 8  10 6  8  8  15 12 10
#> [26] 15 15 20 15 12 13 15 20 25 15 15 25 20 25 6  25 20 25 25 15 25 20 15 25 25
#> [51] 25 16 16 25 16 16 25 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25

^{Created on 2023-08-08 with reprex v2.0.2}

stefvanbuuren commented 1 year ago

@hanneoberman Thanks.

How high should be before this new implementation provides more efficient imputations?

Hard to answer in general. We would expect larger differences for variables whose integer category ordering is wrong is some way. For example, physical strength and age has a curve-linear relation. If age is coded as young-middle-old, then imputing age | strength or strength | age using as.integer() has an attenuated slope. The optimal scaling step may quantify them with a different order , e.g. young-old-middle, and hence increase $R^2$. Also, there are many variables that have no inherent order, like color, religion, and postal code that we may want to impute. Scaling these may also result in a better predictive model.

stefvanbuuren commented 1 year ago

stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;

Hmm, we definitely need to increase the robustness of stats::cancor() before merging. https://stackoverflow.com/questions/5850763/canonical-correlation-analysis-in-r

stefvanbuuren commented 1 year ago

I wrote and executed various tests aimed at testing and breaking stats::cancor(), but it held up very well. Some findings:

Adding one or many duplicates in the x variables had no impact on the transform;
Adding many junk variables to x changes the transform, but does not crash cancor();
Setting defaultMethod all to "pmm" and running all tests did not reveal problems related to cancor();
Setting method = "pmm" and running all tests did not reveal problems related to cancor();

All in all, I believe that cancor() handles crappy data quite well. Lack of robustness does not seem to be an issue that should uphold the merge in the mice master branch.

amices / mice

Imputing categorical data by predictive mean matching #576