Closed stefvanbuuren closed 1 year ago
Related #196, #268
Some points to myself to consider:
- stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;
- stats::cancor() and mice:::estimice() do partly similar calculations, so we may potentially speed up by re-using parts of cancor();
- mice::quickpred() and mice:::remove_lindep()?
Clear potential for improved support for categorical variables. Earlier comment below about recreating old behaviour may be disregarded!
However, I'm wondering when exactly the added value would show up. When reproducing the example, I do not see different results with the new implementation versus the 'legacy code' (specified with quantify = FALSE):
# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
all.equal(imp_can$imp$y, imp_old$imp$y)
#> [1] TRUE
Created on 2023-08-08 with reprex v2.0.2
How high should $R^2$ be before this new implementation provides more efficient imputations? Or did I not specify the old behavior in the right way?
Update: reprex did not use the right package version.
# recreating reprex data
library(mice, warn.conflicts = FALSE)
xname <- c("age", "hgt", "wgt")
br <- boys[c(1:10, 101:110, 501:510, 601:620, 701:710), ]
r <- stats::complete.cases(br[, xname])
x <- br[r, xname]
y <- factor(br[r, "tv"])
# imputing with new and old behaviour
dat <- cbind(y, x)
imp_can <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123)
imp_old <- mice(dat, method = "pmm", printFlag = FALSE, seed = 123, quantify = FALSE)
complete(imp_can)$y
#> [1] 25 20 25 25 20 20 20 25 20 25 15 25 20 8 8 20 25 25 25 8 8 8 15 8 10
#> [26] 16 15 20 25 12 13 15 20 25 15 15 25 20 10 6 25 20 25 8 20 25 20 25 25 25
#> [51] 25 16 16 16 13 20 13 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
complete(imp_old)$y
#> [1] 10 15 6 15 6 6 15 12 6 8 10 6 15 8 8 8 12 8 10 6 8 8 15 12 10
#> [26] 15 15 20 15 12 13 15 20 25 15 15 25 20 25 6 25 20 25 25 15 25 20 15 25 25
#> [51] 25 16 16 25 16 16 25 15 25 25
#> Levels: 6 8 10 12 13 15 16 20 25
Created on 2023-08-08 with reprex v2.0.2
@hanneoberman Thanks.
> How high should $R^2$ be before this new implementation provides more efficient imputations?
Hard to answer in general. We would expect larger differences for variables whose integer category ordering is wrong in some way. For example, physical strength and age have a curvilinear relation. If age is coded as young-middle-old, then imputing age | strength or strength | age using as.integer() has an attenuated slope. The optimal scaling step may quantify them with a different order, e.g. young-old-middle, and hence increase $R^2$. Also, there are many variables that have no inherent order, like color, religion, and postal code, that we may want to impute. Scaling these may also result in a better predictive model.
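To make the age/strength point concrete, here is a minimal sketch on simulated data. The group means, sample size and variable names are made up for illustration and do not come from any real dataset. With a single predictor, the optimal quantification of a factor is proportional to the category means of that predictor, so plain group means stand in for the optimal scaling step here; this is not the code used in the PR.

```r
set.seed(1)
n <- 300

# Hypothetical age groups, stored in their "natural" order
age <- factor(sample(c("young", "middle", "old"), n, replace = TRUE),
              levels = c("young", "middle", "old"))

# Simulated strength peaks in the middle group (curvilinear in age order)
mu <- c(young = 40, middle = 60, old = 45)
strength <- unname(mu[as.character(age)]) + rnorm(n, sd = 5)

# Integer coding 1-2-3: only the linear trend over the level order is used
r2_int <- cor(as.integer(age), strength)^2

# Quantified coding: replace each category by its mean strength
q <- sapply(split(strength, age), mean)
age_q <- unname(q[as.character(age)])
r2_quant <- cor(age_q, strength)^2

# Compare explained variance and inspect the implied category order
c(integer = r2_int, quantified = r2_quant)
names(sort(q))
```

In this setup the quantified coding cannot do worse than the integer coding, and the sorted quantifications should place "old" between "young" and "middle", which is the reordering described above.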
> stats::cancor() executes before mice:::estimice(), so we may need to make it more robust and flag loggedEvents in case of multi-collinear data;
Hmm, we definitely need to increase the robustness of stats::cancor() before merging.
https://stackoverflow.com/questions/5850763/canonical-correlation-analysis-in-r
I wrote and executed various tests aimed at testing and breaking stats::cancor(), but it held up very well. Some findings:
- [...] x variables had no impact on the transform;
- [...] x changes the transform, but does not crash cancor();
- [...] defaultMethod all to "pmm" and running all tests did not reveal problems related to cancor();
- [...] method = "pmm" and running all tests did not reveal problems related to cancor();
All in all, I believe that cancor() handles crappy data quite well. Lack of robustness does not seem to be an issue that should hold up the merge into the mice master branch.
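For readers who want to poke at this themselves, a probe of this kind might look like the sketch below. The data are simulated and the check is illustrative only; it is not one of the tests actually run for this PR, and the names x_dup and ydummy are made up here. It feeds cancor() an exactly collinear predictor matrix and compares the canonical correlations with the baseline.

```r
set.seed(2)
n <- 100

# Simulated predictors and a three-level factor
x <- cbind(a = rnorm(n), b = rnorm(n))
y <- factor(sample(c("p", "q", "r"), n, replace = TRUE))

# Full dummy matrix: one 0/1 column per category
# (rank-deficient once cancor() centres the columns)
ydummy <- model.matrix(~ y - 1)

# Baseline canonical correlation analysis
cc <- stats::cancor(x, ydummy)

# Duplicate a predictor so that x is exactly collinear
x_dup <- cbind(x, a2 = x[, "a"])
cc_dup <- stats::cancor(x_dup, ydummy)

# cancor() works on the QR rank of each matrix, so the redundant
# column should be dropped and the correlations left unchanged
all.equal(cc$cor, cc_dup$cor)
rownames(cc_dup$xcoef)
```

The same pattern can be extended to other degenerate inputs, such as very sparse categories.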
Predictive mean matching (PMM) is the default method of mice for imputing numerical variables, but it has long been possible to impute factors. This PR introduces better support to work with categorical variables in PMM.
The former system worked as follows: If we specify PMM for an unordered factor, then the similarity among potential donors is expressed on the linear predictor, and we take the observed category of a random draw among the five closest donor cases. As the linear predictor summarizes the available predictive information, matching should produce reasonable imputations. This method is fast and robust against empty cell and fitting problems. The downside is that it depends on category order. In particular, in mice.impute.pmm() we have the shortcut:
The order of integers in ynum may have no sensible interpretation for an unordered factor. The problem is less likely to surface for ordered factors, though there is still the assumption that the categories are equidistant.
The new system quantifies ynum and could yield better results because of higher $R^2$. The PR follows a similar strategy as Frank Harrell's function Hmisc::aregImpute(). The method calculates the canonical correlation between y (as dummy matrix) and a linear combination of imputation model predictors x. Similar methods are known as MORALS (Gifi, 1980) or ACE (Breiman and Friedman, 1985). The algorithm then replaces each category of y by a single number taken from the first canonical variate. After this step, the imputation model is fitted, and the predicted values from that model are extracted to function as the similarity measure for the matching step.
The method works for both ordered and unordered factors. No special precautions are taken to ensure monotonicity between the category numbers and the quantifications, so the method should be able to preserve quadratic and other non-monotone relations of the predicted metric. It may be beneficial to remove very sparsely filled categories, for which there is a new trim argument.
Potential advantages are:
Note that we still lack solid evidence for these claims.
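The quantification step described in this PR text can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the implementation in mice: the data are simulated, the integer coding via as.integer() is assumed to be what the shortcut amounts to, and the names ynum_int, ynum_q and w are hypothetical.

```r
set.seed(3)
n <- 200

# Simulated predictors and a factor that depends on them
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
latent <- x[, "x1"] + 0.5 * x[, "x2"] + rnorm(n, sd = 0.5)
grp <- cut(latent, breaks = c(-Inf, -0.7, 0.7, Inf),
           labels = c("lo", "mid", "hi"))

# Store the levels in an order unrelated to the latent scale
y <- factor(as.character(grp), levels = c("mid", "hi", "lo"))

# Integer coding (assumed equivalent of the old shortcut)
ynum_int <- as.integer(y)

# Quantification: canonical correlation between the dummy matrix and x
ydummy <- model.matrix(~ y - 1)
colnames(ydummy) <- levels(y)
cc <- stats::cancor(x, ydummy)

# cancor() returns weights only for linearly independent dummy columns;
# the remaining category implicitly gets weight 0
w <- setNames(numeric(nlevels(y)), levels(y))
w[rownames(cc$ycoef)] <- cc$ycoef[, 1]

# Replace each category by its score on the first canonical variate
ynum_q <- unname(w[as.character(y)])

# Fit the imputation model on both codings and compare R^2
r2_int <- summary(lm(ynum_int ~ x))$r.squared
r2_q <- summary(lm(ynum_q ~ x))$r.squared

# Predicted values that would serve as the matching metric
yhat <- fitted(lm(ynum_q ~ x))
```

By construction r2_q equals the squared first canonical correlation and cannot be lower than r2_int, since the integer coding is just one particular linear combination of the dummy columns.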
Here are some examples for the new functionality.
Created on 2023-08-07 with reprex v2.0.2