amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
441 stars 107 forks source link

remove.lindep() can lead to inconsistent predictorMatrix printing in the mids object #306

Open gerkovink opened 3 years ago

gerkovink commented 3 years ago

There seems to be a reporting inconsistency when the predictorMatrix after imputation seems to indicate that a variable is included as a predictor, while remove.lindep() has effectively removed it. This leads to biased results due to an uncongenial imputation model, but may lull the user into thinking that the model is correctly specified. See below, where imp1 is wrong and imp2 is correct, but they have the same predictorMatrix printed.

library(mice, warn.conflicts = FALSE)
library(dplyr) # data manipulation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(MASS) # for mvrnorm()
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
set.seed(123)
data <- mvrnorm(2000, mu = c(1, 2), Sigma = matrix(c(4, 4.8, 4.8, 9), 2, 2))
x <- data[, 1]
y <- data[, 2]
data <- data.frame(
  x = x,
  y = y,
  z = y 
)

# true parameters
lm(y ~ x + z, data = data)
#> 
#> Call:
#> lm(formula = y ~ x + z, data = data)
#> 
#> Coefficients:
#> (Intercept)            x            z  
#>   2.302e-15    3.804e-15    1.000e+00

# make data incomplete cf. MNAR based on z (which is copy of y)
incomplete.data <- ampute(data, 
                          prop = 0.7, 
                          mech = "MAR", 
                          type = "RIGHT", 
                          patterns = c(1, 0, 1), 
                          weights = c(0, 0, 1))$amp

# imputation without collinear
imp <- mice(incomplete.data, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE)
#> Warning: Number of logged events: 1
imp$pred
#>   x y z
#> x 0 0 1
#> y 0 0 0
#> z 1 0 0
with(imp, lm(y ~ x + z)) %>% pool %>% summary
#>          term      estimate    std.error     statistic       df    p.value
#> 1 (Intercept) -1.766515e-16 1.005331e-16 -1.757149e+00 562.9535 0.07943602
#> 2           x -1.401097e-15 8.113521e-17 -1.726867e+01 562.9535 0.00000000
#> 3           z  1.000000e+00 5.678926e-17  1.760896e+16 562.9535 0.00000000

# keep all collinear columns
imp1 <- mice(incomplete.data, 
            remove.collinear = "FALSE", 
            threshold = 1.1, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE)
imp1$pred
#>   x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp1, lm(y ~ x + z)) %>% pool %>%  summary
#>          term    estimate  std.error  statistic         df      p.value
#> 1 (Intercept) -0.08736438 0.11818690 -0.7392053   5.495134 4.901374e-01
#> 2           x  0.76055988 0.04452898 17.0801106  17.205011 3.159029e-12
#> 3           z  0.28608590 0.02302393 12.4255868 226.274361 0.000000e+00

# bypass lindep
imp2 <- mice(incomplete.data, 
            remove.collinear = "FALSE", 
            threshold = 1.1, 
            ls.meth = "qr", 
            meth = "norm", 
            print = FALSE, 
            eps = 0)
imp2$pred
#>   x y z
#> x 0 1 1
#> y 1 0 1
#> z 1 1 0
with(imp2, lm(y ~ x + z)) %>% pool %>% summary
#>          term     estimate    std.error    statistic       df     p.value
#> 1 (Intercept) 1.408743e-15 8.607085e-16 1.636725e+00 4.091456 0.175436545
#> 2           x 2.853085e-15 4.895656e-16 5.827789e+00 4.322090 0.003384635
#> 3           z 1.000000e+00 3.420807e-16 2.923286e+15 4.267820 0.000000000

Created on 2021-01-21 by the reprex package (v0.3.0)

stefvanbuuren commented 3 years ago

Thanks for pointing out this problem.

As far as I can see, there is a unwanted interaction between the regularisation techniques you've implemented in mice.impute.norm() and the separate internal function remove.lindep(). The latter "wins" because it runs before mice.impute.norm(), so mice.impute.norm() never gets to see the problematic data.

I once created remove.lindep() as a hack because mice() frequently produced error messages like Lapack routine dgesv: system is exactly singular, Error in solve.default(xtx + diag(pen)) and similar. The code of remove.lindep() is the worst part of mice (slow and naive), but it is being run for every variable at every iteration. If you have many variables, remove.lindep() may easily take 90% of the processor time used by mice().

So let's make a plan to retire remove.lindep().

I did an experiment by setting eps = 0 as default, and rebuilt the package. This introduces a few new error messages, but they seems all solvable. We also need to introduce additional tests for less-frequently used imputation methods, and investigate how they behave under duplicate/highly correlated variables.

awmercer commented 1 year ago

I'm not sure where this particular issue is on the roadmap, but I am working on some code to do high-dimensional mass imputation using a reimplementation of mice.impute.cart() built on the ranger package instead of rpart. That function zips, but I've done some profiling and remove.lindep() is just a huge computational drain that's basically irrelevant to those kinds of models. Passing in eps = 0 helps a lot, but even with that, the following x <- x[keep, , drop = FALSE] is nontrivial with large datasets.

I realize my scenario might be an edge case, and I can hack around this with a fork or maybe replacing sampler.univ() with my own version at runtime. But having come across this issue, I figured I'd put in a plug for ditching remove.lindep() and see if there were any plans to scrap it in any upcoming releases.

stefvanbuuren commented 1 year ago

Yes, retiring remove.lindep() is still on the agenda, but as yet there is no a generic remover() function. The remover should be able to

The implementation will need to be integrated well into various parts in mice, and eliminate the need for esoteric and ad-hoc flags like remove.collinear, eps, threshold, ridge and the like.

Adding a remover is separate from upgrading mice.impute.cart() with ranger for speed. I would welcome a pull request along the same lines as the updragde for ranger in mice.impute.rf(). I would like to see evidence that ranger has the same statistical properties as cart, and does not cut corners in the method.

awmercer commented 1 year ago

I'll have to look at how the ranger option in mice.impute.rf() is implemented. If ranger has already been added to mice.impute.rf(), the functionality might already exist and it could just be a matter of supplying the right arguments to ranger to make it fit a single tree with behavior similar to rpart.

ranger's algorithm isn't exactly the same as rpart for a single tree (e.g. there is no cp parameter), and so far in my own testing, I've found it to be a little more variable than mice.impute.cart but much faster. I'm still testing, but once I'm satisfied that everything works as intended, I'd be happy to submit a pull request.

stefvanbuuren commented 1 year ago

Did some experiments to bypass remove.lindep() in two examples with data problems.

library(mice, warn.conflicts = FALSE)

imp <- mice(selfreport, m = 1, maxit = 1, print = FALSE)
#> Warning: Number of logged events: 7
imp$loggedEvents
#>   it im dep     meth                        out
#> 1  0  0     constant                        pop
#> 2  0  0     constant                        prg
#> 3  1  1  hm      pmm srcmgg, etnUnknown, webYes
#> 4  1  1  wm      pmm srcmgg, etnUnknown, webYes
#> 5  1  1 edu  polyreg     srcmgg, wr, etnUnknown
#> 6  1  1 etn  polyreg                 srcmgg, wr
#> 7  1  1  bm      pmm srcmgg, etnUnknown, webYes

imp <- mice(selfreport, m = 1, maxit = 1, print = FALSE, eps = 0)
#> Error in solve.default(xtx + diag(pen)): Lapack routine dgesv: system is exactly singular: U[13,13] = 0

imp <- mice(boys[boys$gen != "G5", ], m = 1, maxit = 1, print = FALSE)
#> Warning: Number of logged events: 8
imp$loggedEvents
#>   it im dep    meth   out
#> 1  1  1 age     pmm gen.L
#> 2  1  1 hgt     pmm gen.L
#> 3  1  1 wgt     pmm gen.L
#> 4  1  1 bmi     pmm gen.L
#> 5  1  1  hc     pmm gen.L
#> 6  1  1 phb    polr gen.L
#> 7  1  1  tv     pmm gen.L
#> 8  1  1 reg polyreg gen.L

imp <- mice(boys[boys$gen != "G5", ], m = 1, maxit = 1, print = FALSE, eps = 0)
#> Warning: Number of logged events: 6
imp$loggedEvents
#>   it im dep meth
#> 1  1  1 age  pmm
#> 2  1  1 hgt  pmm
#> 3  1  1 wgt  pmm
#> 4  1  1 bmi  pmm
#> 5  1  1  hc  pmm
#> 6  1  1  tv  pmm
#>                                                                                                                                                                                                                                                        out
#> 1 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 2 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 3 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 4 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 5 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 6 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?

Created on 2023-07-22 with reprex v2.0.2

In the selfreport example, there is an unused third category that causes problems. remove.lindep() removes the category EtnUnknown when etn is used as predictor. Removing separate categories effectively recodes the omitted category to the reference category.

Setting eps = 0 bypasses remove.lindep(), and crashes estimice() because the estimate for the EtnUnkown category is NA. We need a fix that is robust against this type of data problem.

I tried to reproduce the problem in another example, the boys data with one category removed. The behaviour is different though. remove.lindep() removes the category, as before. Setting eps = 0 signals a problem for method pmm, as intended through estimice.

It is yet not quite clear 1) what is different from the selfreport data, and 2) how polr and polyreg handle the problem.