gerkovink opened this issue 3 years ago
Thanks for pointing out this problem.
As far as I can see, there is an unwanted interaction between the regularisation techniques you've implemented in mice.impute.norm() and the separate internal function remove.lindep(). The latter "wins" because it runs before mice.impute.norm(), so mice.impute.norm() never gets to see the problematic data.
I once created remove.lindep() as a hack because mice() frequently produced error messages like Lapack routine dgesv: system is exactly singular, Error in solve.default(xtx + diag(pen)) and similar. The code of remove.lindep() is the worst part of mice (slow and naive), but it is being run for every variable at every iteration. If you have many variables, remove.lindep() may easily take 90% of the processor time used by mice().
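For anyone who wants to check this on their own data, a quick profile along the lines of the sketch below (synthetic data, not from this issue) makes it easy to see whether remove.lindep() dominates:

library(mice, warn.conflicts = FALSE)

# synthetic wide data: 100 variables, ~10% missing values spread over all columns
set.seed(1)
wide <- matrix(rnorm(500 * 100), ncol = 100)
wide[sample(length(wide), 0.1 * length(wide))] <- NA
wide <- as.data.frame(wide)

Rprof("mice-profile.out")
imp <- mice(wide, m = 1, maxit = 2, printFlag = FALSE)
Rprof(NULL)

# look for remove.lindep among the most expensive calls
head(summaryRprof("mice-profile.out")$by.total)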
So let's make a plan to retire remove.lindep().
I did an experiment by setting eps = 0 as the default, and rebuilt the package. This introduces a few new error messages, but they all seem solvable. We also need to introduce additional tests for less frequently used imputation methods, and investigate how they behave under duplicate/highly correlated variables.
I'm not sure where this particular issue is on the roadmap, but I am working on some code to do high-dimensional mass imputation using a reimplementation of mice.impute.cart() built on the ranger package instead of rpart. That function zips, but I've done some profiling, and remove.lindep() is just a huge computational drain that's basically irrelevant to those kinds of models. Passing in eps = 0 helps a lot, but even with that, the following x <- x[keep, , drop = FALSE] is nontrivial with large datasets.
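As a rough standalone illustration (my own numbers, not mice code): even when keep selects every row, that subsetting step allocates a full copy of the design matrix on each call, which is what hurts with large data.

n <- 5e5; p <- 100
x <- matrix(rnorm(n * p), nrow = n)          # ~400 MB design matrix
keep <- rep(TRUE, nrow(x))                   # hypothetical keep vector that drops nothing

system.time(x2 <- x[keep, , drop = FALSE])   # still makes a complete copy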
I realize my scenario might be an edge case, and I can hack around this with a fork or maybe by replacing sampler.univ() with my own version at runtime. But having come across this issue, I figured I'd put in a plug for ditching remove.lindep() and see if there were any plans to scrap it in any upcoming releases.
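For the record, that kind of run-time replacement could look roughly like the sketch below (brittle, version-dependent, and only meant to illustrate the idea; the actual edit to the function body is left out):

library(mice, warn.conflicts = FALSE)

patched <- mice:::sampler.univ
# ... modify body(patched) here to skip the remove.lindep()/subsetting step ...

# swap the patched version into the mice namespace for this session
assignInNamespace("sampler.univ", patched, ns = "mice")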
Yes, retiring remove.lindep() is still on the agenda, but as yet there is no generic remover() function. The remover should be able to
The implementation will need to be integrated well into various parts of mice, and should eliminate the need for esoteric and ad-hoc flags like remove.collinear, eps, threshold, ridge and the like.
Adding a remover is separate from upgrading mice.impute.cart() with ranger for speed. I would welcome a pull request along the same lines as the upgrade for ranger in mice.impute.rf(). I would like to see evidence that ranger has the same statistical properties as cart, and does not cut corners in the method.
I'll have to look at how the ranger option in mice.impute.rf() is implemented. If ranger has already been added to mice.impute.rf(), the functionality might already exist, and it could just be a matter of supplying the right arguments to ranger to make it fit a single tree with behavior similar to rpart.
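Something along the following lines might be a starting point (a sketch only; the parameter choices are guesses at rpart-like behaviour, not a confirmed recipe, and iris is just a stand-in dataset):

library(ranger)

fit <- ranger(
  Species ~ ., data = iris,
  num.trees       = 1,                # a single tree, as in mice.impute.cart()
  mtry            = ncol(iris) - 1,   # consider all predictors at each split
  replace         = FALSE,            # no bootstrap resampling
  sample.fraction = 1,                # grow the tree on the full sample
  min.node.size   = 5                 # small terminal nodes, cart-style minbucket
)

head(predict(fit, iris)$predictions)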
ranger's algorithm isn't exactly the same as rpart's for a single tree (e.g. there is no cp parameter), and so far in my own testing I've found it to be a little more variable than mice.impute.cart(), but much faster. I'm still testing, but once I'm satisfied that everything works as intended, I'd be happy to submit a pull request.
Did some experiments to bypass remove.lindep() in two examples with data problems.
library(mice, warn.conflicts = FALSE)
imp <- mice(selfreport, m = 1, maxit = 1, print = FALSE)
#> Warning: Number of logged events: 7
imp$loggedEvents
#> it im dep meth out
#> 1 0 0 constant pop
#> 2 0 0 constant prg
#> 3 1 1 hm pmm srcmgg, etnUnknown, webYes
#> 4 1 1 wm pmm srcmgg, etnUnknown, webYes
#> 5 1 1 edu polyreg srcmgg, wr, etnUnknown
#> 6 1 1 etn polyreg srcmgg, wr
#> 7 1 1 bm pmm srcmgg, etnUnknown, webYes
imp <- mice(selfreport, m = 1, maxit = 1, print = FALSE, eps = 0)
#> Error in solve.default(xtx + diag(pen)): Lapack routine dgesv: system is exactly singular: U[13,13] = 0
imp <- mice(boys[boys$gen != "G5", ], m = 1, maxit = 1, print = FALSE)
#> Warning: Number of logged events: 8
imp$loggedEvents
#> it im dep meth out
#> 1 1 1 age pmm gen.L
#> 2 1 1 hgt pmm gen.L
#> 3 1 1 wgt pmm gen.L
#> 4 1 1 bmi pmm gen.L
#> 5 1 1 hc pmm gen.L
#> 6 1 1 phb polr gen.L
#> 7 1 1 tv pmm gen.L
#> 8 1 1 reg polyreg gen.L
imp <- mice(boys[boys$gen != "G5", ], m = 1, maxit = 1, print = FALSE, eps = 0)
#> Warning: Number of logged events: 6
imp$loggedEvents
#> it im dep meth
#> 1 1 1 age pmm
#> 2 1 1 hgt pmm
#> 3 1 1 wgt pmm
#> 4 1 1 bmi pmm
#> 5 1 1 hc pmm
#> 6 1 1 tv pmm
#> out
#> 1 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 2 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 3 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 4 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 5 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 6 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
Created on 2023-07-22 with reprex v2.0.2
In the selfreport example, there is an unused third category that causes problems. remove.lindep() removes the category etnUnknown when etn is used as a predictor. Removing separate categories effectively recodes the omitted category into the reference category.
Setting eps = 0 bypasses remove.lindep(), and crashes estimice() because the estimate for the etnUnknown category is NA. We need a fix that is robust against this type of data problem.
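A small, self-contained illustration of that mechanism (a constructed toy example, not taken from the reprex above): an empty factor level produces an all-zero dummy column, the model becomes rank-deficient, and lm() reports an NA coefficient for that level; dropping the dummy instead silently merges the empty level with the reference category.

set.seed(1)
d <- data.frame(
  y   = rnorm(100),
  etn = factor(sample(c("NL", "Turkish", "Unknown"), 100, replace = TRUE))
)
d <- d[d$etn != "Unknown", ]   # level "Unknown" keeps its slot but has zero rows

coef(lm(y ~ etn, data = d))    # the etnUnknown coefficient comes back as NA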
I tried to reproduce the problem in another example, the boys data with one category removed. The behaviour is different though. remove.lindep() removes the category, as before. Setting eps = 0 signals a problem for method pmm, as intended, via estimice().
It is not yet quite clear 1) what is different from the selfreport data, and 2) how polr and polyreg handle the problem.
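One quick way to probe question 2) outside of mice is sketched below (relying on the fact that mice.impute.polyreg() builds on nnet::multinom() and mice.impute.polr() on MASS::polr(); the data are made up):

library(MASS)
library(nnet)

set.seed(1)
d <- data.frame(
  out = factor(sample(c("city", "east", "west"), 200, replace = TRUE)),
  gen = factor(sample(paste0("G", 1:5), 200, replace = TRUE))
)
d <- d[d$gen != "G5", ]   # G5 keeps its (now empty) factor level

fit_polyreg <- try(multinom(out ~ gen, data = d, trace = FALSE))
fit_polr    <- try(polr(out ~ gen, data = d))

# do the engines error, warn, or silently return something for genG5?
if (!inherits(fit_polyreg, "try-error")) coef(fit_polyreg)
if (!inherits(fit_polr, "try-error")) coef(fit_polr)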
There seems to be a reporting inconsistency: the predictorMatrix after imputation indicates that a variable is included as a predictor, while remove.lindep() has effectively removed it. This leads to biased results due to an uncongenial imputation model, but may lull the user into thinking that the model is correctly specified. See below, where imp1 is wrong and imp2 is correct, but they print the same predictorMatrix.

Created on 2021-01-21 by the reprex package (v0.3.0)
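The reprex output itself is not reproduced above, but the kind of check involved can be sketched on the selfreport example from earlier in the thread (a constructed illustration; the imp1/imp2 objects referred to in the text are not shown here):

library(mice, warn.conflicts = FALSE)

imp <- mice(selfreport, m = 1, maxit = 1, printFlag = FALSE, seed = 1)

imp$predictorMatrix["hm", ]              # etn still reported as a predictor of hm
subset(imp$loggedEvents, dep == "hm")    # while the etnUnknown dummy was removed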