amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Major update that improves support for formulas specification #582

Open stefvanbuuren opened 1 year ago

stefvanbuuren commented 1 year ago

Ideas for further development:

stefvanbuuren commented 1 year ago

Commits 5c6bee2 and 755c23a generalise the classic behaviour of the predictorMatrix to blocks.

It works as follows:

This PR also removes the error message "mice detected constant and/or collinear variables. No predictors were left after their removal." Imputations will be generated without predictors by the intercept-only imputation model (not recommended in general).

WARNING: Setting predictorMatrix[v, ] <- 0 does not prevent imputation of variable v. To prevent imputation of v, specify the appropriate entry of method as "".
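The difference between the two settings can be sketched as follows. This is only an illustration, assuming the nhanes example data that ships with mice and arbitrary values for m, maxit and seed; the second call also zeroes the bmi column so that the unimputed variable is not used as a predictor elsewhere.

```r
library(mice)

# nhanes ships with mice: age is complete; bmi, hyp and chl are incomplete.
pred <- make.predictorMatrix(nhanes)
meth <- make.method(nhanes)

# Zeroing the row removes all predictors of bmi,
# but bmi keeps its imputation method and is still imputed.
pred["bmi", ] <- 0
imp1 <- mice(nhanes, predictorMatrix = pred, method = meth,
             m = 2, maxit = 2, seed = 1, printFlag = FALSE)

# An empty method is what actually skips imputation of bmi.
meth["bmi"] <- ""
# Also stop using bmi as a predictor, so its NAs cannot propagate.
pred[, "bmi"] <- 0
imp2 <- mice(nhanes, predictorMatrix = pred, method = meth,
             m = 2, maxit = 2, seed = 1, printFlag = FALSE)

# Compare the completed bmi column under both specifications.
anyNA(complete(imp1)$bmi)
anyNA(complete(imp2)$bmi)
```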

stefvanbuuren commented 1 year ago

Commit c2da03c cleans up the internal function edit.setup(). It returns the proper formulas of the reduced model, but it is not quite right for meth, vis and post. Added FIXME.

stefvanbuuren commented 1 year ago

New behaviours

  1. Prevention of NA propagation by removing incomplete predictors. This version detects when a predictor contains missing values that are not imputed. To prevent NA propagation, mice() 1) removes the incomplete predictor(s) from the RHS, and 2) adds the incomplete predictor(s) to formulas (as var ~ 1) and to the block components, sets method[var] = "", and sets the corresponding predictorMatrix column and row to zero.

  2. The predictorMatrix input can be a square submatrix of the full predictorMatrix. mice() will augment predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. The inactive variables will have zero columns and rows.

  3. The predictorMatrix input may be unnamed if its size is p * p. For any other size, an unnamed matrix generates an error.
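The augmentation described in item 2 can be pictured with a small base R sketch. The helper augment_pred() and the variable names are hypothetical and only illustrate the idea; this is not the code used inside mice().

```r
# Column names of a hypothetical data set with p = 4 variables.
vars <- c("age", "bmi", "hyp", "chl")

# User-supplied named 2 x 2 submatrix: bmi is predicted from chl and vice versa.
sub <- matrix(c(0, 1,
                1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("bmi", "chl"), c("bmi", "chl")))

# Hypothetical helper: embed a named square submatrix in a p x p zero matrix.
augment_pred <- function(sub, vars) {
  stopifnot(!is.null(dimnames(sub)),
            identical(rownames(sub), colnames(sub)),
            all(rownames(sub) %in% vars))
  full <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
  full[rownames(sub), colnames(sub)] <- sub
  full
}

full <- augment_pred(sub, vars)
print(full)  # rows and columns of the inactive variables age and hyp are zero
```

Matching on names rather than positions is why an unnamed matrix can only be accepted when it already has the full p * p size.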

Changes

stefvanbuuren commented 1 year ago

Exit checks added:

stefvanbuuren commented 1 year ago

New behaviours and features thus far

  1. TWO SEPARATE INTERFACES FOR MODEL SPECIFICATION: This version promotes two interfaces to specify imputation models: predictor (predictorMatrix + parcel + method) and formula (formulas + method). This version no longer accepts mixes of predictorMatrix and formulas arguments in the call to mice().

  2. NA-PROPAGATION PREVENTION. This version detects when a predictor contains missing values that are not imputed. In order to prevent NA propagation, mice() can follow two strategies: "Autoremove" (remove incomplete predictor(s) from the RHS, set method to "", adapt predictorMatrix, formulas and blocks, write to loggedEvents), or "Autoimpute" (impute the incomplete predictor and adapt method, predictorMatrix, formulas, and so on). "Autoremove" is implemented and is the current default. Use mice(..., autoremove = FALSE) to revert to the old behavior (NA propagation).

  3. SUBMODELS: The predictorMatrix input can be a square submatrix of the full predictorMatrix when its dimensions are named. mice() will augment the tiny predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. Unmentioned variables are not imputed, and the predictorMatrix, formulas and method are adapted accordingly.

  4. DROP NON-SQUARE PREDICTOR MATRIX: Version 3.0 introduced non-square predictor matrices, but their interpretation turned out to be complex and ambiguous. For clarity, this update works with a predictor matrix that is square, with both dimensions identically named with the names of the variables in the data. Variable groups are now specified through the parcel argument.

  5. NEW PARCEL ARGUMENT. There is a new parcel argument that is easier to use. The print of the mids object shows parcel when it is different from the default. parcel can take over the role of blocks in specification. blocks is soft-deprecated, but still widely used within the program code.

  6. NEW DOTS ARGUMENT. The blots argument is renamed to dots.

  7. EXIT VALIDATION: Adds a new validate.mids() function that checks the mids object before exit.
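The detection step behind item 2 can be approximated in base R. The sketch below uses toy data and hypothetical object names (dat, culprits); it shows the "Autoremove" idea only and is not the implementation in this PR.

```r
# Toy data: age is complete, bmi and chl are incomplete.
dat <- data.frame(age = c(1, 2, 3, 1),
                  bmi = c(22.1, NA, 27.4, NA),
                  chl = c(187, 204, NA, 199))
vars <- names(dat)

# Full predictor matrix: every variable predicts every other variable.
pred <- matrix(1, 3, 3, dimnames = list(vars, vars))
diag(pred) <- 0

# chl is left unimputed (method ""), yet it is still a predictor of bmi.
meth <- c(age = "", bmi = "pmm", chl = "")

# A predictor causes NA propagation when it is incomplete, not imputed,
# and used in the model of at least one variable that is imputed.
incomplete <- vapply(dat, anyNA, logical(1))
used <- colSums(pred[meth != "", , drop = FALSE]) > 0
culprits <- vars[incomplete & meth == "" & used]
print(culprits)

# "Autoremove"-style repair: zero the column and row of each culprit.
pred[, culprits] <- 0
pred[culprits, ] <- 0
print(pred)
```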

stefvanbuuren commented 1 year ago

Three proposed changes to new behaviour

  1. NA-PROPAGATION. It is better to use NA-PROPAGATION by default. The reason is that the user becomes aware of a potential model specification problem (e.g. not imputing a variable used as a predictor). mice() should offer two easy ways to solve the problem: "autoremove" and "autoimpute". We prefer the NA-PROPAGATION default because it alerts the user, whereas the other two options would "magically" make the problem disappear (and thereby downgrade model specification hygiene).

  2. The formula of a complete variable is now something like age ~ 1. It is better to use age ~ 0, to signal that not even the intercept-only model is used for the dependent variable.

  3. The formulas argument returns with an environment attached to each formula. This environment does not seem to be necessary in mice(), so it is cleaner to remove the environment.
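Proposals 2 and 3 rest on standard R formula behaviour, which can be checked in base R. The snippet below is a small illustration with hypothetical object names (f1, f0); it makes no claim about how mice() will implement either change.

```r
# Intercept-only versus empty right-hand side.
f1 <- age ~ 1
f0 <- age ~ 0

attr(terms(f1), "intercept")  # 1: an intercept-only model is specified
attr(terms(f0), "intercept")  # 0: not even an intercept is specified

# A formula stores the environment it was created in as an attribute.
environment(f1)

# Dropping that attribute leaves a plain formula without an environment.
attr(f0, ".Environment") <- NULL
is.null(environment(f0))  # TRUE
inherits(f0, "formula")   # TRUE: still a formula object
all.vars(f0)              # "age"
```

Functions that evaluate a formula without an explicit data argument look up variables in this environment, so stripping it is only safe where the data are always passed explicitly.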