Major update that improves support for formulas specification

stefvanbuuren commented 1 year ago

reintroduces the square predictorMatrix
defines conversion functions p2f(), p2c(), f2p(), n2b(), b2n()
defines validate.blocks(), validate.predictorMatrix()
extends edit.setup() to formulas and blots
for reading ease, use ~ 1 for the empty predictor set instead of ~ 0
does not automatically set method = "" for variables that are not imputed (NOTE: DECISION REVERTED. SEE BELOW)
as far as possible, changes the leading argument to formulas (instead of blocks or predictorMatrix)
adds function typecodes() in sampler() to reduce multiple predictorMatrix lines to one (support for multivariate imputation methods)
implements new logic in sampler.univ()
comments out some tests that depend on hard-coded parameter estimates
sharpens test for equality between predictorMatrix and formulas specifications
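
To illustrate the equivalence between the two specifications, here is a minimal base-R sketch of how one row of a square predictorMatrix maps to one formula, with ~ 1 standing in for the empty predictor set. The helper name pm_to_formulas_sketch() and the toy matrix are hypothetical, for illustration only. This is not the p2f() implementation from this PR.

```r
# Hypothetical helper (not part of mice): turn each row of a square,
# identically named predictor matrix into a formula.
pm_to_formulas_sketch <- function(pred) {
  stopifnot(identical(rownames(pred), colnames(pred)))
  out <- lapply(rownames(pred), function(v) {
    rhs <- colnames(pred)[pred[v, ] != 0]
    # Empty predictor set is written as ~ 1
    if (length(rhs) == 0) rhs <- "1"
    reformulate(rhs, response = v)
  })
  names(out) <- rownames(pred)
  out
}

# Toy 3 x 3 predictor matrix (assumed variable names)
vars <- c("age", "bmi", "hyp")
pred <- matrix(c(0, 1, 1,
                 1, 0, 0,
                 0, 0, 0),
               nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

formulas <- pm_to_formulas_sketch(pred)
```

Each non-zero entry in row v becomes a term on the right-hand side of the formula for v, so a row of zeros yields v ~ 1.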

stefvanbuuren commented 1 year ago

Ideas for further development:

add new functions to YAML so that they appear on the site
soft replace of blocks by nest (character vector with length ncol(data) with block names. The default is colnames(data))
Provide a way for the user to see head of design matrix created in sampler.univ(). Add examples that exploit formulas to add interactions, nested variables, by-processing and other advanced models
Describe differences and equivalences between predictorMatrix and formulas specification
...

stefvanbuuren commented 1 year ago

In preparation to tweaking documentation, converts Rd tags to roxygen2 tags.
Adds new functions to YAML

stefvanbuuren commented 1 year ago

Commits 5c6bee2 and 755c23a generalise the classic behaviour of the predictorMatrix to blocks.

It works as follows:

mice() uses the nimp() function to calculate the number of imputations needed for a given block of variables;
if the number of needed imputations in block j is zero, the following happens: 1) mice() sets method[j] <- ""; 2) mice() sets predictorMatrix[v, ] <- 0 for all variables v in block j.

This PR also removes the error message "mice detected constant and/or collinear variables. No predictors were left after their removal." Imputations will be generated without predictors by the intercept-only imputation model (not recommended in general).

WARNING: Setting predictorMatrix[v, ] <- 0 does not prevent imputation of variable v. To prevent imputation of v, specify the appropriate entry of method as "".
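
The block-level rule described in this comment can be sketched in base R as follows. This is a rough illustration under assumptions: the toy data, the block names and the loop are hypothetical, and the count of missing cells stands in for what nimp() computes. It is not the code from the commits above.

```r
# Toy data: age is complete, bmi and chl have missing values
data <- data.frame(age = c(1, 2, 3, 1),
                   bmi = c(22.1, NA, 25.3, NA),
                   chl = c(187, NA, 199, 204))

# Assumed block structure and starting specification
blocks <- list(age = "age", body = c("bmi", "chl"))
method <- c(age = "pmm", body = "pmm")
vars <- names(data)
pred <- 1 - diag(length(vars))
dimnames(pred) <- list(vars, vars)

for (j in names(blocks)) {
  # Number of cells to impute in block j (stand-in for nimp())
  n_needed <- sum(is.na(data[, blocks[[j]], drop = FALSE]))
  if (n_needed == 0) {
    method[j] <- ""            # block is not imputed
    pred[blocks[[j]], ] <- 0   # zero the rows of all variables in the block
  }
}
```

Only the rows are zeroed, so the variables in a skipped block can still serve as predictors for other blocks.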

stefvanbuuren commented 1 year ago

Commit c2da03c cleans up the internal function edit.setup(). It returns the proper formulas of the reduced model, but it is not quite right for meth, vis and post. Added FIXME.

stefvanbuuren commented 1 year ago

New behaviours

Prevention of NA propagation by removing incomplete predictors. This version detects when a predictor contains missing values that are not imputed. To prevent NA propagation, mice() takes the following actions: 1) removes incomplete predictor(s) from the RHS, 2) adds incomplete predictor(s) to formulas (var ~ 1) and block components, sets method[var] = "", and sets the predictorMatrix column and row to zero.
The predictorMatrix input can be a square submatrix of the full predictorMatrix. mice() will augment predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. The inactive variables will have zero columns and rows.
The predictorMatrix input may be unnamed if its size is p * p. For any other size, an unnamed matrix generates an error.
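
The augmentation of a named square submatrix to the full p * p matrix could look roughly like the sketch below. The function name augment_sketch() and the example variable names are hypothetical. It only illustrates the idea and is not the code in this PR.

```r
# Hypothetical helper: embed a named square submatrix into a p * p
# matrix of zeros whose dimnames are the column names of the data.
augment_sketch <- function(sub, data_names) {
  stopifnot(!is.null(dimnames(sub)),
            identical(rownames(sub), colnames(sub)),
            all(rownames(sub) %in% data_names))
  p <- length(data_names)
  full <- matrix(0, p, p, dimnames = list(data_names, data_names))
  full[rownames(sub), colnames(sub)] <- sub
  full
}

data_names <- c("age", "bmi", "hyp", "chl")

# Tiny predictorMatrix: only bmi and chl are mentioned
sub <- matrix(c(0, 1,
                1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("bmi", "chl"), c("bmi", "chl")))

full <- augment_sketch(sub, data_names)
```

The unmentioned variables (age and hyp here) end up with zero rows and zero columns, matching the description of inactive variables.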

Changes

Adds support for a tiny predictorMatrix
Solves bug in f2p()
Adds new function remove.rhs.variables()
Adds a validate.mids() check at exit that errors if rownames(predictorMatrix) differ from colnames(data). Some more output tests need to be added.
Removes code designed to work specifically with a non-square predictorMatrix
Generates an error if predictorMatrix has fewer rows than length of blocks

stefvanbuuren commented 1 year ago

Exit checks added:

rownames(predictorMatrix) must match colnames(data)
length of formulas and blocks must be equal
length of formulas and method must be equal
length of method vector cannot exceed number of variables
length of imp and number of variables must be equal
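
As a rough picture of what such exit checks amount to, the sketch below runs the five conditions on a plain list that mimics a few mids components. The function exit_problems_sketch() and the mock object are hypothetical. The actual validate.mids() may be organised differently.

```r
# Hypothetical checker: returns the descriptions of all failed checks
exit_problems_sketch <- function(obj) {
  p <- ncol(obj$data)
  checks <- c(
    "rownames(predictorMatrix) must match colnames(data)" =
      identical(rownames(obj$predictorMatrix), colnames(obj$data)),
    "length of formulas and blocks must be equal" =
      length(obj$formulas) == length(obj$blocks),
    "length of formulas and method must be equal" =
      length(obj$formulas) == length(obj$method),
    "length of method vector cannot exceed number of variables" =
      length(obj$method) <= p,
    "length of imp and number of variables must be equal" =
      length(obj$imp) == p
  )
  names(checks)[!checks]
}

# Mock object with consistent components (assumed structure)
data <- data.frame(age = c(1, 2, NA), bmi = c(22, NA, 25))
obj <- list(
  data = data,
  predictorMatrix = matrix(c(0, 1, 1, 0), 2, 2,
                           dimnames = list(names(data), names(data))),
  formulas = list(age = age ~ bmi, bmi = bmi ~ age),
  blocks = list(age = "age", bmi = "bmi"),
  method = c(age = "pmm", bmi = "pmm"),
  imp = list(age = data.frame(), bmi = data.frame())
)

ok_problems <- exit_problems_sketch(obj)

# Break one component: method is now shorter than formulas
bad <- obj
bad$method <- c(age = "pmm")
bad_problems <- exit_problems_sketch(bad)
```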

stefvanbuuren commented 1 year ago

New behaviours and features thus far

TWO SEPARATE INTERFACES FOR MODEL SPECIFICATION: This version promotes two interfaces to specify imputation models: predictor (predictorMatrix + parcel + method) and formula (formulas + method). This version no longer accepts mixes of predictorMatrix and formulas arguments in the call to mice().
NA-PROPAGATION PREVENTION. This version detects when a predictor contains missing values that are not imputed. In order to prevent NA propagation, mice() can follow two strategies: "Autoremove" (remove incomplete predictor(s) from the RHS, set method to "", adapt predictorMatrix, formulas and blocks, write to loggedEvents), or "Autoimpute" (Impute incomplete predictor and adapt method, predictorMatrix, formulas, and so on). "Autoremove" is implemented and current default. Use mice(..., autoremove = FALSE) to revert to old behavior (NA propagation).
SUBMODELS: The predictorMatrix input can be a square submatrix of the full predictorMatrix when its dimensions are named. mice() will augment the tiny predictorMatrix to the full matrix and always return a p * p named matrix corresponding to the p columns in the data. Unmentioned variables are not imputed, and the predictorMatrix, formulas and method are adapted accordingly.
DROP NON-SQUARE PREDICTOR MATRIX: Version 3.0 introduced non-square versions, but their interpretation turned out to be complex and ambiguous. For clarity, this update works with a predictor matrix that is square, with both dimensions identically named with the names of the variables in the data. Variable groups are now specified through the parcel argument.
NEW PARCEL ARGUMENT. There is a new parcel argument that is easier to use. The print of the mids object shows parcel when it is different from the default. parcel can take over the role of blocks in specification. blocks is soft-deprecated, but still widely used within the program code.
NEW DOTS ARGUMENT. The blots argument is renamed to dots
EXIT VALIDATION: Adds a new validate.mids() function that checks the mids object before exit.
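
The "Autoremove" strategy from the NA-propagation item can be sketched in base R as below. The toy data, the method vector and the formulas are assumptions for illustration. The sketch handles main-effects formulas only (it relies on all.vars()) and is not the implementation in this PR.

```r
# Toy data: hyp is incomplete but will not be imputed
data <- data.frame(age = c(1, 2, 3, 1),
                   bmi = c(22, NA, 25, NA),
                   hyp = c(1, NA, 1, 2),
                   chl = c(187, NA, 199, NA))
method <- c(age = "", bmi = "pmm", hyp = "", chl = "pmm")
formulas <- list(bmi = bmi ~ age + hyp + chl,
                 chl = chl ~ age + bmi + hyp)

# Predictors that would propagate NA: incomplete and not imputed
incomplete <- names(data)[colSums(is.na(data)) > 0]
not_imputed <- names(method)[method == ""]
offenders <- intersect(incomplete, not_imputed)

# Remove offenders from the right-hand side of every formula
cleaned <- lapply(formulas, function(f) {
  response <- all.vars(f)[1]
  rhs <- setdiff(all.vars(f)[-1], offenders)
  if (length(rhs) == 0) rhs <- "1"   # empty predictor set
  reformulate(rhs, response = response)
})
```

Here hyp is the only offender, so it is dropped from both formulas while age (complete) and the imputed variables stay in.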

stefvanbuuren commented 1 year ago

Three proposed changes to new behaviour

NA-PROPAGATION. It is better to use NA-PROPAGATION by default. The reason is that the user becomes aware of a potential model specification problem (e.g. not imputing a variable used as a predictor). mice() should offer two easy ways to solve the problem: "autoremove" and "autoimpute". We prefer the NA-PROPAGATION default because it alerts the user, whereas the other two options would "magically" make the problem disappear (and thereby downgrade model specification hygiene).
The formula of a complete variable is now something like age ~ 1. It is better to use age ~ 0, to signal that for the dependent not even the intercept-only model is used.
The formulas argument returns an environment attached to each formula. This environment does not seem to be necessary in mice(), so it is cleaner to remove the environment.

amices / mice