Decide which cells to update after post-processing

amices / mice

Multivariate Imputation by Chained Equations

https://amices.org/mice/

GNU General Public License v2.0

433 stars 107 forks source link

Decide which cells to update after post-processing #511

Closed thomvolker closed 1 year ago

thomvolker commented 1 year ago

This commit changes the post-processing command in the sampler to its previous state. I still believe we have to discuss this matter, but this fix at least fixes potential backward compatibility issues for now. We can make an informed decision later.

gerkovink commented 1 year ago

I agree; let's discuss and study the impact before we change.

stefvanbuuren commented 1 year ago

If I understand correctly

data[where[, j], j] <- imp[[j]][, i] updates the j'th column in the working data within the i'th impution for cells as specified in where;
data[(!r[, j]) & where[, j], j] <- imp[[j]][, i] updates only those cells that are also missing.

I would be inclined to go for option 2 because that is consistent with line 85. Hence option 2 only affects the same cells in the working data, and prevent overwriting of observed data. I think the RHS should then also be something like imp[[j]][(!r[, j])[where[, j]], i] in order to have the same length of vectors.

Could we come up with a small numerical example to decide on which option is most sensible?

thomvolker commented 1 year ago

When imputing all or a subset of the missing values, the two are equivalent, so it does not pose any issues here. Only when overimputing certain cells, the two ways of doing this will give different results. In the case of synthetic data, I think you want to work with the observed data, because that is how all other variables are synthesized as well. Changing the values in the variable $j$ of the data object de facto changes the data used in the synthesis models of all variables that are synthesized after $j$, relative to the data that is used to synthesize the variables before $j$.

In practice, I'm not sure whether it makes a lot of difference (as long as the synthesis model is reasonably good), but it might add some additional variance.

gerkovink commented 1 year ago

We can simulate both scenarios for synthetic data and see.

stefvanbuuren commented 1 year ago

@thomvolker Agree. There is a difference only in the case of overimputation.

I interpret your response as support for option 2. Option 1 overwrite the observed data. I cannot (yet) think of a scenario where we would want that.

stefvanbuuren commented 1 year ago

Do we have a test of post-processing combined with overimputation?

I think that LHS and RHS vector have different lengths, resulting in an error.

thomvolker commented 1 year ago

I indeed support option two.

The one scenario I can think of is that you know (a) that your imputation model is biased, and (b) that you know exactly in what way it is biased, such that you can correct for it. I am not sure whether this ever occurs in practice.

You are correct with respect to the LHS and RHS being different, producing an error. I have never noticed this because I only used post-processing in combination with complete overimputation, in which case the LHS is an empty vector (and the length of the RHS does not matter anymore).

stefvanbuuren commented 1 year ago

Commit 889d824 sets option 2, and makes the relevant lines identical to lines 85/86.