Closed thomvolker closed 1 year ago
I agree; let's discuss and study the impact before we change.
If I understand correctly
data[where[, j], j] <- imp[[j]][, i]
updates the j'th column in the working data within the i'th impution for cells as specified in where
; data[(!r[, j]) & where[, j], j] <- imp[[j]][, i]
updates only those cells that are also missing.I would be inclined to go for option 2 because that is consistent with line 85. Hence option 2 only affects the same cells in the working data, and prevent overwriting of observed data. I think the RHS should then also be something like imp[[j]][(!r[, j])[where[, j]], i]
in order to have the same length of vectors.
Could we come up with a small numerical example to decide on which option is most sensible?
When imputing all or a subset of the missing values, the two are equivalent, so it does not pose any issues here. Only when overimputing certain cells, the two ways of doing this will give different results. In the case of synthetic data, I think you want to work with the observed data, because that is how all other variables are synthesized as well. Changing the values in the variable $j$ of the data object de facto changes the data used in the synthesis models of all variables that are synthesized after $j$, relative to the data that is used to synthesize the variables before $j$.
In practice, I'm not sure whether it makes a lot of difference (as long as the synthesis model is reasonably good), but it might add some additional variance.
We can simulate both scenarios for synthetic data and see.
@thomvolker Agree. There is a difference only in the case of overimputation.
I interpret your response as support for option 2. Option 1 overwrite the observed data. I cannot (yet) think of a scenario where we would want that.
Do we have a test of post-processing combined with overimputation?
I think that LHS and RHS vector have different lengths, resulting in an error.
I indeed support option two.
The one scenario I can think of is that you know (a) that your imputation model is biased, and (b) that you know exactly in what way it is biased, such that you can correct for it. I am not sure whether this ever occurs in practice.
You are correct with respect to the LHS and RHS being different, producing an error. I have never noticed this because I only used post-processing in combination with complete overimputation, in which case the LHS is an empty vector (and the length of the RHS does not matter anymore).
Commit 889d824 sets option 2, and makes the relevant lines identical to lines 85/86.
This commit changes the post-processing command in the sampler to its previous state. I still believe we have to discuss this matter, but this fix at least fixes potential backward compatibility issues for now. We can make an informed decision later.