Closed ChristosPol closed 2 years ago
It seems that you are imputing the composition as just another variable. It would be correct to impute the deterministic relation via passive imputation. We have a vignette that details passive imputation
@gerkovink Thank you for your comment.
I did try the passive imputation, I am still getting biased results. It must be something else I am missing.
theta_i_imp <- c()
for (i in 1:length(samples_NA)){
data <- copy(samples_NA[[i]])
data$score <- (data$Sepal.Length + data$Sepal.Width + data$Petal.Length + data$Petal.Width)/4
meth <- make.method(data)
meth["score"] <- "~I((Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)/4)"
pred <- make.predictorMatrix(data)
pred[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), "score"] <- 0
imp.pas <- mice(data, meth = meth, pred = pred,
print = FALSE)
fit <- with(imp.pas, (Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)/4)
theta_i_imp[i] <- mean(unlist(lapply(fit$analyses, mean)))
print(i)
}
Perhaps you could simplify the problem, and create missing data in only one instead of three variables at a time.
@stefvanbuuren Thank you for your comment. In our simulations we cover a large range of missing "scenarios". We test different missing proportions with different missing number of variables and sample sizes. We rotate which variables are missing as well. See it as a fully factorial design. So we have cases where things are simpler, I just chose one case where I was able to reproduce the problem easily and consistently with the iris dataset.
Regarding your feedback above, why would it make a difference if there are missing values in one variable vs three ?
Thanks again, Christos
Just want to make sure that maxit
is high enough. It would be good to know whether the problem also occurs with one incomplete variable and with maxit = 1
. If things are fine in that case, then you might have a convergence issue.
Hi @stefvanbuuren
Same code as above with maxit = 1
(3 variables missing).
0.0323509 [0.02528761, 0.03941419].
If I set only Sepal.Length
to NA and maxit = 1
. (1 variable missing)
-0.0035049 [-0.006931213, -0.00007858704]. So still something is weird.
When I tried only Petal.Width
the estimate came unbiased.
I have tried increasing m
and maxit
and was not helpful with my current issue.
Inspecting visually I don't see a problem with convergence as well.
Thanks, Christos
The mean of Petal.width
in the complete data is 1.2, so the percent bias (PB) is equal to 100*0.0035/1.2 = 0.3%. One rule of thumb is that we speak of bias when PB > 5%. Here it is clearly lower.
http://stefvanbuuren.name/fimd/sec-evaluation.html#sec:evaluationcriteria
@stefvanbuuren I see, thank you for the reference and reply. Indeed the magnitude of bias is extremely small in this case. For us is very important to understand why and what could in theory explain this behaviour. Especially if you want to trust your analysis and want to publish some findings.
MICE vs Complete case analysis should yield very similar results when we resample thousands of times completely at random in a controlled simulation setting. This was the hypothesis that we had even before starting coding. Any additional help understanding this would be greatly appreciated! Thanks again, Christos
I guess it's a combination of a small sample, predictive mean matching and a skewed distribution that causes the (tiny) bias. Tim Morris and Florian Meinfelder made detailed studies of PMM for regions with sparse data.
Life's not perfect, and I would be happy with a bias of this size. I would be much more worried if we didn't find proper coverage of the confidence interval. Did you look at that also?
Hi @stefvanbuuren , thank you I will have a look on the study you mentioned. Yes, we look into coverage as well as empirical variance and model variance. We noticed some weird results also there.
Running our simulation 2000 times, with passive imputation and only 1 variable missing (Sepal.Length) and default settings I have:
Looks ok, as expected empirical standard error of MICE is larger than the true variance of the complete dataset (red line, analytically calculated)
Now this is a bit problematic. Mice's model variance seems to be underestimating the true variance. That could indicate that we might have problems with coverage. I would expect higher model variance than the true, resulting in Relative % error in ModSE around 0%.
All performance metrics are calculated based on Morris et al (2019)
Do you have any input on the model variance and coverage issues? When we have more variables missing and trying to impute with MICE these problems are getting amplified.
Thanks again, Christos
I suspect coverage will improve if you replace z = 1.96 for infinite samples by the t-distribution.
Hi all, I am writing here as we are experiencing some unexpected behavior of MICE in our simulations.
The setup goes as follows:
I have a composite score that is consisting of 4 components. That score is simply the average of the 4 components. Our goal is to multiply impute the missing component values in order to have an unbiased and precise estimate for the score. Our simulation’s estimand is the expected value of this score.
In our project, we draw nsim random samples from the original fully observed dataset. Once done that, I introduce missing values in a monotonic missing pattern. Simply, if there is a missing entry, its missing from all components(check code how it is done):
The underlying mechanism is MCAR as these missing values are introduced completely at random.
Once, we have the imputed datasets, we use the pooling process to get our estimates.
The code below is a reproducible example with the iris dataset. In reality the score I am referring to is a patient reported disease activity score consisting of 6 questions.
Given the results above, I can refer that the estimate of the expected value of the score is biased. This is something unexpected, and we cannot understand it. Under MCAR, both complete cases analysis and multiple imputation should be unbiased.
I understand that the dataset size is small (100), and I set 70% of it missing, but it still should be unbiased. Possibly larger Cis but still unbiased. I also use the default settings for m and maxit in the mice function.
Obviously, we are missing something and we would appreciate it very much if you could give us some help.
Thank you and best, Chris