Oracen-zz / MIDAS

Multiple imputation utilising denoising autoencoder for approximate Bayesian inference
Apache License 2.0

Categorical variables and multiple imputation #14

Closed richardwu closed 5 years ago

richardwu commented 5 years ago

For categorical variables, I understand that we one-hot encode them and take the argmax as the imputation result.
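A minimal sketch of that argmax step, for illustration only. The level names and probability values are made up, and this is not taken from the MIDAS code itself; it only assumes the model emits one score per one-hot column.

```python
import numpy as np

# Hypothetical category levels and per-level model outputs
# (one row per record, one column per one-hot level).
levels = np.array(["red", "green", "blue"])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# The imputed category is the level with the largest score in each row.
imputed = levels[probs.argmax(axis=1)]
print(imputed)
```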

With multiple iterations, numerical values are averaged and the resulting mean is taken as the model's prediction. What is the recommended way to do this for categorical variables? Should the plurality be taken as the final imputation?
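To make the question concrete, here is a rough sketch of the two pooling options: a mean across draws for numerical cells and a plurality vote across draws for categorical cells. All arrays are invented for illustration (m = 5 draws), categories are assumed to be stored as integer codes, and ties go to the lowest code. This shows what the question describes, not a recommended procedure.

```python
import numpy as np

# Hypothetical imputations: rows are the m = 5 draws, columns are missing cells.
numeric_draws = np.array([
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 14.0],
    [4.0, 16.0],
    [5.0, 18.0],
])
cat_draws = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
    [0, 2, 2],
    [0, 1, 0],
])

# Numerical cells: average across draws.
numeric_point = numeric_draws.mean(axis=0)

def plurality(codes, n_levels):
    """Return the most frequent code per column (ties go to the lowest code)."""
    counts = np.stack([(codes == k).sum(axis=0) for k in range(n_levels)])
    return counts.argmax(axis=0)

# Categorical cells: plurality vote across draws.
cat_point = plurality(cat_draws, n_levels=3)
```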

Additionally, would it be valid to simply take a single iteration's imputation result as the model's prediction? Are there any bounds on the bias of the model as a function of the number of iterations?

Thanks!

ranjitlall commented 5 years ago

Some advice on these issues can be found in Section 3.2 of: Lall, Ranjit. "How multiple imputation makes a difference." Political Analysis 24, no. 4 (2016): 414-433.

What do you plan to do with the m imputed datasets? If you'll be analyzing them, you should leave them as they are and combine the results of the m separate analyses using the "Rubin combination rules."
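For reference, a small sketch of the standard form of Rubin's rules for a single scalar estimate: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()               # pooled point estimate
    w = u.mean()                   # within-imputation variance
    b = q.var(ddof=1)              # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b    # total variance
    return q_bar, t

# Hypothetical coefficient estimates and variances from m = 3 separate analyses.
q_bar, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The square root of `total_var` would be the pooled standard error. This sketch omits the degrees-of-freedom adjustment used for confidence intervals.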

In general, you'll need at least several imputed datasets for valid estimation. Lall's suggested rule of thumb is that m should be equal to the average missing-data rate of all variables in the imputation model.