Pooling rules for creating synthetic data with mice

amices / mice

Multivariate Imputation by Chained Equations

https://amices.org/mice/

GNU General Public License v2.0


Pooling rules for creating synthetic data with mice #436

Closed thomvolker closed 3 years ago

thomvolker commented 3 years ago

As discussed with @gerkovink, the pool.syn() and pool.scalar.syn() pooling functions apply the rules developed by Reiter (2003) to combine analyses on multiply imputed synthetic datasets. Note that these rules only apply to synthetic versions of completely observed datasets. If the data to synthesize contain missing values, different pooling rules apply, and these require a two-step approach to imputation (first impute the missing values, then synthesize each of the m imputed datasets). Developing a one-step approach would be something for future research.
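For readers who want to see what the rule amounts to for a single scalar estimate, here is a minimal base-R sketch of the Reiter (2003) combining rules as I understand them. The function name pool_reiter2003() and the toy numbers are made up for illustration and are not part of mice; the actual code in the PR may differ in detail.

```r
# Hypothetical helper (not a mice function): pool m estimates Q and their
# variance estimates U with the combining rules of Reiter (2003).
pool_reiter2003 <- function(Q, U) {
  stopifnot(length(Q) == length(U), length(Q) > 1)
  m    <- length(Q)
  qbar <- mean(Q)        # pooled point estimate
  ubar <- mean(U)        # average within-dataset variance
  b    <- var(Q)         # between-dataset variance
  t    <- ubar + b / m   # total variance; Rubin's rules use ubar + (1 + 1/m) * b
  df   <- (m - 1) * (1 + ubar / (b / m))^2   # degrees of freedom
  list(qbar = qbar, ubar = ubar, b = b, t = t, df = df)
}

# Toy input: five estimates and their squared standard errors (invented numbers).
Q <- c(1.0, 1.2, 0.9, 1.1, 0.8)
U <- c(0.04, 0.05, 0.04, 0.05, 0.04)
res <- pool_reiter2003(Q, U)
print(res)
```

The only difference from the usual missing-data pooling is the total variance and the degrees of freedom, which is why applying the wrong rule to synthetic data would overstate the uncertainty.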

gerkovink commented 3 years ago

@stefvanbuuren Can we prioritise this PR?

stefvanbuuren commented 3 years ago

Thanks for the PR.

There's a lot of duplicated code. I will look into the possibility of integrating this functionality as an extra argument to the regular pool() function.

thomvolker commented 3 years ago

I completely agree that this PR is mostly duplicate code. The reason to still write an additional function was to protect uninformed users against using wrong pooling rules. Still, an additional argument is probably more elegant.

stefvanbuuren commented 3 years ago

mice 3.13.15 adds a new rule argument to pool() and pool.scalar() and redefines pool.syn() and pool.scalar.syn() as wrappers. This removes almost all duplication and is extendable as other pooling rules come along.

Use pool.syn() and pool.scalar.syn() in code for synthetic data, and reserve pool() and pool.scalar() for missing data uses.
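A rough sketch of how this looks in practice, assuming mice 3.13.15 or later is installed. The data set (a subset of mtcars), the "cart" method, the all-TRUE where matrix used to overimpute every cell, and the regression model are illustrative choices of mine, not something prescribed in this thread.

```r
# Sketch only: data, model and settings are illustrative assumptions.
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  set.seed(436)

  dat <- mtcars[, c("mpg", "wt", "hp")]   # completely observed data

  # Assumption: create synthetic data by overimputing every cell.
  where <- matrix(TRUE, nrow(dat), ncol(dat), dimnames = dimnames(dat))
  syn <- mice(dat, m = 5, maxit = 1, method = "cart",
              where = where, printFlag = FALSE)

  # Fit the same model on each synthetic dataset.
  fit <- with(syn, lm(mpg ~ wt + hp))

  pooled_syn  <- pool.syn(fit)                    # wrapper intended for synthetic data
  pooled_rule <- pool(fit, rule = "reiter2003")   # same rule via the new argument

  print(summary(pooled_syn))
}
```

Both calls should give the same pooled variances; using pool.syn() in synthetic-data code simply makes the intent explicit and keeps pool() with its default rule for missing-data analyses.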

gerkovink commented 3 years ago

Nice indeed to separate the workflow between pool() and pool.syn()