amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Does 2l.pmm bias correlations downwards? #462

Closed: skramer1958 closed this issue 2 years ago

skramer1958 commented 2 years ago

I'm trying to impute data on a multilevel data set (students in classrooms in schools). I'm using 2l.pmm to account for school-level correlations, since school was the unit of assignment in my experimental study.

However, I notice that the 2l.pmm method tends to impute variables with much lower correlations than were in the original data set. The problem gets worse the more variables I include in the multiple imputation. For example, students took three science unit tests over two years: two in sixth grade (units called DL and WW) and one in seventh grade (unit called EH). There is a lot of missing data for each test. Here are the original correlations for a subset of the data (control group in cohort 1 in one particular state):

Correlation between DL and WW: 0.555
Correlation between DL and EH: 0.534
Correlation between WW and EH: 0.520

Using a large data set and a bunch of other relevant variables, many of which had missing data (race, gender, disadvantaged status, school mean on minority and disadvantaged, fourth and fifth grade math and reading scores, classroom averages on most of these variables), I imputed a data set using pmm. The first imputation gives a good idea of the results. Here were the correlations on the first imputation:

Correlation between DL and WW: 0.547
Correlation between DL and EH: 0.500
Correlation between WW and EH: 0.542

But when I used 2l.pmm to impute these three and other variables with missing data (school as cluster variable), the correlations were much lower:

Correlation between DL and WW: 0.421
Correlation between DL and EH: 0.331
Correlation between WW and EH: 0.250

Now it is reasonable that correlations might be a bit attenuated in the full data set if, for example, students with missing data tend to be lower scorers who have lower inter-correlations. But the results above don't seem to pass the "sniff test". The attenuation is too much. Plus, these are results I obtained after dropping a bunch of variables from the imputation model, because the more variables I add, the lower the correlations get.

Note that these results are for a "subset of the data", but I get the same thing on all of the subsets. If I don't subset and instead include interactions in the model, the correlations get even lower for 2l.pmm. Meanwhile, the pmm approach continues to reproduce correlations very near those in the original data set.

How can I know if 2l.pmm is worth using? That is, how can I diagnose whether the imputed data sets are reasonable, and whether the 2-level imputation is worse (more biased) than single-level imputation, or perhaps even worse (more biased) than listwise deletion?
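The comparison described above (observed-data correlations versus correlations in each completed data set) can be scripted so it is easy to repeat for any imputation method. What follows is only a rough sketch under stated assumptions: `cor_by_imputation` is a hypothetical helper, not a mice function; mice's bundled `nhanes` data and its default method (pmm for numeric columns) stand in for the real test-score data; and plain averaging of the correlation matrices is a quick check, not a formal pooling rule. A second `mids` object fitted with 2l.pmm could be passed to the same helper for a side-by-side comparison.

```r
library(mice)

# Hypothetical helper (not part of mice): correlation matrix of the
# selected variables in each completed data set of a mids object.
cor_by_imputation <- function(imp, vars) {
  lapply(seq_len(imp$m), function(i) cor(complete(imp, i)[, vars]))
}

# Stand-in data: mice's bundled nhanes set, imputed with the default
# method (pmm for numeric columns).
imp_pmm <- mice(nhanes, m = 5, seed = 462, printFlag = FALSE)
vars <- c("age", "bmi", "chl")

# Correlations among the observed values only (pairwise deletion).
cor_obs <- cor(nhanes[, vars], use = "pairwise.complete.obs")

# Simple average of the per-imputation correlation matrices.
cors <- cor_by_imputation(imp_pmm, vars)
cor_imp <- Reduce(`+`, cors) / length(cors)

# Negative entries mean the imputed correlations are lower than the
# observed ones; positive entries mean they are higher.
round(cor_imp - cor_obs, 3)
```

Looking at all imputations rather than only the first also shows how much the correlations vary from one completed data set to the next.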

hanneoberman commented 2 years ago

Hi @skramer1958, would it be possible to post a reproducible example for this issue? See e.g. https://reprex.tidyverse.org/articles/articles/learn-reprex.html. FYI, another useful resource is the mice vignette about imputing multilevel data: https://www.gerkovink.com/miceVignettes/Multi_level/Multi_level_data.html.

skramer1958 commented 2 years ago

Hanne,

This may be a foolish question, but the webinar at https://reprex.tidyverse.org/articles/articles/learn-reprex.html uses something called RStudio, which is not the same as R. Do I need to work through tutorials on RStudio, then download and start using it, before I can begin making a reproducible example?

Steve


hanneoberman commented 2 years ago

Hi Steve,

No, you don't need RStudio for {reprex} to work. RStudio is just a sort of user interface to run R (among other things). The package's GitHub repo, https://github.com/tidyverse/reprex, explains that the reprex() function will render HTML for you, which you can open in your browser.
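As a minimal sketch of what that could look like from a plain R console, assuming the reprex and mice packages are installed: the code inside the braces is only a placeholder built on mice's bundled `nhanes` data, not the multilevel model from this issue, and would be replaced by the smallest script that shows the problem.

```r
# Run in a plain R session; RStudio is not required.
# install.packages(c("reprex", "mice"))  # one-time setup, if needed
library(reprex)

reprex({
  # Placeholder example: swap in the smallest code that reproduces the issue.
  library(mice)
  imp <- mice(nhanes, m = 2, seed = 1, printFlag = FALSE)
  cor(complete(imp, 1))
}, venue = "gh")  # "gh" requests GitHub-flavored Markdown output
```

The rendered Markdown, code plus output, can then be pasted into a comment here. In an interactive session reprex should also try to copy the result to the clipboard and open an HTML preview in the browser.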