Matteo21Q / jomo

R package for Joint Modelling Imputation

Include fully-observed variables as a response or a covariate in MVNI? #2

Open willizhang opened 1 year ago

willizhang commented 1 year ago

Hello! I'm unsure if I'm in the right place to ask a question. :) I've been utilizing the JOMO package and have found immense value in reading several of your research papers. They've provided me with a wealth of information and guidance. I currently have a question about the best approach for including fully-observed variables within MVNI models: should they be considered responses or covariates? I would greatly appreciate your insights on this matter.

Suppose I have a substantive analysis model, with missing values in X1 ONLY: Y = b0 + b1 * X1 + b2 * X2 + b3 * X3 + b4 * X2 * X3

I would like to use MVNI to impute missing values in JOMO.

My questions related to MVNI are:

  1. Considering congeniality, how can one decide whether to include the fully-observed variables (X2, X3, Y) in the MVNI model as response variables or as covariates?
  2. Can the fully-observed interaction term X2 * X3 be included as a covariate?
  3. In this case, since only one variable has missing values, it seems that MVNI is not applicable if X2, X3 and Y are all included as covariates for imputing X1.

References to question 1 and 2: Question 1: In the Book "Carpenter JR, Kenward MG. Multiple Imputation and Its Application (First Edition). John Wiley & Sons, Ltd. 2013." (p. 129), it says that "For fully observed continuous and binary variables, the conditional distribution for imputing the partially observed variables will be practically equivalent, whether they are included as a response or as covariates."

Question 2: a. In the same book (p. 131): "Summarising, when a quadratic, and in general nonlinear, relationship involving a fully observed variable is important in the substantive model, this nonlinear relationship must be included in the linear predictor for each partially observed variable in the imputation model, whether a joint or FCS approach is adopted."

b. In your paper "jomo: A Flexible Package for Two-level Joint Modelling Multiple Imputation" (https://discovery.ucl.ac.uk/id/eprint/10078316/), on p. 21, it says "When interactions or non-linear terms are present in the model of interest, ignoring them in the imputation model may lead to bias; instead, they should be included as covariates (Carpenter and Kenward, 2013, p. 130)."

Thank you so much again for these very helpful papers!

Matteo21Q commented 1 year ago

Hi Willi,

Thanks for getting in touch and for your kind words!

This is a good question, and I can see how one could easily get confused about it!

  1. In general, the advantage of including fully-observed variables as covariates is that you do not make any additional distributional assumptions about them. The disadvantage is that in certain cases it might be trickier to explore congeniality/compatibility. In the past I have found that including the outcome variable Y as an outcome in the imputation model as well, even if it is fully observed, can make exploring congeniality easier. There are situations where I found the same for other covariates, but that is much rarer.

In your specific example, because X2 and X3 enter the analysis model with an interaction, I would definitely include them as covariates and include their interaction as well. However, if you had an imputation model like: X1 = a0 + a1 Y + a2 X2 + a3 X3 + a4 X2 * X3

This would guarantee an interaction effect on X1, not on Y. So, I would instead include X1 and Y as outcomes in the imputation model, and X2, X3 and X2*X3 as covariates.
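To make the reasoning above concrete, here is a sketch of the bivariate imputation model being suggested. The Greek coefficient names and the covariance notation are my own, not taken from the thread or the jomo documentation.

```latex
\begin{pmatrix} Y_i \\ X_{1i} \end{pmatrix}
=
\begin{pmatrix}
\alpha_0 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + \alpha_4 X_{2i} X_{3i} \\
\gamma_0 + \gamma_2 X_{2i} + \gamma_3 X_{3i} + \gamma_4 X_{2i} X_{3i}
\end{pmatrix}
+
\begin{pmatrix} e_{Y,i} \\ e_{1,i} \end{pmatrix},
\qquad
\begin{pmatrix} e_{Y,i} \\ e_{1,i} \end{pmatrix}
\sim N\!\left(0,\;
\begin{pmatrix} \sigma_{YY} & \sigma_{Y1} \\ \sigma_{Y1} & \sigma_{11} \end{pmatrix}\right)

% Conditional mean of Y implied by the joint model (standard bivariate normal result):
E[Y_i \mid X_{1i}, X_{2i}, X_{3i}]
= \alpha_0 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + \alpha_4 X_{2i} X_{3i}
+ \frac{\sigma_{Y1}}{\sigma_{11}}
\left( X_{1i} - \gamma_0 - \gamma_2 X_{2i} - \gamma_3 X_{3i} - \gamma_4 X_{2i} X_{3i} \right)
```

The implied conditional mean is linear in X1, X2, X3 and X2*X3, which is the form of the substantive model Y = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X2 * X3 (with, for example, b1 = σ_Y1 / σ_11).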

  2. Yes, you can. In standard jomo you will have to manually generate the X2*X3 variable and include it in argument X. I think if you use the jomoImpute function in mitml you can instead use the formula argument to include interactions with
  3. MVNI works even with m=1. Of course it will basically be just NI, and would be practically equivalent to FCS/MICE, but there is no theoretical barrier. However, as stated, here it is probably easier to have a 2-variate imputation model with Y and X1 as outcomes.
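As a rough illustration of the setup described in this reply, the following sketch builds the two inputs for a single-level jomo call: Y and X1 as outcomes, and a constant, X2, X3 and a manually generated interaction column as covariates. The data are simulated, the object names (`Y.imp`, `X.imp`, `X2X3`) are hypothetical, and the chain lengths are illustrative rather than tuned.

```r
# Simulated data; variable names follow the thread, values are made up
set.seed(1)
n  <- 200
X2 <- rnorm(n)
X3 <- rnorm(n)
X1 <- 0.5 * X2 - 0.3 * X3 + rnorm(n)
Y  <- 1 + X1 + X2 + X3 + 0.5 * X2 * X3 + rnorm(n)
X1[sample(n, 40)] <- NA   # missing values in X1 only

# Outcomes of the imputation model: the partially observed X1 plus Y
Y.imp <- data.frame(Y = Y, X1 = X1)

# Covariates: constant, X2, X3 and the manually generated interaction
X.imp <- data.frame(const = 1, X2 = X2, X3 = X3, X2X3 = X2 * X3)

# Run the imputation only if jomo is installed
if (requireNamespace("jomo", quietly = TRUE)) {
  imp <- jomo::jomo(Y = Y.imp, X = X.imp,
                    nburn = 500, nbetween = 500, nimp = 5, output = 0)
  # imp stacks the original data (Imputation == 0) and the imputed datasets
}
```

In a real analysis the burn-in and between-imputation iterations would need to be checked for convergence rather than taken from this sketch.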

Hope this helps!

Matteo



willizhang commented 1 year ago

Hi Matteo,

Thank you so much for sharing your insights and expertise about this. It is very helpful! :D

Best regards, Willi

willizhang commented 1 year ago

Hi Matteo,

I am currently utilizing a weighted multinomial logistic regression model for my substantive analysis:

Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 + b4*X3 + b5*X4

Here, both Y (unordered categorical variable) and X3 (ordinal variable) have missing values.

Inspired by your recent publications, I intend to implement a two-level MVNI model with the latent normal variable approach (paper 1: https://academic.oup.com/jssam/article/8/5/965/5569522; paper 2: https://onlinelibrary.wiley.com/doi/10.1002/bimj.201800222), which I found really helpful :). Here is a simplified R script for my proposed model using the jomo package:

jomo( Y = data[ , c( "Y", "X3" ) ], # suppose data includes all the variables
      X = data[ , c( "constant", "X1", "X2", "X1_X2_interaction", "X4" ) ], # constant = 1
      clus = data$weight_strata, # level-two strata defined using weight
      meth = "random" )

Congeniality concerns (I don’t know if I understand correctly): It would be essential that the MVNI model reflect the interaction between X1 and X2 as well as the interactions between survey weights and all covariates (X1 through X4) as presented in the substantive model.

If this is true, my related questions are: Q1. In light of our earlier discussion, the MVNI model should include the fully-observed variables X1, X2, and X1*X2 as covariates. However, what is the proper way to include the fully-observed X4 in the two-level MVNI – as a covariate (as in the R script) or as a response variable? (reference to the question below)

Q2. If I restrict the response variables to only the partly-observed Y and X3, it seems that the model would not adequately capture the interactions between the weight variable and X1, X2, and X4. Would this be an important issue? :)

Q3. How can auxiliary variables be incorporated into a (multi-level) MVNI model? Would it be reasonable to consider the following strategy regardless of single-level or multilevel: include fully-observed auxiliary variables as covariates and partly-observed ones as dependent variables?

References: In your paper “jomo: A Flexible Package for Two-level Joint Modelling Multiple Imputation”, p. 4,

In joint modelling imputation, partially observed variables are dependent variables. However, as hinted above, with fully observed variables we can choose to either condition on them as predictors or include them in the (multivariate) response. The software is equally comfortable with both options, and it makes little difference in practice for single-level data. However, the choice has a bigger impact for clustered data, as we will see in the multilevel imputation section.

Another question related to the paper: p. 8:

Fully observed binary covariates can be included in the X matrix of the imputation model as type numeric, exactly as with sex in this example. To include fully observed categorical covariates with three or more categories, appropriate dummy variables have to be created. For this purpose, we might use the R package dummies (Brown, 2012) or the function contrasts in base R.

Q4. Suppose X1 is a categorical variable with 5 categories, can it be included as it is in the script, or should it be coded in a specific way?

Your insights would be highly appreciated. Thank you so much for your time and expertise!

Best regards, Willi

Matteo21Q commented 1 year ago

Hi Willi,

Yes, you are correct: ideally you need an imputation model that is both compatible with the interaction X1*X2 and allows for interaction between all domains and the weight strata. One way to allow for stratum-specific effects of variables in this case would be to include a random effect for the variables you need to include as predictors in the imputation model. So, for example, put X1 and X2 in the Z design matrix alongside the constant. You could either include X4 in the same way or move it to the outcome of the imputation model. This should answer Q1 and Q2.

Regarding Q3: yes, you can do that. I think it’s always just a matter of weighing the pros and cons of including them as outcomes or as covariates in the imputation model. Given that they are not part of the analysis model, the choice should have no compatibility implications, I think (unless I am missing something). So just choose depending on how comfortable you are with making additional parametric assumptions about the auxiliary variables, how many parameters you add to the model, whether the model fits quickly enough, etc. (And of course, if they are partially observed, you can just include them as outcomes.)
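The suggestion to put X1 and X2 in the Z design matrix could look roughly like the sketch below. It uses simulated data with X1 simplified to a binary variable; the names `strata`, `const`, `X1X2` and the `*.imp` objects are hypothetical stand-ins for the variables in the question, and the chain lengths are illustrative only.

```r
# Simulated two-level data; 'strata' stands in for weight_strata
set.seed(2)
n   <- 300
dat <- data.frame(
  strata = rep(1:10, each = 30),
  X1     = rbinom(n, 1, 0.5),   # simplified to binary for this sketch
  X2     = rnorm(n),
  X4     = rnorm(n)
)
# Partially observed categorical outcomes, stored as factors
dat$Y  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
dat$X3 <- factor(sample(1:3, n, replace = TRUE))
dat$Y[sample(n, 30)]  <- NA
dat$X3[sample(n, 30)] <- NA

dat$const <- 1
dat$X1X2  <- dat$X1 * dat$X2    # manually generated interaction

Y.imp <- dat[, c("Y", "X3")]                            # imputation outcomes
X.imp <- dat[, c("const", "X1", "X2", "X1X2", "X4")]    # fixed-effect covariates
Z.imp <- dat[, c("const", "X1", "X2")]                  # random intercept and slopes

# Run the imputation only if jomo is installed
if (requireNamespace("jomo", quietly = TRUE)) {
  imp <- jomo::jomo(Y = Y.imp, X = X.imp, Z = Z.imp, clus = dat$strata,
                    meth = "random", nburn = 200, nbetween = 200,
                    nimp = 2, output = 0)
}
```

X4 is kept as a fixed-effect covariate here; the alternatives mentioned in the reply would be to add it to `Z.imp` or to move it into `Y.imp`.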

Finally, for Q4, if X1 is a 5-level categorical variable, you will have to create dummy variables first in standard jomo. Otherwise, it will treat the variable as continuous. However, if you use the jomoImpute interface in package mitml, I believe it should do the trick for you (but check, I am not 100% sure at the moment).
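One possible way to create the dummy variables for a 5-level X1 in base R is `model.matrix`, sketched below on simulated data. The category labels and object names are made up, and treatment (reference-level) coding is an assumption rather than something the thread specifies.

```r
# Simulated 5-level categorical X1 and a continuous X2
set.seed(3)
lev <- paste0("cat", 1:5)
X1  <- factor(sample(lev, 100, replace = TRUE), levels = lev)
X2  <- rnorm(100)

# Treatment-coded design: 4 dummies for X1, X2, and 4 dummy-by-X2 interactions.
# The intercept column is dropped here and re-added explicitly as 'const'.
design <- model.matrix(~ X1 * X2)[, -1, drop = FALSE]

# Covariate matrix that could be passed to the X argument of jomo
X.imp <- data.frame(const = 1, design)
```

The first level of X1 acts as the reference category, so each row has at most one dummy equal to 1.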

Hope this helps, Matteo



willizhang commented 1 year ago

Hi Matteo,

Thank you so much again for sharing your invaluable expertise and kindly helping me with my questions!


UPDATES: Regarding whether categorical variables included as predictors in the imputation model are treated as numeric or factor in jomoImpute, I checked the output from jomoImpute and found that categorical variables (with multiple categories) remain factors in the imputed datasets. :) So it seems that there is no need to create dummy variables for categorical variables that are included as predictors in the imputation model.

Warm wishes, Willi