amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
428 stars 107 forks source link

[Question] How to set up methods and predictors matrix for this longitudinal example? #540

Closed Generalized closed 1 year ago

Generalized commented 1 year ago

Dear @stefvanbuuren

I apologize for a probably trivial question, but I must have missed something trivial and cannot set up a simple example... thus asking for help.

The data: it's an example of a 2-arm 2-timepoints clinical trial. I want to compare treatment A vs. B, at timepoint V1 (visit 1) and V2.

It is a longitudinal study, so can be expressed as a multi-level model. It has the

In a real, multi-timepoint example, there would be much more timepoints, the responses would be correlated within cluster (patients) and might have different variance at each timepoint. But let's try this simplified example.

For simplicity let's assume there's just single response variable "val" (value). "valBas is the baseline, for which the model will be adjusted.

The data:

> data <- structure(list(id = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 9L, 10L, 6L, 7L, 8L, 9L, 10L), levels = c("1", 
"2", "3", "4", "5", "101", "102", "103", "104", "105"), class = "factor"), 
    tim = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
    2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), levels = c("Bas", 
    "V1", "V2"), class = "factor"), trt = structure(c(1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L), levels = c("A", "B"), class = "factor"), val = c(0.93451070190448, 
    0.844132115922294, 2.03975069147444, 1.30149437711243, -0.0531177589385456, 
    2.51757217231966, 2.94551129841814, 3.62138118865766, 3.37912768406252, 
    2.16395895144539, 6.05007166431368, 2.15298399368789, 4.21424733194768, 
    3.91257174705049, 2.73317503681323, 6.10316126083229, 3.71615586428783, 
    5.83494216711747, 6.06097572188509, 6.72093565459939), timBas = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), levels = c("Bas", "V1", "V2"), class = "factor"), 
    valBas = c(-0.445778264836677, -1.2058565689643, 0.0411263138456899, 
    0.639388407571143, -0.786554355912735, -0.445778264836677, 
    -1.2058565689643, 0.0411263138456899, 0.639388407571143, 
    -0.786554355912735, 0.170057481208407, 0.155078715940733, 
    0.0249318673672384, -2.04658541402115, 0.213154105608615, 
    0.170057481208407, 0.155078715940733, 0.0249318673672384, 
    -2.04658541402115, 0.213154105608615)), row.names = c(NA, 
-20L), class = "data.frame")

Which looks like this:
> data
    id tim trt         val timBas      valBas
1    1  V1   A  0.93451070    Bas -0.44577826
2    2  V1   A  0.84413212    Bas -1.20585657
3    3  V1   A  2.03975069    Bas  0.04112631
4    4  V1   A  1.30149438    Bas  0.63938841
5    5  V1   A -0.05311776    Bas -0.78655436
6    1  V2   A  2.51757217    Bas -0.44577826
7    2  V2   A  2.94551130    Bas -1.20585657
8    3  V2   A  3.62138119    Bas  0.04112631
9    4  V2   A  3.37912768    Bas  0.63938841
10   5  V2   A  2.16395895    Bas -0.78655436
11 101  V1   B  6.05007166    Bas  0.17005748
12 102  V1   B  2.15298399    Bas  0.15507872
13 103  V1   B  4.21424733    Bas  0.02493187
14 104  V1   B  3.91257175    Bas -2.04658541
15 105  V1   B  2.73317504    Bas  0.21315411
16 101  V2   B  6.10316126    Bas  0.17005748
17 102  V2   B  3.71615586    Bas  0.15507872
18 103  V2   B  5.83494217    Bas  0.02493187
19 104  V2   B  6.06097572    Bas -2.04658541
20 105  V2   B  6.72093565    Bas  0.21315411

Now let's prepare the data and remove a few observations from the response:

data_mis <- data
data_mis$id <- as.numeric(as.character(data_mis$id))
data_mis$tim <- as.factor(data_mis$tim)
data_mis$trt <- as.factor(data_mis$trt)

data_mis[c(5, 8, 14, 15), "val"] <- NA

Now I'm trying to set up the necessary parameters. I would like to use PMM method. I cannot guarantee (have no priori knowledge about) the normality of its distribution. PMM seems a safer option to the researchers I cooperate with.

I want to impute the missing variables while accounting for the cluster (patient) and the timepoint (tim).

I'm not sure, should I use this "pmm" or "2lonly.pmm"? Maybe I should use the 2l.pmm from the miceadds package?

imp0 <- mice(data_mis, maxit=0)
pred <- imp0$predictorMatrix
meth <- imp0$method
meth[c("val")] <- "2lonly.pmm"  #is this OK for my case?

pred[,"id"]  <- -2  # the cluster 
pred[,"tim"] <- 2  # time timepoint

Now, when I'm trying to impute the missing values, it says:

> data_imp <- mice(data_mis, m = 5, method = meth, predictorMatrix = pred, seed = 123)

 iter imp variable
  1   1  valError in .imputation.level2(y = y, ry = ry, x = x, type = type, wy = wy,  : 
  Method 2lonly.pmm found the following clusters with partially missing
  level-2 data: 3, 5, 9, 10
  Method 2lonly.mean can fix such inconsistencies.

With ordinary "pmm" it works fine, but does this method account for the clustering?

Let's try the package 2l.pmm from miceadds.

mice_imputation_2l_lmer Imputation of a Continuous or a Binary Variable From a Two-Level Regression Model using lme4 or blme

It fails if using the time variable (too many parameters for both correlated random slopes and intercepts)

> library(miceadds)
> meth[c("val")] <- "2l.pmm"

> pred[,"id"]  <- -2  
> pred[,"tim"] <- 2

> data_imp <- mice(data_mis, m = 5, method = meth, predictorMatrix = pred, seed = 123)

 iter imp variable
  1   1  valError: number of observations (=16) <= number of random effects (=20) for term (1 + timV1 | id); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable

but works if I skip the time and only leave the random intercepts:

> meth[c("val")] <- "2l.pmm"

> pred[,"id"]  <- -2  
> pred[,"tim"] <- 1  # just ordinary variable

> data_imp <- mice(data_mis, m = 5, method = meth, predictorMatrix = pred, seed = 123)

 iter imp variable
  1   1  val
  1   2  val
  1   3  val
  1   4  val
  1   5  val
  2   1  val
  2   2  val
  2   3  val
  2   4  val
  2   5  val
  3   1  val
  3   2  val
  3   3  val
  3   4  val
  3   5  val
  4   1  val
  4   2  val
  4   3  val
  4   4  val
  4   5  val
  5   1  val
  5   2  val
  5   3  val
  5   4  val
  5   5  val
Warning message:
Number of logged events: 25 

PS: here it's only one missing value "val". II may use more of them, say, "val1", "val2", "val3" (all numeric, of potentially different and not necessarily normal distributions) and maybe "val4" binary (yes/no), all with potential gaps.

Would be the setup accounting for ID (and optionally TIME) very different if using them? I saw very different setups of the predictor matrix... I want these additional variables to be included in imputed in a chain, one by one. Then they will be analysed as well.

stefvanbuuren commented 1 year ago

My advice is to organise your data in the wide matrix. Because observations are then exchangeable, it will be a lot easier to impute the data. See http://stefvanbuuren.name/fimd/sec-longandwide.html and http://stefvanbuuren.name/fimd/sec-fdd.html. Hope this helps.

Generalized commented 1 year ago

Thank you very much! Very helpful!