Inflated type 1 error using `nnet`

bcjaeger commented 2 months ago

Hello,

thank you for all you do with the mice package. I use it frequently and appreciate its succinct syntax and depth of its features.

Context of this issue

A colleague of mine wants to do single imputation with a prediction model, and I feel nervous about that because the "Imputation is not prediction" sections of the mice textbook make a very strong case for not doing it. So I wrote a simulation catered to my colleague's analysis to help illustrate that mice would be a safer option, but the results of the simulation aren't what I expected.

Describe the bug

When using mice with nnet to do multinomial regression, type 1 error appears to be a little higher than expected. This could be nnet's problem. I'm happy to try running the simulation with logistic regression instead if you'd like me to test that hypothesis. If it is a mice problem, I am sorry I can't point to exactly where in mice the problem might be at the moment.

Reproducible example

In the reprex below, I simulate a three category outcome, dx, and three variables that have no association to dx. I model dx two ways:

using a single imputation with a column named dx_predicted
using multiple imputation with mice

suppressPackageStartupMessages({
  library(tidyverse)
  library(data.table)
  library(nnet)
  library(mice)
})

simulation_run_reprex <- function() {

  dx_levels <- c("A", "B", "C")
  n_obs <- 5000

  data_iter <- data.table(
    dx = factor(sample(x = dx_levels,
                       size = n_obs,
                       replace = TRUE))
  )

  # disagree column is used to make some of the predicted
  # classes disagree with the true classes.

  # set_to_miss column is used to ampute the true dx

  data_iter[, `:=`(dx_predicted = dx,
                   disagree = runif(n_obs),
                   set_to_miss = runif(n_obs))]

  data_iter[
    disagree < 0.3, # prediction model is right 70% of the time
    dx_predicted := fcase(
      dx == "A", sample(dx_levels[-1], size = .N, replace = TRUE),
      dx == "B", sample(dx_levels[-2], size = .N, replace = TRUE),
      dx == "C", sample(dx_levels[-3], size = .N, replace = TRUE)
    )
  ]

  data_iter[
    # true dx is missing 60% of the time
    , dx := fifelse(set_to_miss < 0.6, NA, dx)
  ]

  data_iter[
    , `:=`(
      # noise variables - should get type 1 error of 5%
      x_null = rnorm(.N),
      z_null = rnorm(.N),
      w_null = rnorm(.N)
    )
  ]

  # a single imputation approach using predictions
  fit_pred <- multinom(dx_predicted ~ x_null + z_null + w_null,
                       data = data_iter)

  z <- summary(fit_pred)$coefficients/summary(fit_pred)$standard.errors
  p <- (1 - pnorm(abs(z), 0, 1)) * 2

  results_pred <- tibble(pval_B = p['B', -1L],
                         pval_C = p['C', -1L],
                         variable = colnames(p)[-1L],
                         method = 'impute_pred')

  # multiple imputation
  imp <- data_iter %>%
    select(dx, ends_with("null")) %>%
    mice(print = FALSE)

  fit_imp <- with(imp, multinom(dx ~ x_null + w_null + z_null,
                                model = TRUE))

  results_imp <- summary(pool(fit_imp)) %>%
    filter(term != '(Intercept)') %>%
    select(variable = term, y.level, p.value) %>%
    pivot_wider(names_from = y.level,
                names_prefix = 'pval_',
                values_from = p.value) %>%
    mutate(method = 'impute_mice')

  bind_rows(
    results_pred,
    results_imp
  )

}

set.seed(321)

results <- replicate(
  n = 100,
  expr = {simulation_run_reprex()},
  simplify = F
)

results %>%
  bind_rows() %>%
  group_by(method) %>%
  summarize(across(.cols = starts_with("pval"),
                   .fns = ~mean(.x < 0.05)))
#> # A tibble: 2 × 3
#>   method      pval_B pval_C
#>   <chr>        <dbl>  <dbl>
#> 1 impute_mice 0.147  0.127 
#> 2 impute_pred 0.0633 0.0733

^{Created on 2024-08-13 with reprex v2.0.2}

Here is my session info


#> R version 4.2.3 (2023-03-15 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_VIR.utf8  LC_CTYPE=English_VIR.utf8   
#> [3] LC_MONETARY=English_VIR.utf8 LC_NUMERIC=C                
#> [5] LC_TIME=English_VIR.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] mice_3.16.0       nnet_7.3-18       data.table_1.15.4 lubridate_1.9.3  
#>  [5] forcats_1.0.0     stringr_1.5.0     dplyr_1.1.3       purrr_1.0.2      
#>  [9] readr_2.1.4       tidyr_1.3.0       tibble_3.2.1      ggplot2_3.5.1    
#> [13] tidyverse_2.0.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] shape_1.4.6       tidyselect_1.2.0  xfun_0.40         splines_4.2.3    
#>  [5] lattice_0.20-45   colorspace_2.0-3  vctrs_0.6.4       generics_0.1.3   
#>  [9] htmltools_0.5.8.1 yaml_2.3.8        pan_1.6           survival_3.5-3   
#> [13] utf8_1.2.2        rlang_1.1.4       nloptr_2.0.3      jomo_2.7-3       
#> [17] pillar_1.9.0      glue_1.6.2        withr_2.5.0       foreach_1.5.2    
#> [21] lifecycle_1.0.3   munsell_0.5.0     gtable_0.3.0      codetools_0.2-19 
#> [25] evaluate_0.23     knitr_1.45        tzdb_0.4.0        fastmap_1.1.0    
#> [29] fansi_1.0.3       broom_1.0.5       Rcpp_1.0.9        scales_1.3.0     
#> [33] backports_1.4.1   lme4_1.1-30       fs_1.6.3          hms_1.1.3        
#> [37] digest_0.6.33     stringi_1.7.6     grid_4.2.3        cli_3.6.1        
#> [41] tools_4.2.3       magrittr_2.0.3    glmnet_4.0        pkgconfig_2.0.3  
#> [45] MASS_7.3-60       Matrix_1.5-3      reprex_2.0.2      timechange_0.2.0 
#> [49] minqa_1.2.4       rmarkdown_2.25    rstudioapi_0.15.0 iterators_1.0.14 
#> [53] rpart_4.1.19      boot_1.3-28.1     mitml_0.4-3       R6_2.5.1         
#> [57] nlme_3.1-162      compiler_4.2.3

^{Created on 2024-08-13 with reprex v2.0.2}

thomvolker commented 2 months ago

Correct me if I'm wrong, but your categories are sampled randomly with equal proportions without any relations to the features. Then, your "miss-predictions" are again randomly sampled from the other two categories, but since the categories are not in any way related to the predictors, the joint distribution of your data is not affected by this change. Hence, your "prediction-as-imputation" approach deals with a complete sample from a distribution equal to your population distribution, so it makes sense that this approach yields confidence valid inferences.

The question then is why mice yields confidence invalid results. To be honest, I don't know, and I cannot reproduce your example, as I got the following error

Error in fifelse(set_to_miss < 0.6, NA, dx) : 
  'yes' is of type logical but 'no' is of type integer. Please make sure that both arguments have the same type.

After changing the fifelse() call to ifelse(), I got the following results, which don't seem to indicate any problems (note that I could also not reproduce your final results table, but didn't bother to solve that, the below provides the same information).

suppressPackageStartupMessages({
  library(tidyverse)
  library(data.table)
  library(nnet)
  library(mice)
})

simulation_run_reprex <- function() {

  dx_levels <- c("A", "B", "C")
  n_obs <- 5000

  data_iter <- data.table(
    dx = factor(sample(x = dx_levels,
                       size = n_obs,
                       replace = TRUE))
  )

  # disagree column is used to make some of the predicted
  # classes disagree with the true classes.

  # set_to_miss column is used to ampute the true dx

  data_iter[, `:=`(dx_predicted = dx,
                   disagree = runif(n_obs),
                   set_to_miss = runif(n_obs))]

  data_iter[
    disagree < 0.3, # prediction model is right 70% of the time
    dx_predicted := fcase(
      dx == "A", sample(dx_levels[-1], size = .N, replace = TRUE),
      dx == "B", sample(dx_levels[-2], size = .N, replace = TRUE),
      dx == "C", sample(dx_levels[-3], size = .N, replace = TRUE)
    )
  ]

  data_iter[
    # true dx is missing 60% of the time
    , dx := ifelse(set_to_miss < 0.6, NA, dx)
  ]

  data_iter[
    , `:=`(
      # noise variables - should get type 1 error of 5%
      x_null = rnorm(.N),
      z_null = rnorm(.N),
      w_null = rnorm(.N)
    )
  ]

  # a single imputation approach using predictions
  fit_pred <- multinom(dx_predicted ~ x_null + z_null + w_null,
                       data = data_iter, trace = FALSE)

  z <- summary(fit_pred)$coefficients/summary(fit_pred)$standard.errors
  p <- (1 - pnorm(abs(z), 0, 1)) * 2

  results_pred <- tibble(pval_B = p['B', -1L],
                         pval_C = p['C', -1L],
                         variable = colnames(p)[-1L],
                         method = 'impute_pred')

  # multiple imputation
  imp <- data_iter %>%
    select(dx, ends_with("null")) %>%
    mice(print = FALSE)

  fit_imp <- with(imp, multinom(dx ~ x_null + w_null + z_null,
                                model = TRUE, trace = FALSE))

  results_imp <- summary(pool(fit_imp)) %>%
    filter(term != '(Intercept)') %>%
    select(variable = term, y.level, p.value) %>%
    pivot_wider(names_from = y.level,
                names_prefix = 'pval_',
                values_from = p.value) %>%
    mutate(method = 'impute_mice')

  bind_rows(
    results_pred,
    results_imp
  )

}

set.seed(321)

results <- replicate(
  n = 100,
  expr = {simulation_run_reprex()},
  simplify = F
)

results %>%
  bind_rows() %>%
  group_by(method) %>%
  summarize(across(.cols = starts_with("pval"),
                   .fns = ~mean(.x < 0.05)))
#> # A tibble: 2 × 5
#>   method      pval_B  pval_C  pval_2  pval_3
#>   <chr>        <dbl>   <dbl>   <dbl>   <dbl>
#> 1 impute_mice  NA    NA       0.0167  0.0567
#> 2 impute_pred   0.04  0.0433 NA      NA

results %>%
  bind_rows() %>%
  group_by(method, variable) %>%
  summarize(across(.cols = starts_with("pval"),
                   .fns = ~mean(.x < 0.05)))
#> `summarise()` has grouped output by 'method'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 6
#> # Groups:   method [2]
#>   method      variable pval_B pval_C pval_2 pval_3
#>   <chr>       <chr>     <dbl>  <dbl>  <dbl>  <dbl>
#> 1 impute_mice w_null    NA     NA      0.04   0.08
#> 2 impute_mice x_null    NA     NA      0      0.03
#> 3 impute_mice z_null    NA     NA      0.01   0.06
#> 4 impute_pred w_null     0.05   0.06  NA     NA   
#> 5 impute_pred x_null     0.05   0.02  NA     NA   
#> 6 impute_pred z_null     0.02   0.05  NA     NA

^{Created on 2024-08-14 with reprex v2.0.2}

bcjaeger commented 2 months ago

Thank you!

You're correct about the prediction model in the simulated example. In the real simulation, the scenario is more nuanced, but I'm not authorized to share the data used in the real simulation.

Sorry about fifelse() throwing an error on your side. After changing to ifelse() on my side, I see the dx column was coerced into an integer instead of a factor. I added one extra command in the reprex below to coerce dx back into a factor. I think this will make the result reproducible on your side. Can you confirm?

suppressPackageStartupMessages({
  library(tidyverse)
  library(data.table)
  library(nnet)
  library(mice)
})

simulation_run_reprex <- function() {

  dx_levels <- c("A", "B", "C")
  n_obs <- 5000

  data_iter <- data.table(
    dx = factor(sample(x = dx_levels,
                       size = n_obs,
                       replace = TRUE))
  )

  # disagree column is used to make some of the predicted
  # classes disagree with the true classes.

  # set_to_miss column is used to ampute the true dx

  data_iter[, `:=`(dx_predicted = dx,
                   disagree = runif(n_obs),
                   set_to_miss = runif(n_obs))]

  data_iter[
    disagree < 0.3, # prediction model is right 70% of the time
    dx_predicted := fcase(
      dx == "A", sample(dx_levels[-1], size = .N, replace = TRUE),
      dx == "B", sample(dx_levels[-2], size = .N, replace = TRUE),
      dx == "C", sample(dx_levels[-3], size = .N, replace = TRUE)
    )
  ]

  data_iter[
    # true dx is missing 40% of the time
    # (this coerces it into an integer)
    , dx := ifelse(set_to_miss < 0.6, NA, dx)
  ]

  data_iter[
    # convert dx back into a factor
    , dx := factor(dx, labels = dx_levels)
  ]

  data_iter[
    , `:=`(
      # noise variables - should get type 1 error of 5%
      x_null = rnorm(.N),
      z_null = rnorm(.N),
      w_null = rnorm(.N)
    )
  ]

  # a single imputation approach using predictions
  fit_pred <- multinom(dx_predicted ~ x_null + z_null + w_null,
                       data = data_iter)

  z <- summary(fit_pred)$coefficients/summary(fit_pred)$standard.errors
  p <- (1 - pnorm(abs(z), 0, 1)) * 2

  results_pred <- tibble(pval_B = p['B', -1L],
                         pval_C = p['C', -1L],
                         variable = colnames(p)[-1L],
                         method = 'impute_pred')

  # multiple imputation
  imp <- data_iter %>%
    select(dx, ends_with("null")) %>%
    mice(print = FALSE)

  fit_imp <- with(imp, multinom(dx ~ x_null + w_null + z_null,
                                model = TRUE))

  results_imp <- summary(pool(fit_imp)) %>%
    filter(term != '(Intercept)') %>%
    select(variable = term, y.level, p.value) %>%
    pivot_wider(names_from = y.level,
                names_prefix = 'pval_',
                values_from = p.value) %>%
    mutate(method = 'impute_mice')

  bind_rows(
    results_pred,
    results_imp
  )

}

set.seed(321)

results <- replicate(
  n = 100,
  expr = {simulation_run_reprex()},
  simplify = F
)

results %>%
  bind_rows() %>%
  group_by(method) %>%
  summarize(across(.cols = starts_with("pval"),
                   .fns = ~mean(.x < 0.05)))
#> # A tibble: 2 × 3
#>   method      pval_B pval_C
#>   <chr>        <dbl>  <dbl>
#> 1 impute_mice 0.147  0.127 
#> 2 impute_pred 0.0633 0.0733

^{Created on 2024-08-13 with reprex v2.0.2}

stefvanbuuren commented 2 months ago

In your code

# a single imputation approach using predictions
  fit_pred <- multinom(dx_predicted ~ x_null + z_null + w_null,
                       data = data_iter)

dx_predicted has no missing data, and no imputations are calculated.

I suggest changing your code so that multinom() predicts values that are used for single imputation. I expect that the analysis of that singly imputed dataset will show P-values that are too low.

thomvolker commented 2 months ago

No worries, @bcjaeger ! I can imagine that the irreproducibility has something to do with having different versions of packages, or something like that.

Now I can indeed reproduce your results, and see mice yields $p$-values that are too small on average. I'm not entirely sure why this happens. Also, if you change the imputation method to pmm, the results will be confidence valid, but when using cart or rf, you also get too many small $p$-values. I agree with @stefvanbuuren that the single imputation approach will probably also yield invalid $p$-values here, but this does not explain why mice also yields invalid $p$-values. The issue seemed to be that variance estimates are too small (since the parameter estimates appeared to be unbiased), but I don't know enough about polytomous regression to quickly identify why this issue occurred.

stefvanbuuren commented 2 months ago

@thomvolker I thought that 0.147 was the average P-value found for mice, so aren't these larger (not smaller) than 0.05?

thomvolker commented 2 months ago

@stefvanbuuren, the 0.147 is the type-1 error rate (the proportion of $p$-values smaller than 0.05, considering that the predictors are independent of the outcome).

stefvanbuuren commented 2 months ago

@thomvolker Ah thanks. So the confidence interval is too short, which shouldn't be the case.

The first explanation that springs to mind is that the neural network is overfitting the data. A simpler model with fewer parameters should be more robust against this.

I deleted the BUG label since this is a methodological issue rather than a programming bug.

thomvolker commented 2 months ago

That could be, but the model has only 8 parameters (two intercepts and 6 regression slopes) on 5000 observations, which does not seem problematic. I also checked whether there is some implicit regularization (e.g., due to the data augmentation approach) that reduced the between imputation variance, but this also did not seem to be the case. And finally, I checked whether the method = 'sample' approach, which should be in line with the population model, yields confidence valid results (but this was also not the case, although here the confidence intervals were too wide).

stefvanbuuren commented 2 months ago

By default, mice uses nnet.MaxNWts = 1500 to set nnet::nnet argument MaxNWts. Perhaps reducing it to its default 1000 could help? EDIT: This parameter is a safety valve, so changing it will probably have no impact on the statistical properties.

stefvanbuuren commented 2 months ago

Not really a software issues, so closing

bcjaeger commented 2 months ago

Thank you @stefvanbuuren and @thomvolker for the help!

amices / mice

Inflated type 1 error using `nnet` #661