amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Possible bug in ampute function #317

Closed · colinorourke closed this issue 2 years ago

colinorourke commented 3 years ago

There seems to be a bug in the ampute function that surfaces as a sort of edge case.

Let's say you have an indicator variable taking the values 0/1, where the value 1 is fairly rare.

library(mice)

df = data.frame(x = rnorm(100), y = rnorm(100), z = rep(0,100))
df$z[75] = 1

And for a particular missing data pattern that occurs with low frequency, you give score weights that depend only on the indicator variable.

freq = c(0.01, 0.495, 0.495)
weights = rbind(c(0,0,1), c(0,0,1), c(1,1,0))

There is then a good chance that the only score drawn for that pattern is 0. Part of the code interprets this as a pattern that does not occur, which causes the entire variable to be set to missing when amputed.

set.seed(1L)

mdf = ampute(df, freq = freq, weights = weights)
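
Counting the missing values per column of the amputed data (just base R) shows the problem; with this seed the entire x column appears to come back missing, as described above:

# count missing values per column of the amputed data; in this edge case
# the whole x column is set to missing
colSums(is.na(mdf$amp))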
gerkovink commented 3 years ago

Thanks for your question.

You're weighting a pattern in such a way that, depending on the seed, the function arguments, and your data, there may be no room for the algorithm to induce missingness. Take e.g. the first pattern, where x may become missing, but only based on z. Yet z is more or less constant, as you've noted.

set.seed(123)
library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
df <- data.frame(x = rnorm(100), 
                 y = rnorm(100), 
                 z = rep(0,100))
df$z[75] <- 1

freq = c(0.01, 0.495, 0.495)
weights = rbind(c(0,0,1), c(0,0,1), c(1,1,0))

mdf <- ampute(df, 
              freq = freq, 
              weights=weights)

# resulting pattern
mdf$patterns
#>   x y z
#> 1 0 1 1
#> 2 1 0 1
#> 3 1 1 0

# resulting mechanism
mdf$mech
#> [1] "MAR"
mdf$type
#> [1] "RIGHT" "RIGHT" "RIGHT"

# customized weights
mdf$weights
#>   x y z
#> 1 0 0 1
#> 2 0 0 1
#> 3 1 1 0

# pattern
md.pattern(mdf$amp)

#>    x  z  y   
#> 59 1  1  1  0
#> 25 1  1  0  1
#> 16 1  0  1  1
#>    0 16 25 41

Created on 2021-03-22 by the reprex package (v1.0.0)

Please note that in the above reprex the issue is not replicated.

I expect that, given your sparse data, the chosen seed value, freq vector, and weights matrix, there is indeed an edge case where the single non-zero value in z is taken as the reference for the single to-be-generated missing value in x. Different seeds yield expected results, even when z is equally extremely underpopulated, and so do different weights matrices and frequency vectors.

You're trying to fit a block through a triangular hole, but mathematically you've restricted the bucket of blocks to be just squares. This is in part due to chance (the seed value) and in part due to a very restrictive set of arguments.

More generally, choosing an MCAR mechanism for the first pattern would solve the problem. A missing data frequency of .01 in x and the specified weight on a severely unbalanced column z would result in a spurious MAR mechanism anyway, and the resulting missingness and its impact on inferences would be indistinguishable from MCAR, at least in this case. Consider it MCAR with extra steps.
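
As a rough sketch of that alternative (to my knowledge the mech argument of ampute applies to all patterns at once, so this makes the whole amputation MCAR rather than only the first pattern):

# same frequencies, but the missingness is drawn completely at random,
# so the near-constant z no longer drives which rows get amputed
mdf_mcar <- ampute(df, freq = freq, mech = "MCAR")
md.pattern(mdf_mcar$amp)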

In conclusion, I'd rather keep this behaviour than iron it out, as it makes the user painfully aware that the missingness generation is plagued by the same limitations as the imputation.

colinorourke commented 3 years ago

You make a great point, but my worry would be that one asks ampute to generate missingness with a certain level and pattern, and it returns something that doesn't conform to that specification without issuing a message or warning. There could be situations (e.g., large simulation experiments) where this issue goes completely undetected and spurious results are produced. Depending on what's being done with the resulting incomplete data, the user may be painlessly unaware that something's gone wrong!

stefvanbuuren commented 3 years ago

I could reproduce this with set.seed(1L), as suggested by the OP. The result with set.seed(2L) looks normal.
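
For instance (a quick sketch reusing df, freq and weights from the first post):

set.seed(2L)
mdf2 <- ampute(df, freq = freq, weights = weights)
# with this seed, x should no longer be entirely missing
colSums(is.na(mdf2$amp))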

I believe that this is an edge case that is unlikely to surface in practice. For me, the question of interest would be whether there is a mathematical necessity that requires all x's to be set to missing. If we cannot think of such a reason, we have an unexpected result. It may be damaging or not, but until we know this, we may wish to keep the issue open.

colinorourke commented 3 years ago

Firstly, I'd like to thank @gerkovink for demonstrating the reprex package. Brilliant! Just thought I'd provide another example where the behaviour of ampute is not what I expected. While still contrived, this second example seems to me less so in some ways.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(magrittr)

set.seed(890L)

# Create dataset with 10 binary indicators
df <- lapply(
  runif(10, 0.05, 0.15),
  function(p, n) rbinom(n, 1, p),
  n = 100L
) %>%
  do.call(what = "data.frame") %>%
  set_names(paste0("type", LETTERS[1:ncol(.)]))

# Top of original dataset
head(df)
#>   typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1     0     0     0     0     0     0     0     0     0     0
#> 2     0     0     0     0     0     0     0     0     0     0
#> 3     0     0     0     1     0     0     0     0     0     0
#> 4     0     0     1     0     0     0     1     0     0     0
#> 5     0     0     0     0     0     0     0     0     1     0
#> 6     0     0     0     1     0     0     1     0     0     0

# Number of "hits" in each column
colSums(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ 
#>     3    13    11    16    18     4    11     9    18     8

# Patterns to consider for missing data. Here we have all possible
# ways of creating 1 or 2 missing values
patterns <- do.call(
  expand.grid,
  lapply(df, function(...) c(0L, 1L))
) %>%
  extract(rowSums(.) >= ncol(df) - 2L &
    rowSums(.) < ncol(df), )

# Weight related to binomial probabilities
# such that less missing is favoured
pattern.freq <- rowSums(patterns) %>%
  {
    .75^. * .25^(ncol(df) - .)
  } %>%
  divide_by(sum(.))

# Run ampute with these settings
dfc <- ampute(df, patterns = patterns, freq = pattern.freq)

# Top of data from ampute
head(dfc$amp)
#>   typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1     0    NA     0    NA    NA    NA    NA    NA    NA    NA
#> 2     0    NA     0    NA    NA    NA    NA    NA    NA    NA
#> 3     0    NA     0    NA    NA    NA    NA    NA    NA    NA
#> 4     0    NA     1    NA    NA    NA    NA    NA    NA    NA
#> 5     0    NA     0    NA    NA    NA    NA    NA    NA    NA
#> 6     0    NA    NA    NA    NA    NA    NA    NA    NA    NA

# Missing data patterns from ampute
md.pattern(dfc$amp)

#>    typeA typeC typeB typeD typeE typeF typeG typeH typeI typeJ    
#> 87     1     1     0     0     0     0     0     0     0     0   8
#> 8      1     0     0     0     0     0     0     0     0     0   9
#> 4      0     1     0     0     0     0     0     0     0     0   9
#> 1      0     0     0     0     0     0     0     0     0     0  10
#>        5     9   100   100   100   100   100   100   100   100 814

Created on 2021-03-24 by the reprex package (v1.0.0)

gerkovink commented 3 years ago

@RianneSchouten Could you help identify the source of this behaviour?

It seems to disappear when the data is sufficiently large:

library(mice)
library(magrittr)
set.seed(123)

# Create dataset with 10 binary indicators
df <- lapply(
  runif(10, 0.05, 0.15),
  function(p, n) rbinom(n, 1, p),
  n = 500L
) %>%
  do.call(what = "data.frame") %>%
  set_names(paste0("type", LETTERS[1:ncol(.)]))

# Top of original dataset
head(df)
#>   typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1     1     0     0     0     0     0     1     1     0     0
#> 2     0     0     0     1     0     0     0     0     1     0
#> 3     0     0     0     0     0     0     0     0     0     0
#> 4     0     0     0     0     0     0     0     0     0     0
#> 5     0     0     0     0     1     0     0     0     0     0
#> 6     0     0     0     0     0     0     0     0     1     0

# Number of "hits" in each column
colSums(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ 
#>    39    61    40    69    72    19    52    83    41    40

# Patterns to consider for missing data. Here we have all possible
# ways of creating 1 or 2 missing values
patterns <- do.call(
  expand.grid,
  lapply(df, function(...) c(0L, 1L))
) %>%
  extract(rowSums(.) >= ncol(df) - 2L &
            rowSums(.) < ncol(df), )

# Weight related to binomial probabilities
# such that less missing is favoured
pattern.freq <- rowSums(patterns) %>%
  {
    .75^. * .25^(ncol(df) - .)
  } %>%
  divide_by(sum(.))

# Run ampute with these settings
dfc <- ampute(df, patterns = patterns, freq = pattern.freq)

# Top of data from ampute
head(dfc$amp)
#>   typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1     1     0     0    NA     0     0     1     1     0     0
#> 2    NA    NA     0     1     0     0     0     0     1     0
#> 3     0     0     0     0     0     0     0     0     0     0
#> 4     0     0     0     0     0     0     0     0     0     0
#> 5     0     0     0     0     1     0     0     0     0     0
#> 6     0     0     0     0     0     0     0     0     1     0

# Missing data patterns from ampute
md.pattern(dfc$amp)

#>     typeC typeF typeG typeA typeI typeH typeB typeJ typeD typeE    
#> 271     1     1     1     1     1     1     1     1     1     1   0
#> 10      1     1     1     1     1     1     1     1     1     0   1
#> 11      1     1     1     1     1     1     1     1     0     1   1
#> 7       1     1     1     1     1     1     1     1     0     0   2
#> 13      1     1     1     1     1     1     1     0     1     1   1
#> 1       1     1     1     1     1     1     1     0     1     0   2
#> 3       1     1     1     1     1     1     1     0     0     1   2
#> 9       1     1     1     1     1     1     0     1     1     1   1
#> 4       1     1     1     1     1     1     0     1     1     0   2
#> 5       1     1     1     1     1     1     0     1     0     1   2
#> 3       1     1     1     1     1     1     0     0     1     1   2
#> 13      1     1     1     1     1     0     1     1     1     1   1
#> 2       1     1     1     1     1     0     1     1     1     0   2
#> 1       1     1     1     1     1     0     1     1     0     1   2
#> 3       1     1     1     1     1     0     1     0     1     1   2
#> 2       1     1     1     1     1     0     0     1     1     1   2
#> 10      1     1     1     1     0     1     1     1     1     1   1
#> 3       1     1     1     1     0     1     1     1     1     0   2
#> 4       1     1     1     1     0     1     1     1     0     1   2
#> 2       1     1     1     1     0     1     1     0     1     1   2
#> 3       1     1     1     1     0     1     0     1     1     1   2
#> 4       1     1     1     1     0     0     1     1     1     1   2
#> 6       1     1     1     0     1     1     1     1     1     1   1
#> 5       1     1     1     0     1     1     1     1     1     0   2
#> 3       1     1     1     0     1     1     1     1     0     1   2
#> 3       1     1     1     0     1     1     1     0     1     1   2
#> 6       1     1     1     0     1     1     0     1     1     1   2
#> 3       1     1     1     0     1     0     1     1     1     1   2
#> 9       1     1     0     1     1     1     1     1     1     1   1
#> 2       1     1     0     1     1     1     1     1     1     0   2
#> 3       1     1     0     1     1     1     1     1     0     1   2
#> 5       1     1     0     1     1     1     1     0     1     1   2
#> 3       1     1     0     1     1     1     0     1     1     1   2
#> 3       1     1     0     1     1     0     1     1     1     1   2
#> 1       1     1     0     1     0     1     1     1     1     1   2
#> 2       1     1     0     0     1     1     1     1     1     1   2
#> 6       1     0     1     1     1     1     1     1     1     1   1
#> 3       1     0     1     1     1     1     1     1     1     0   2
#> 2       1     0     1     1     1     1     1     1     0     1   2
#> 2       1     0     1     1     1     1     1     0     1     1   2
#> 2       1     0     1     1     1     1     0     1     1     1   2
#> 4       1     0     1     1     1     0     1     1     1     1   2
#> 5       1     0     1     1     0     1     1     1     1     1   2
#> 5       1     0     1     0     1     1     1     1     1     1   2
#> 1       1     0     0     1     1     1     1     1     1     1   2
#> 7       0     1     1     1     1     1     1     1     1     1   1
#> 4       0     1     1     1     1     1     1     1     1     0   2
#> 2       0     1     1     1     1     1     1     1     0     1   2
#> 4       0     1     1     1     1     1     1     0     1     1   2
#> 2       0     1     1     1     1     1     0     1     1     1   2
#> 3       0     1     1     1     1     0     1     1     1     1   2
#> 2       0     1     1     1     0     1     1     1     1     1   2
#> 1       0     1     1     0     1     1     1     1     1     1   2
#> 4       0     1     0     1     1     1     1     1     1     1   2
#> 3       0     0     1     1     1     1     1     1     1     1   2
#>        32    33    33    34    34    38    39    39    41    41 364

Created on 2021-03-26 by the reprex package (v1.0.0)

RianneSchouten commented 3 years ago

Dear all,

As far as I can see, both problems stem from the same issue: one or more patterns do not have enough data points assigned to them. In example 1, it happens because the first pattern has a very low frequency value, and with some seeds that results in an empty pattern. In example 2, it happens because the number of patterns is large compared to the number of data points.

You can inspect the assignment of data points to patterns with dfc$cand.

In addition, there is a problem with the variation of the weighted sum scores within one pattern. If there is little to no variation in the weighted sum scores, it is difficult to know which data points should be amputed. There is little variation if there is only 1 data point in the pattern, or if there are a few data points with similar weighted sum scores. The latter occurs more often if the dataset is binary.

You can inspect the weighted sum scores with dfc$scores.
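
For example (a sketch, assuming cand is the vector of assigned pattern numbers and scores a list with one vector of weighted sum scores per pattern):

# how many data points were assigned to each candidate pattern
table(dfc$cand)

# spread of the weighted sum scores within each pattern; (near-)zero
# variance signals the problematic situation described above
sapply(dfc$scores, function(s) var(as.numeric(s)))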

To be fair, the issue with empty and almost-empty patterns is actually a drawback of the amputation methodology itself, although the solution in ampute could be improved upon. Currently, in case of little variation, ampute assigns to every data point the missingness proportion as the amputation probability. This is done in ampute.continuous on lines 65-68. The idea is that we still ensure that the required number of data points is amputed (although the result becomes more MCAR than MAR). Obviously, as shown in example 2, with a large number of empty or almost-empty patterns, too many data points are amputed.
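
A rough illustration of that fallback (a sketch of the idea, not the actual code in ampute.continuous):

# when the weighted sum scores in a pattern show (almost) no variation,
# every candidate receives the target missingness proportion as its
# amputation probability, which is effectively MCAR for that pattern
fallback_probs <- function(scores, prop) {
  if (length(unique(scores)) <= 1) {
    rep(prop, length(scores))
  } else {
    NULL  # with real variation, the score-based (MAR/MNAR) probabilities apply
  }
}
fallback_probs(scores = rep(0, 4), prop = 0.5)  # 0.5 0.5 0.5 0.5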

As a solution, we could add a warning (or a stop) when either of the following holds (a sketch of such checks follows the list):

  1. There exists an empty pattern.
  2. There exist patterns with fewer data points than a certain threshold (possibly based on the missingness proportion).
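
A hypothetical sketch of such checks (not the actual implementation; the threshold below is made up for illustration):

# warn when a pattern receives no candidates, or only very few of them
check_pattern_counts <- function(cand, n_patterns, min_per_pattern = 10L) {
  counts <- tabulate(cand, nbins = n_patterns)
  if (any(counts == 0L)) {
    warning("One or more missing data patterns have no candidates assigned.")
  }
  if (any(counts > 0L & counts < min_per_pattern)) {
    warning("Some patterns have very few candidates; the generated ",
            "missingness may deviate from the requested specification.")
  }
  invisible(counts)
}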

Let me know what you would like.

Kind regards, Rianne Schouten

gerkovink commented 3 years ago

Thanks @RianneSchouten. I'd prefer warnings in both scenarios. Warnings can be ignored if the user desires. Stops are so definitive.

RianneSchouten commented 2 years ago

I made three warnings (and two unit tests) that should let the user know that the dataset does not match the desired number of patterns, or that the weighted sum scores do not have enough variation to generate MAR/MNAR missingness. I opened a pull request for these warnings in #451.

stefvanbuuren commented 2 years ago

Closing because it's now in the master branch.