Closed colinorourke closed 2 years ago
Thanks for your question.
You're weighting a pattern in such a way that - dependent on the seed, the function arguments and your data - there may be no room for the algorithm to induce missingness. Take e.g. the first pattern, where x
may be missing, but only based on z
. But z
is more or less constant as you've noted.
set.seed(123)
library(mice)
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following objects are masked from 'package:base':
#>
#> cbind, rbind
df <- data.frame(x = rnorm(100),
y = rnorm(100),
z = rep(0,100))
df$z[75] <- 1
freq = c(0.01, 0.495, 0.495)
weights = rbind(c(0,0,1), c(0,0,1), c(1,1,0))
mdf <- ampute(df,
freq = freq,
weights=weights)
# resulting pattern
mdf$patterns
#> x y z
#> 1 0 1 1
#> 2 1 0 1
#> 3 1 1 0
# resulting mechanism
mdf$mech
#> [1] "MAR"
mdf$type
#> [1] "RIGHT" "RIGHT" "RIGHT"
# customized weights
mdf$weights
#> x y z
#> 1 0 0 1
#> 2 0 0 1
#> 3 1 1 0
# pattern
md.pattern(mdf$amp)
#> x z y
#> 59 1 1 1 0
#> 25 1 1 0 1
#> 16 1 0 1 1
#> 0 16 25 41
Created on 2021-03-22 by the reprex package (v1.0.0)
Please note that in the above reprex
the issue is not replicated.
I expect that, given your sparse data, the chosen seed value, freq
vector and weights
matrix; there is indeed an edge case where the single value in z
is taken as the reference for a single to-be-generated missing value in x
. Different seeds yield expected results - even when z
is equally extremely underpopulated. Different weights
matrices and frequency
vectors too.
You're trying to fit a block through a triangular hole, but mathematically you've restricted the bucket of blocks to be just squares. This is in part due to chance (the seed value) and in part due to a very restrictive set of arguments.
More general
Choosing a MCAR mechanism for the first pattern would solve the problem. A missing data frequency of .01 in x
and the specified weight on a most unbalanced column z
would result in a spurious MAR mechanism anyways and the resulting missingness and impact on inferences would be indistinguishable from MCAR - in this case. Consider it MCAR with extra steps.
Conclusion I'd rather keep this behaviour than ironing it out as it makes the user painfully aware that the missingness generation would be plagued by the same limitations as the imputation.
You make a great point, but my worry would be that one asks for ampute to generate missingness with a certain level and pattern and it returns something that doesn't conform to that specification but doesn't issue a message or warning. There could be siuations (e.g., large simulation experiments) where this issue goes completely undetected and spurious results are produced. Depending on what's being done with the resulting incomplete data, the user may be painlessly unaware something's gone wrong!
Could reproduce with set.seed(1L)
, as suggested by the OP. The solution with set.seed(2L)
looks normal.
I believe that this is an edge case that is unlikely to surface in practice. For me, the question of interest would be whether there is a mathematical necessity that requires all x
's to be set to missing. If we cannot think of such a reason, we have an unsuspected result. It may be damaging or nor, but until we know this, we may wish to keep it open.
Firstly, I'd like to thank @gerkovink for demonstrating the reprex
package. Brilliant! Just thought I'd provide another example where I see the behaviour of ampute
being not what I expected. While also contrived, in some ways this second example seems to me less contrived.
library(mice)
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following objects are masked from 'package:base':
#>
#> cbind, rbind
library(magrittr)
set.seed(890L)
# Create dataset with 10 binary indicators
df <- lapply(
runif(10, 0.05, 0.15),
function(p, n) rbinom(n, 1, p),
n = 100L
) %>%
do.call(what = "data.frame") %>%
set_names(paste0("type", LETTERS[1:ncol(.)]))
# Top of original dataset
head(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1 0 0 0 0 0 0 0 0 0 0
#> 2 0 0 0 0 0 0 0 0 0 0
#> 3 0 0 0 1 0 0 0 0 0 0
#> 4 0 0 1 0 0 0 1 0 0 0
#> 5 0 0 0 0 0 0 0 0 1 0
#> 6 0 0 0 1 0 0 1 0 0 0
# Number of "hits" in each column
colSums(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 3 13 11 16 18 4 11 9 18 8
# Patterns to consider for missing data. Here we have all possible
# ways of creating 1 or 2 missing values
patterns <- do.call(
expand.grid,
lapply(df, function(...) c(0L, 1L))
) %>%
extract(rowSums(.) >= ncol(df) - 2L &
rowSums(.) < ncol(df), )
# Weight related to binomial probabilities
# such that less missing is favoured
pattern.freq <- rowSums(patterns) %>%
{
.75^. * .25^(ncol(df) - .)
} %>%
divide_by(sum(.))
# Run ampute with these settings
dfc <- ampute(df, patterns = patterns, freq = pattern.freq)
# Top of data from ampute
head(dfc$amp)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1 0 NA 0 NA NA NA NA NA NA NA
#> 2 0 NA 0 NA NA NA NA NA NA NA
#> 3 0 NA 0 NA NA NA NA NA NA NA
#> 4 0 NA 1 NA NA NA NA NA NA NA
#> 5 0 NA 0 NA NA NA NA NA NA NA
#> 6 0 NA NA NA NA NA NA NA NA NA
# Missing data patterns from ampute
md.pattern(dfc$amp)
#> typeA typeC typeB typeD typeE typeF typeG typeH typeI typeJ
#> 87 1 1 0 0 0 0 0 0 0 0 8
#> 8 1 0 0 0 0 0 0 0 0 0 9
#> 4 0 1 0 0 0 0 0 0 0 0 9
#> 1 0 0 0 0 0 0 0 0 0 0 10
#> 5 9 100 100 100 100 100 100 100 100 814
Created on 2021-03-24 by the reprex package (v1.0.0)
@RianneSchouten Could you help identify the source of this behaviour?
It seems to disappear when the data is sufficiently large:
library(mice)
library(magrittr)
set.seed(123)
# Create dataset with 10 binary indicators
df <- lapply(
runif(10, 0.05, 0.15),
function(p, n) rbinom(n, 1, p),
n = 500L
) %>%
do.call(what = "data.frame") %>%
set_names(paste0("type", LETTERS[1:ncol(.)]))
# Top of original dataset
head(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1 1 0 0 0 0 0 1 1 0 0
#> 2 0 0 0 1 0 0 0 0 1 0
#> 3 0 0 0 0 0 0 0 0 0 0
#> 4 0 0 0 0 0 0 0 0 0 0
#> 5 0 0 0 0 1 0 0 0 0 0
#> 6 0 0 0 0 0 0 0 0 1 0
# Number of "hits" in each column
colSums(df)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 39 61 40 69 72 19 52 83 41 40
# Patterns to consider for missing data. Here we have all possible
# ways of creating 1 or 2 missing values
patterns <- do.call(
expand.grid,
lapply(df, function(...) c(0L, 1L))
) %>%
extract(rowSums(.) >= ncol(df) - 2L &
rowSums(.) < ncol(df), )
# Weight related to binomial probabilities
# such that less missing is favoured
pattern.freq <- rowSums(patterns) %>%
{
.75^. * .25^(ncol(df) - .)
} %>%
divide_by(sum(.))
# Run ampute with these settings
dfc <- ampute(df, patterns = patterns, freq = pattern.freq)
# Top of data from ampute
head(dfc$amp)
#> typeA typeB typeC typeD typeE typeF typeG typeH typeI typeJ
#> 1 1 0 0 NA 0 0 1 1 0 0
#> 2 NA NA 0 1 0 0 0 0 1 0
#> 3 0 0 0 0 0 0 0 0 0 0
#> 4 0 0 0 0 0 0 0 0 0 0
#> 5 0 0 0 0 1 0 0 0 0 0
#> 6 0 0 0 0 0 0 0 0 1 0
# Missing data patterns from ampute
md.pattern(dfc$amp)
#> typeC typeF typeG typeA typeI typeH typeB typeJ typeD typeE
#> 271 1 1 1 1 1 1 1 1 1 1 0
#> 10 1 1 1 1 1 1 1 1 1 0 1
#> 11 1 1 1 1 1 1 1 1 0 1 1
#> 7 1 1 1 1 1 1 1 1 0 0 2
#> 13 1 1 1 1 1 1 1 0 1 1 1
#> 1 1 1 1 1 1 1 1 0 1 0 2
#> 3 1 1 1 1 1 1 1 0 0 1 2
#> 9 1 1 1 1 1 1 0 1 1 1 1
#> 4 1 1 1 1 1 1 0 1 1 0 2
#> 5 1 1 1 1 1 1 0 1 0 1 2
#> 3 1 1 1 1 1 1 0 0 1 1 2
#> 13 1 1 1 1 1 0 1 1 1 1 1
#> 2 1 1 1 1 1 0 1 1 1 0 2
#> 1 1 1 1 1 1 0 1 1 0 1 2
#> 3 1 1 1 1 1 0 1 0 1 1 2
#> 2 1 1 1 1 1 0 0 1 1 1 2
#> 10 1 1 1 1 0 1 1 1 1 1 1
#> 3 1 1 1 1 0 1 1 1 1 0 2
#> 4 1 1 1 1 0 1 1 1 0 1 2
#> 2 1 1 1 1 0 1 1 0 1 1 2
#> 3 1 1 1 1 0 1 0 1 1 1 2
#> 4 1 1 1 1 0 0 1 1 1 1 2
#> 6 1 1 1 0 1 1 1 1 1 1 1
#> 5 1 1 1 0 1 1 1 1 1 0 2
#> 3 1 1 1 0 1 1 1 1 0 1 2
#> 3 1 1 1 0 1 1 1 0 1 1 2
#> 6 1 1 1 0 1 1 0 1 1 1 2
#> 3 1 1 1 0 1 0 1 1 1 1 2
#> 9 1 1 0 1 1 1 1 1 1 1 1
#> 2 1 1 0 1 1 1 1 1 1 0 2
#> 3 1 1 0 1 1 1 1 1 0 1 2
#> 5 1 1 0 1 1 1 1 0 1 1 2
#> 3 1 1 0 1 1 1 0 1 1 1 2
#> 3 1 1 0 1 1 0 1 1 1 1 2
#> 1 1 1 0 1 0 1 1 1 1 1 2
#> 2 1 1 0 0 1 1 1 1 1 1 2
#> 6 1 0 1 1 1 1 1 1 1 1 1
#> 3 1 0 1 1 1 1 1 1 1 0 2
#> 2 1 0 1 1 1 1 1 1 0 1 2
#> 2 1 0 1 1 1 1 1 0 1 1 2
#> 2 1 0 1 1 1 1 0 1 1 1 2
#> 4 1 0 1 1 1 0 1 1 1 1 2
#> 5 1 0 1 1 0 1 1 1 1 1 2
#> 5 1 0 1 0 1 1 1 1 1 1 2
#> 1 1 0 0 1 1 1 1 1 1 1 2
#> 7 0 1 1 1 1 1 1 1 1 1 1
#> 4 0 1 1 1 1 1 1 1 1 0 2
#> 2 0 1 1 1 1 1 1 1 0 1 2
#> 4 0 1 1 1 1 1 1 0 1 1 2
#> 2 0 1 1 1 1 1 0 1 1 1 2
#> 3 0 1 1 1 1 0 1 1 1 1 2
#> 2 0 1 1 1 0 1 1 1 1 1 2
#> 1 0 1 1 0 1 1 1 1 1 1 2
#> 4 0 1 0 1 1 1 1 1 1 1 2
#> 3 0 0 1 1 1 1 1 1 1 1 2
#> 32 33 33 34 34 38 39 39 41 41 364
Created on 2021-03-26 by the reprex package (v1.0.0)
Dear all,
As far as I can see, the problems occur from the same issue: one or more patterns do not have enough data points assigned to them. In example 1, it happens because the first pattern has a very low frequency value and with some seeds, that will result in an empty pattern. In example 2, it happens because the number of patterns is large compared to the number of data points.
You can inspect the assignment of data points to patterns with dfc$cand
.
In addition, there is a problem with the variation of the weighted sum scores within one pattern. If there is little to no variation in the weighted sum scores, it is difficult to know which data points should be amputed. There is little variation if there is only 1 data point in the pattern, or if there are a few data points with similar weighted sum scores. The latter occurs more often if the dataset is binary.
You can inspect the weighted sum scores with dfc$scores
.
To be fair, the issue with empty and almost-empty patterns is actually a drawback of the amputation methodology, although the solution in ampute
could be improved upon. Currently, in case of little variation, ampute
assigns to every data point the missingness proportion as the amputation probability. This is done in ampute.continuous
on lines 65-68. The idea is that we still ensure that the required amount of data points is amputed (although it would become more MCAR than MAR). Obviously, as shown in example 2, with a large number of empty to almost-empty patterns, too many data points are amputed.
As a solution, we could add a warning (or a stop) when:
Let me know what you would like.
Kind regards, Rianne Schouten
Thanks @RianneSchouten. I'd prefer warnings in both scenarios. Warnings can be ignored if the user desires. Stops are so definitive.
I made three warnings (and two unit tests) that should let the user know that the dataset does not match with the desired number of patterns, or that the weighted sum scores do not have enough variation to generate MAR/MNAR missingness. I made a pull request for these warnings in #451.
Closing because it's now in master branch
There seems to be a bug in the ampute function that is encountered as sort of an edge case.
Let's say you have a variable which is an indicator taking on the values 0/1, and the value 1 is a bit on the rare side.
And for a particular missing data pattern occurring with lower frequency you give score weights depending only on the indicator value.
You have a good chance of drawing for that pattern a single score of 0. This gets interpreted by part of the code as a pattern that doesn't occur, which causes that entire variable to be set to missing when amputed.