bnowok / synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

NaN leads to error in remove.lindep.syn #1

Closed: rubenarslan closed this issue 7 years ago

rubenarslan commented 7 years ago

Hi. I've encountered a problem in remove.lindep.syn when my data contains NaN values. Maybe the package could warn about such values, as they're often produced inadvertently.
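
Until such a warning exists, a check along these lines could be run before calling syn(). This is only a base R sketch; the helper name nan_columns is hypothetical and not part of synthpop.

```r
# Hypothetical helper (not part of synthpop): names of numeric columns
# that contain at least one NaN.
nan_columns <- function(df) {
  has_nan <- vapply(
    df,
    function(col) is.numeric(col) && any(is.nan(col)),
    logical(1)
  )
  names(df)[has_nan]
}

# Example: introduce one NaN into a copy of iris and report it.
iris_nan <- iris
iris_nan[1, 1] <- NaN
bad <- nan_columns(iris_nan)
if (length(bad) > 0) {
  message("NaN values found in: ", paste(bad, collapse = ", "))
}
```

Note that is.na() is TRUE for both NA and NaN, so is.nan() is needed to tell the two apart.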

In my case, I could easily get rid of them by converting them to NA using diary = diary %>% mutate_all(funs(ifelse(!is.nan(.),., NA))).
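
The same conversion can be sketched in base R, without dplyr. The function name nan_to_na is an assumption made for illustration; only numeric columns are touched.

```r
# Hypothetical helper: replace NaN with NA in every numeric column.
nan_to_na <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(col) {
    col[is.nan(col)] <- NA  # is.nan() is FALSE for ordinary NA
    col
  })
  df
}

# Example: a NaN in the first cell becomes NA; all other cells are unchanged.
iris_nan <- iris
iris_nan[1, 1] <- NaN
iris_clean <- nan_to_na(iris_nan)
```

The cleaned data frame can then be passed to syn() in the same way as the NA examples below.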

data("iris")
library(synthpop)
#> Loading required package: lattice
#> Loading required package: MASS
#> Loading required package: nnet
#> Loading required package: ggplot2
iris2 = iris
iris[1, 1] = NA
x = syn(iris)
#> syn  variables
#> 1    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
iris2[1, 3] = NA
x = syn(iris2)
#> syn  variables
#> 1    Sepal.Length Sepal.Width Petal.Length Petal.Width Species

iris[1, 1] = NaN
syn(iris)
#> syn  variables
#> 1    Sepal.Length Sepal.Width
#> Warning in remove.lindep.syn(x, y, ...): All predictors are constant or
#> have too high correlation.
#> Error in cor(xobs[, keep, drop = FALSE], use = "all.obs"): 'x' is empty
iris2[1, 3] = NaN
syn(iris2)
#> syn  variables
#> 1    Sepal.Length Sepal.Width Petal.Length
#> Error in if (all(!keep)) warning("All predictors are constant or have too high correlation."): missing value where TRUE/FALSE needed

bnowok commented 7 years ago

If numeric variables use missing-data codes other than NA, you can specify them with the cont.na parameter of the syn() function. For your last synthesis, for example: syn(iris2, cont.na = list(Sepal.Length = NaN, Petal.Length = NaN)), or, if the NA code is also present, syn(iris2, cont.na = list(Sepal.Length = c(NA, NaN), Petal.Length = c(NA, NaN))).
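
For data with many columns, the cont.na list could be assembled programmatically rather than typed by hand. The following is a base R sketch under stated assumptions: the variable names has_nan and cont_na are illustrative, and the example data places NaN in both Sepal.Length and Petal.Length to match the call above.

```r
# Example data with NaN in two numeric columns (an assumption for this sketch).
iris2 <- iris
iris2[1, 1] <- NaN  # Sepal.Length
iris2[1, 3] <- NaN  # Petal.Length

# Numeric columns that contain at least one NaN.
has_nan <- vapply(
  iris2,
  function(col) is.numeric(col) && any(is.nan(col)),
  logical(1)
)

# One entry per affected column: NaN alone, or c(NA, NaN) when the
# column also contains ordinary NA values.
cont_na <- lapply(iris2[has_nan], function(col) {
  if (any(is.na(col) & !is.nan(col))) c(NA, NaN) else NaN
})
```

With synthpop loaded, the resulting list would be passed as syn(iris2, cont.na = cont_na), which for this example is the same as the first call suggested above.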

We might consider automatic treatment of NaN values at a later stage.