bnowok / synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control

Wrong synthesis of dependent column? #7

Closed jchalvorsen closed 6 years ago

jchalvorsen commented 6 years ago

I have a large dataset that needs synthesising, in which one column depends on a column that comes later. We solved that ordering problem with a custom visit.sequence, but the dependency between the two variables now seems not to be respected. A minimal working example is below:

library(synthpop)
ods <- SD2011[,  c("depress", "smoke")]
# Force a certain level of depress in all smokers.
ods$depress[which(ods$smoke == "YES")] <- 0
# Convert to factor, since my use case involves two factor columns.
ods$depress <- as.factor(ods$depress)

# Synthesise the dataset
sds.default <- syn(ods, visit.sequence = c(2,1), method = c("cart", "sample"))

# This should return an empty vector if all smokers have depress = 0
which(sds.default$syn$smoke == "YES" & sds.default$syn$depress != 0)

# as it does for the original data:
which(ods$smoke == "YES" & ods$depress != 0)

Do you have any idea how this can be solved? If I build the same example from a column that is already a factor, it seems to work, as this example shows:

ods <- SD2011[,  c("marital", "smoke")]
ods$marital[which(ods$smoke == "YES")] <- "SINGLE"

sds.default <- syn(ods, visit.sequence = c(2,1), method = c("cart", "sample"))
which(sds.default$syn$smoke == "YES" & sds.default$syn$marital != "SINGLE" )

bnowok commented 6 years ago

If some values of a variable are determined explicitly by the values of other variables, the rules and the corresponding values can be specified with the rules and rvalues parameters of syn(). Both should be supplied as named lists. Is that your case? See below for an example.

ods <- SD2011[,  c("depress", "smoke")]
ods$depress[which(ods$smoke == "YES")] <- 0
ods$depress <- as.factor(ods$depress)

sds <- syn(ods, visit.sequence = c(2,1), 
           method = c("cart", "sample"),
           rules = list(depress = "smoke == 'YES'"),
           rvalues = list(depress = 0))

with(ods, table(depress, smoke))
with(sds$syn, table(depress, smoke))
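
Besides comparing the two tables by eye, the constraint could be checked programmatically. The following is only a sketch in base R: count_violations is a hypothetical helper, not part of synthpop, and the toy data frame stands in for sds$syn so that the snippet runs without the package.

```r
# Hypothetical helper (not part of synthpop): counts rows that break the
# rule "smoke == 'YES' implies depress == 0".
count_violations <- function(df) {
  # as.character() lets the comparison work for factor and numeric columns alike
  sum(df$smoke == "YES" & as.character(df$depress) != "0", na.rm = TRUE)
}

# Toy data standing in for sds$syn (assumption: same column names as above)
toy <- data.frame(
  smoke   = factor(c("YES", "NO", "YES", "NO")),
  depress = factor(c(0, 3, 2, 0))
)

count_violations(toy)  # 1: the third row is a smoker with depress = 2
```

On real output one would call count_violations(sds$syn); with rules and rvalues supplied as in the example above, the result should be 0.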