bnowok / synthpop

Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control
Feature request: restrict combinations of values in the synthetic data to combinations appearing in the real data/ #22

Open LotteVanUtrecht opened 2 years ago

LotteVanUtrecht commented 2 years ago

We are synthesizing a dataset with two related variables: "onderwijsstructuur" & "owsoort" (which in this case indicate information about a school and an individual student respectively). We would like the synthetic data to only include combinations of those two variables that are present in the real data. Part of the crosstable between variables is included below.

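[image] https://user-images.githubusercontent.com/45172124/173846615-638e0102-1277-4d72-83d0-9f0c69bb073c.png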
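For readers who cannot see the image, here is a rough base R sketch of the kind of crosstable meant here, plus a check for the property being requested. The toy rows and the helper name `only_observed_combinations` are invented for illustration and are not part of synthpop; only the variable names and the HAVO, MAVO, BRJ and VMBO labels come from this issue.

```r
# Toy stand-in for the real data; the rows are invented for illustration.
real <- data.frame(
  onderwijsstructuur = c("HAVO", "HAVO", "MAVO", "MAVO", "MAVO"),
  owsoort            = c("HAVO", "HAVO", "BRJ",  "VMBO", "VMBO"),
  stringsAsFactors   = TRUE
)

# Crosstable of the two variables; zero cells are combinations
# that never occur in the real data.
crosstab <- table(real$onderwijsstructuur, real$owsoort)
print(crosstab)

# Hypothetical helper: TRUE if every combination of `vars` found in `synth`
# also occurs somewhere in `real`.
only_observed_combinations <- function(real, synth, vars) {
  key <- function(d) {
    cols <- unname(lapply(d[vars], as.character))
    do.call(paste, c(cols, sep = "\r"))
  }
  all(key(synth) %in% key(real))
}

vars <- c("onderwijsstructuur", "owsoort")
ok  <- data.frame(onderwijsstructuur = c("MAVO", "HAVO"), owsoort = c("BRJ", "HAVO"))
bad <- data.frame(onderwijsstructuur = "HAVO", owsoort = "VMBO")

ok_result  <- only_observed_combinations(real, ok, vars)
bad_result <- only_observed_combinations(real, bad, vars)
```

A check like this could be run on the synthetic data afterwards, whichever synthesis method ends up being used.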

If you only look at the second row (the case where onderwijsstructuur == "HAVO"), this problem is easily solved. Just give syn() a rule and rvalue that look something like this:

    params[["rules"]] <- list(owsoort = 'onderwijsstructuur == "HAVO"')
    params[["rvalues"]] <- list(owsoort = "HAVO")

However, when we want to include the cases in the fourth row (where "onderwijsstructuur"=="MAVO"), we run into two problems:

  1. rvalues doesn't allow two rules for the same variable.
  2. More problematically for us, rvalues doesn't allow a non-deterministic restriction, e.g. setting owsoort %in% c("BRJ", "VMBO").

It's possible that a good alternative can already be constructed with the current features of the package and we just overlooked it. For some cases, synthesizing the two variables together with the 'catall' method is a good alternative. However, that will not work here, as "onderwijsstructuur" is already synthesized together with other variables and we feel that including "owsoort" in there would take too much personal information from single individuals.

Best, Lotte

gillian-raab commented 2 years ago

Can I suggest you try this? Put these two variables at the start of your synthesis. For these two variables, use the method "catall". Define the empty cells as structural zeros; see the catall documentation for how to do it. Then synthesise the rest of your variables as usual.
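As a rough sketch of those steps: the code below builds a structural-zero list from the empty cells of the crosstable, in the format I understand the syn.catall documentation to describe (a named list of lists, each named with the variable names joined by an underscore). The toy data and the helpers `structural_zeros` and `synthesise` are hypothetical, and the `catall.structzero` argument name and the "cart" method for the remaining variables are assumptions to check against the package documentation. The syn() call is only wrapped in a function and is not run here.

```r
# Toy stand-in for the real data; the rows are invented for illustration.
real <- data.frame(
  onderwijsstructuur = factor(c("HAVO", "HAVO", "MAVO", "MAVO", "MAVO")),
  owsoort            = factor(c("HAVO", "HAVO", "BRJ", "VMBO", "VMBO"))
)

# Hypothetical helper: one list entry per level of `row_var` that has empty
# cells, each entry named "<row_var>_<col_var>".
structural_zeros <- function(data, row_var, col_var) {
  tab <- table(data[[row_var]], data[[col_var]])
  out <- list()
  for (lev in rownames(tab)) {
    empty <- colnames(tab)[tab[lev, ] == 0]
    if (length(empty) > 0) {
      cell <- setNames(list(lev, empty), c(row_var, col_var))
      out <- c(out, setNames(list(cell), paste(row_var, col_var, sep = "_")))
    }
  }
  out
}

struct_zero <- structural_zeros(real, "onderwijsstructuur", "owsoort")
str(struct_zero)

# Assumed call shape, defined but not run here. It needs synthpop installed
# and a dataset whose first two columns are the two variables above.
synthesise <- function(data, struct_zero, other_method = "cart") {
  methods <- c("catall", "catall", rep(other_method, ncol(data) - 2))
  synthpop::syn(data, method = methods, catall.structzero = struct_zero)
}
```

Deriving the list from the crosstable, rather than typing it by hand, keeps it in step with the real data if the set of observed combinations changes.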

Good luck and let me know if this works. Gillian

Gillian M Raab

Emeritus Professor, Edinburgh Napier University

Part-time Research Fellow

Administrative Data Research Centre - Scotland

Edinburgh

+44 7748 678 551


gillian-raab commented 2 years ago

I've just read your email more carefully and I see that you have already thought about the catall option. Do you have to synthesise "onderwijsstructuur" first? Could it not come later? Best, Gillian
