kasra-hosseini opened this issue 4 years ago
Experiment 1
Dataset: `generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify`
Five synthetic datasets were generated, which differ only in `synthesis_methods`:

1: `"synthesis_methods": ["sample", "", "sample", "", "sample", "", "", "sample"]` ---> randomly sampled (all 4 columns)
2: `"synthesis_methods": ["cart", "", "cart", "", "", "", "", "cart"]`
3: `"synthesis_methods": ["cart", "", "", "", "", "", "", "cart"]`
4: `"synthesis_methods": ["cart", "", "", "", "", "", "", ""]`
5: `"synthesis_methods": ["", "", "", "", "", "", "", ""]` ---> copy of the original data
From 1 (randomly sampled) ---> 5 (copy of the original data), the synthetic dataset gradually becomes more similar to the original dataset.
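For reference, a minimal sketch pairing each `synthesis_methods` entry with a column name, assuming the column order matches the full column list quoted in Experiment 4 below; an empty string means the column is copied from the original data:

```python
# Sketch: pair each synthesis_methods entry with its column.
# The column order is an assumption, taken from the full column
# list quoted in Experiment 4 of this issue.
columns = ["Time in A&E (mins)", "Treatment", "Gender",
           "Index of Multiple Deprivation Decile", "Hospital ID",
           "Arrival Date", "Arrival hour range", "Age bracket"]
methods_dataset_1 = ["sample", "", "sample", "", "sample", "", "", "sample"]

for col, method in zip(columns, methods_dataset_1):
    # An empty string means the column is kept as in the original data.
    print(f"{col}: {method or 'copied from original'}")
```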
Results:
Utility: weighted F1 score; 100 means the F1 score of a model trained on the synthetic dataset is the same as that of a model trained on the original one.
Privacy: 100 * (1 - expected_match_risk/num_samples_intruder)
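These two definitions can be written as two small functions. A minimal sketch, assuming the utility score is the synthetic-to-original weighted-F1 ratio expressed as a percentage (the function names are illustrative, not the QUIPP-pipeline API):

```python
def utility_score(f1_synth: float, f1_orig: float) -> float:
    """Weighted-F1 ratio as a percentage: 100 means the model trained on
    synthetic data matches the model trained on the original data."""
    return 100.0 * f1_synth / f1_orig

def privacy_score(expected_match_risk: float, num_samples_intruder: int) -> float:
    """100 means the intruder's expected match risk is zero."""
    return 100.0 * (1.0 - expected_match_risk / num_samples_intruder)
```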
Experiment 2
The same as Experiment 1, except that Experiment 1 was repeated five times, changing (only) the random_state to the following values:

12345
23451
34512
45123
54321

Results:

In total, we have 25 synthetic datasets/points in this figure (5 experiments x 5 synthetic datasets per experiment; see the details above, in the Experiment 1 comment):
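A hypothetical sketch of this seed sweep as a config loop; the file names and the location of random_state inside the config are assumptions, not QUIPP's actual schema:

```python
# Hypothetical sketch: emit one copy of the Experiment 1 run configuration
# per random_state value. File names and the config key path are assumed.
import copy
import json

with open("hospital_ae_synthpop.json") as f:   # input file name assumed
    base_config = json.load(f)

for random_state in (12345, 23451, 34512, 45123, 54321):
    config = copy.deepcopy(base_config)
    config["parameters"]["random_state"] = random_state  # key path assumed
    with open(f"hospital_ae_rs{random_state}.json", "w") as out:
        json.dump(config, out, indent=2)
```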
Experiment 3
In the above experiments, the privacy (as measured by expected_match_risk) is very high: the values vary between ~95% and ~100%. One way to change this is to leak more columns to the intruder. Here, all parameters are kept as before, except for the columns leaked to the intruder, which were:
vars_intruder: ["Treatment", "Gender", "Age bracket"]
are now changed to:
vars_intruder: ["Treatment", "Gender", "Hospital ID", "Arrival Date", "Age bracket"]
Results:
As expected, utility does not change, but the values of the privacy metric now vary between ~10% and ~100%.
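For context, here is a hedged sketch of an expected-match-risk style calculation, which illustrates why leaking more columns to the intruder lowers the privacy score; this is one common formulation and not necessarily the exact one implemented in the pipeline:

```python
import pandas as pd

def expected_match_risk(released: pd.DataFrame,
                        intruder: pd.DataFrame,
                        vars_intruder: list) -> float:
    """For each intruder record, count released records that match on the
    leaked columns; a record with c matches contributes 1/c to the risk.
    Simplified: the true counterpart is assumed to be among the matches."""
    risk = 0.0
    for _, row in intruder[vars_intruder].iterrows():
        # Rows of `released` that agree with this record on all leaked columns.
        n_matches = int((released[vars_intruder] == row).all(axis=1).sum())
        if n_matches > 0:
            risk += 1.0 / n_matches  # rarer matches -> higher per-record risk
    return risk
```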
Experiment 4
By leaking all the columns:
"vars_intruder": ["Time in A&E (mins)","Treatment","Gender","Index of Multiple Deprivation Decile","Hospital ID","Arrival Date","Arrival hour range","Age bracket"]
We get privacy scores between 0 and 100 (in this dataset).
Results:
Experiment 5
Here, three CTGAN synthetic datasets are compared to the synthpop results of Experiment 4, using the same setup as that experiment. The three CTGAN datasets were generated by training the model for 100, 500, and 2000 epochs.
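A minimal sketch of how such datasets could be generated with the ctgan package; the input path and the choice of discrete columns are assumptions, not the exact pipeline invocation:

```python
# Sketch: one CTGAN synthetic dataset per epoch budget.
import pandas as pd
from ctgan import CTGAN

data = pd.read_csv("hospital_ae_data_deidentify.csv")  # path assumed
discrete_columns = ["Treatment", "Gender", "Hospital ID",
                    "Arrival hour range", "Age bracket"]  # choice assumed

for epochs in (100, 500, 2000):
    model = CTGAN(epochs=epochs)
    model.fit(data, discrete_columns)
    synthetic = model.sample(len(data))
    synthetic.to_csv(f"ctgan_synthetic_{epochs}epochs.csv", index=False)
```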
Results:
Based on the results in issue #76: to correctly measure the utility of CTGAN models, we need to train the model for ~20K epochs. This needs to be revisited later.
Experiment 6
Dataset: `datasets/polish_data_2011/polish_data_2011`, with the following columns:
1: "smoke",
2: "sex",
3: "age",
4: "edu",
5: "weight",
6: "height",
7: "bmi",
8: "sport",
9: "marital",
10: "region",
11: "wkabint",
12: "income",
13: "ls"
Other possible scenarios that we discussed (not used here; see the sketch after this list):
age ---> edu ---> height ---> weight ---> bmi ---> sex
age ---> height ---> weight ---> sex ---> marital ---> income ---> edu
age ---> marital ---> edu ---> income
weight ---> bmi ---> sex ---> height
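A small sketch encoding these orders as Python lists; in synthpop terms each corresponds to a visit sequence, and the dictionary keys here are illustrative rather than the pipeline's config schema:

```python
# Hypothetical encoding of the discussed synthesis orders. In synthpop
# terminology each list is a visit sequence; the key names are illustrative.
scenarios = {
    "scenario_1": ["age", "edu", "height", "weight", "bmi", "sex"],
    "scenario_2": ["age", "height", "weight", "sex", "marital", "income", "edu"],
    "scenario_3": ["age", "marital", "edu", "income"],
    "scenario_4": ["weight", "bmi", "sex", "height"],
}
```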
Six synthetic datasets were generated (for each random seed), which differ only in `synthesis_methods`.
From 1 (randomly sampled) ---> 6 (original dataset), the synthetic dataset gradually becomes more similar to the original dataset.
Results:
Experiment 7
Comparison between synthpop and CTGAN results for `datasets/polish_data_2011/polish_data_2011`:
The above figures look great - it would be good to capture them somewhere. Is there a notebook (for example) somewhere that includes these? Perhaps it could go alongside the credit card CTGAN notebook in examples?
To reproduce the figures from Experiments 6 and 7: https://github.com/alan-turing-institute/QUIPP-pipeline/tree/feature/119-reproducibility/examples/privacy_utility_tradeoff/exp_synthpop_ctgan_polish_data
QUIPP output files/dirs: https://github.com/alan-turing-institute/QUIPP-pipeline/tree/feature/119-reproducibility/examples/privacy_utility_tradeoff/exp_synthpop_ctgan_polish_data/outputs
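For anyone rebuilding the scatter plots from those outputs, a hedged sketch of a privacy-utility figure; the CSV file and column names are assumptions, not the pipeline's actual output schema:

```python
# Sketch: privacy-utility tradeoff scatter, one series per synthesis method.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_csv("outputs/privacy_utility_scores.csv")  # file name assumed
for method, group in scores.groupby("method"):  # e.g. "synthpop" vs "ctgan"
    plt.scatter(group["utility"], group["privacy"], label=method)

plt.xlabel("Utility (weighted F1, % of original)")
plt.ylabel("Privacy (100 * (1 - expected_match_risk / num_samples_intruder))")
plt.legend()
plt.savefig("privacy_utility_tradeoff.png", dpi=150)
```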
Generate different synthetic datasets using synthpop and CTGAN, and compare utility and privacy metrics.