kasra-hosseini opened this issue 4 years ago
Experiment 1
Dataset: `generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify`
Five synthetic datasets were generated, which differ only in `synthesis_methods`:

1: `"synthesis_methods": ["sample", "", "sample", "", "sample", "", "", "sample"]` ---> randomly sampled (all 4 columns)
2: `"synthesis_methods": ["cart", "", "cart", "", "", "", "", "cart"]`
3: `"synthesis_methods": ["cart", "", "", "", "", "", "", "cart"]`
4: `"synthesis_methods": ["cart", "", "", "", "", "", "", ""]`
5: `"synthesis_methods": ["", "", "", "", "", "", "", ""]` ---> copy of the original data
From 1 (randomly sampled) ---> 5 (copy of the original data), the synthetic dataset gradually becomes more similar to the original dataset.
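For reference, a minimal sketch pairing each `synthesis_methods` entry with a column name, assuming the column order matches the full column list quoted in Experiment 4 below; an empty string means the column is copied from the original data:

```python
# Sketch: pair each synthesis_methods entry with its column.
# The column order is an assumption, taken from the full column
# list quoted in Experiment 4 of this issue.
columns = ["Time in A&E (mins)", "Treatment", "Gender",
           "Index of Multiple Deprivation Decile", "Hospital ID",
           "Arrival Date", "Arrival hour range", "Age bracket"]
methods_dataset_1 = ["sample", "", "sample", "", "sample", "", "", "sample"]

for col, method in zip(columns, methods_dataset_1):
    # An empty string means the column is kept as in the original data.
    print(f"{col}: {method or 'copied from original'}")
```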
Results:
Utility: weighted F1 score; 100 means the F1 score of a model trained on the synthetic dataset is the same as that of a model trained on the original one.
Privacy: 100 * (1 - expected_match_risk/num_samples_intruder)
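These two definitions can be written as two small functions. A minimal sketch, assuming the utility score is the synthetic-to-original weighted-F1 ratio expressed as a percentage (the function names are illustrative, not the QUIPP-pipeline API):

```python
def utility_score(f1_synth: float, f1_orig: float) -> float:
    """Weighted-F1 ratio as a percentage: 100 means the model trained on
    synthetic data matches the model trained on the original data."""
    return 100.0 * f1_synth / f1_orig

def privacy_score(expected_match_risk: float, num_samples_intruder: int) -> float:
    """100 means the intruder's expected match risk is zero."""
    return 100.0 * (1.0 - expected_match_risk / num_samples_intruder)
```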
Experiment 2
The same as Experiment 1, except that Experiment 1 was repeated five times, changing (only) the random_state to the following values:

12345
23451
34512
45123
54321

Results:

In total, we have 25 synthetic datasets/points in this figure (5 experiments x 5 synthetic datasets per experiment; see the details above, in the Experiment 1 comment):
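A hypothetical sketch of this seed sweep as a config loop; the file names and the location of random_state inside the config are assumptions, not QUIPP's actual schema:

```python
# Hypothetical sketch: emit one copy of the Experiment 1 run configuration
# per random_state value. File names and the config key path are assumed.
import copy
import json

with open("hospital_ae_synthpop.json") as f:   # input file name assumed
    base_config = json.load(f)

for random_state in (12345, 23451, 34512, 45123, 54321):
    config = copy.deepcopy(base_config)
    config["parameters"]["random_state"] = random_state  # key path assumed
    with open(f"hospital_ae_rs{random_state}.json", "w") as out:
        json.dump(config, out, indent=2)
```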
Experiment 3
In the above experiments, the privacy (as measured by expected_match_risk) is very high: the values vary between ~95% and ~100%. One way to change this is to leak more columns to the intruder. Here, all parameters are kept as before, except for the columns leaked to the intruder, which were:
vars_intruder: ["Treatment", "Gender", "Age bracket"]
are now changed to:
vars_intruder: ["Treatment", "Gender", "Hospital ID", "Arrival Date", "Age bracket"]
Results:
As expected, utility does not change, but the values of the privacy metric now vary between ~10% and ~100%.
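For context, here is a hedged sketch of an expected-match-risk style calculation, which illustrates why leaking more columns to the intruder lowers the privacy score; this is one common formulation and not necessarily the exact one implemented in the pipeline:

```python
import pandas as pd

def expected_match_risk(released: pd.DataFrame,
                        intruder: pd.DataFrame,
                        vars_intruder: list) -> float:
    """For each intruder record, count released records that match on the
    leaked columns; a record with c matches contributes 1/c to the risk.
    Simplified: the true counterpart is assumed to be among the matches."""
    risk = 0.0
    for _, row in intruder[vars_intruder].iterrows():
        # Rows of `released` that agree with this record on all leaked columns.
        n_matches = int((released[vars_intruder] == row).all(axis=1).sum())
        if n_matches > 0:
            risk += 1.0 / n_matches  # rarer matches -> higher per-record risk
    return risk
```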
Experiment 4
By leaking all the columns:
"vars_intruder": ["Time in A&E (mins)","Treatment","Gender","Index of Multiple Deprivation Decile","Hospital ID","Arrival Date","Arrival hour range","Age bracket"]
We get privacy scores between 0 and 100 (in this dataset).
Results:
Experiment 5
Here, three CTGAN synthetic datasets are compared to the synthpop results of Experiment 4, using the same setup as that experiment. The three CTGAN datasets were generated by training the model for 100, 500, and 2000 epochs.
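A minimal sketch of how such datasets could be generated with the ctgan package; the input path and the choice of discrete columns are assumptions, not the exact pipeline invocation:

```python
# Sketch: one CTGAN synthetic dataset per epoch budget.
import pandas as pd
from ctgan import CTGAN

data = pd.read_csv("hospital_ae_data_deidentify.csv")  # path assumed
discrete_columns = ["Treatment", "Gender", "Hospital ID",
                    "Arrival hour range", "Age bracket"]  # choice assumed

for epochs in (100, 500, 2000):
    model = CTGAN(epochs=epochs)
    model.fit(data, discrete_columns)
    synthetic = model.sample(len(data))
    synthetic.to_csv(f"ctgan_synthetic_{epochs}epochs.csv", index=False)
```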
Results:
Based on the results in issue #76: to correctly measure the utility of CTGAN models, we need to train the model for ~20K epochs. This needs to be revisited later.
Experiment 6
Dataset: `datasets/polish_data_2011/polish_data_2011`, with the following columns:
1: "smoke",
2: "sex",
3: "age",
4: "edu",
5: "weight",
6: "height",
7: "bmi",
8: "sport",
9: "marital",
10: "region",
11: "wkabint",
12: "income",
13: "ls"
Other possible scenarios that we discussed (not used here; see the sketch after this list):
age ---> edu ---> height ---> weight ---> bmi ---> sex
age ---> height ---> weight ---> sex ---> marital ---> income ---> edu
age ---> marital ---> edu ---> income
weight ---> bmi ---> sex ---> height
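A small sketch encoding these orders as Python lists; in synthpop terms each corresponds to a visit sequence, and the dictionary keys here are illustrative rather than the pipeline's config schema:

```python
# Hypothetical encoding of the discussed synthesis orders. In synthpop
# terminology each list is a visit sequence; the key names are illustrative.
scenarios = {
    "scenario_1": ["age", "edu", "height", "weight", "bmi", "sex"],
    "scenario_2": ["age", "height", "weight", "sex", "marital", "income", "edu"],
    "scenario_3": ["age", "marital", "edu", "income"],
    "scenario_4": ["weight", "bmi", "sex", "height"],
}
```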
Six synthetic datasets were generated (for each random seed), which differ only in `synthesis_methods`.
From 1 (randomly sampled) ---> 6 (original dataset), the synthetic dataset gradually becomes more similar to the original dataset.
Results:
Experiment 7
Comparison between synthpop and CTGAN results for `datasets/polish_data_2011/polish_data_2011`:
The above figures look great - it would be good to capture them somewhere. Is there a notebook (for example) somewhere that includes these? Perhaps it could go alongside the credit card CTGAN notebook in examples?
To reproduce the figures from Experiments 6 and 7: https://github.com/alan-turing-institute/QUIPP-pipeline/tree/feature/119-reproducibility/examples/privacy_utility_tradeoff/exp_synthpop_ctgan_polish_data
QUIPP output files/dirs: https://github.com/alan-turing-institute/QUIPP-pipeline/tree/feature/119-reproducibility/examples/privacy_utility_tradeoff/exp_synthpop_ctgan_polish_data/outputs
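For anyone rebuilding the scatter plots from those outputs, a hedged sketch of a privacy-utility figure; the CSV file and column names are assumptions, not the pipeline's actual output schema:

```python
# Sketch: privacy-utility tradeoff scatter, one series per synthesis method.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_csv("outputs/privacy_utility_scores.csv")  # file name assumed
for method, group in scores.groupby("method"):  # e.g. "synthpop" vs "ctgan"
    plt.scatter(group["utility"], group["privacy"], label=method)

plt.xlabel("Utility (weighted F1, % of original)")
plt.ylabel("Privacy (100 * (1 - expected_match_risk / num_samples_intruder))")
plt.legend()
plt.savefig("privacy_utility_tradeoff.png", dpi=150)
```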
Generate different synthetic datasets using synthpop and CTGAN, and compare utility and privacy metrics.