donboyd5 / synpuf

Synthetic PUF
MIT License

Add disclosure analysis notebooks #31

Closed MaxGhenis closed 5 years ago

MaxGhenis commented 5 years ago

I previously pushed these notebooks to master; this PR updates them and describes them.

This includes two notebooks:

  1. generate_disclosure_risk.ipynb calls synthimpute functions to generate disclosure risk datasets for synpuf7 and synthpop_samp. These are keyed on the synthetic ID, with distances and IDs for the nearest records in the train and test sets (a sketch of this computation follows the list).
  2. disclosure_risk.ipynb loads those disclosure risk datasets and evaluates the distribution and relationships of train_dist and test_dist for both synpuf7 and synthpop_samp: scatterplots of train_dist vs. test_dist, CDFs, and shares of exact matches to the train and test sets.
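
For context, here is a minimal sketch of the nearest-record distance idea, written with scikit-learn rather than the synthimpute functions the notebook actually calls. The helper name `nearest_record_distances` and the file-loading step are illustrative placeholders, not the package's API; `train_dist`, `test_dist`, and `synth_id` are the fields named above.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def nearest_record_distances(synth: pd.DataFrame, real: pd.DataFrame) -> pd.DataFrame:
    """For each synthetic record, find the closest real record.

    Returns a frame keyed on the synthetic index, with the distance to
    and index of the nearest real record.  Assumes both frames are
    numeric and share the same columns.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(real.values)
    dist, idx = nn.kneighbors(synth.values)
    return pd.DataFrame({
        "synth_id": synth.index,
        "dist": dist[:, 0],
        "nearest_id": real.index[idx[:, 0]],
    })


# Hypothetical usage: build train_dist and test_dist for one synthetic file.
# synth, train, test would be numeric DataFrames with identical columns.
# train_risk = nearest_record_distances(synth, train).rename(
#     columns={"dist": "train_dist", "nearest_id": "train_id"})
# test_risk = nearest_record_distances(synth, test).rename(
#     columns={"dist": "test_dist", "nearest_id": "test_id"})
# risk = train_risk.merge(test_risk, on="synth_id")
```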

My main takeaway is that, at least at this dataset size and with these specifications, synthpop outperforms random forests on these disclosure metrics.
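
To make that comparison concrete, one way to summarize the train_dist/test_dist relationship is the share of exact matches and the share of records closer to train than to test. This is only a sketch, not code from the notebook, and it assumes the `risk` frame from the sketch above:

```python
# Share of synthetic records exactly matching a train or test record,
# and share sitting closer to a train record than to a test record.
exact_train = (risk["train_dist"] == 0).mean()
exact_test = (risk["test_dist"] == 0).mean()
closer_to_train = (risk["train_dist"] < risk["test_dist"]).mean()

print(f"Exact train matches: {exact_train:.1%}")
print(f"Exact test matches:  {exact_test:.1%}")
print(f"Closer to train than test: {closer_to_train:.1%}")
```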

NB: Cells 4-6 of disclosure_risk.ipynb show synth_id values and how close each is to a training record. This is for informational purposes, but it could itself be a disclosure risk if we were to publish a synthetic file. Since those IDs and records will differ in the final file, I think it's OK to keep them here for now, but a final analysis should omit them.

I've also added synpuf_disclosure_risk_with_record_exam.ipynb to the restricted Drive folder (the link is only accessible to the team). This notebook shows four example synthetic records compared to their nearest train and test records, for both synthpop and RF:

  1. A "typical" record with values of all relevant distance metrics between the 40th and 60th percentile. These didn't tend to be worryingly similar; more variables didn't match exactly than those that did.
  2. Synthetic records that exactly match both a training record and a test record. These were very simple records that look like they could represent many thousands of taxpayers. (Aside: we should deduplicate records and add a weight for faster execution.)
  3. The synthetic record that exactly matches a training record and has the largest distance from any test record. These are potential disclosure risks, and we should monitor them when synthesizing larger datasets. We could also explore mitigation approaches like synthesizing more records and discarding exact matches (see the sketch after this list).
  4. The synthetic record which exactly matches a test record and has the largest distance from a training record (we like these). These were relatively simple records too.
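
A hedged sketch of that "synthesize more and discard exact matches" idea from item 3; this is not implemented in the PR, and the helper name is made up. It assumes the `risk` frame from the earlier sketch:

```python
def drop_exact_train_matches(synth, risk, target_n, seed=0):
    """Keep only synthetic records with nonzero distance to every
    training record, then sample down to the desired file size."""
    safe_ids = risk.loc[risk["train_dist"] > 0, "synth_id"]
    safe = synth.loc[synth.index.isin(safe_ids)]
    return safe.sample(n=min(target_n, len(safe)), random_state=seed)
```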

One other note: I split the distance computation into blocks by MARS/DSI/XTOT, since anything less was crashing my Python kernel. So this might paint a somewhat rosier picture than we'd get from the full distance matrix. I'll be looking into optimizations to make the full computation possible.
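
Roughly, the blocking looks like this. Again a sketch rather than the notebook's actual code, reusing the hypothetical `nearest_record_distances` helper from above:

```python
import pandas as pd

# Compute nearest-record distances only within groups that share
# MARS/DSI/XTOT, so each pairwise distance computation stays small.
BLOCK_COLS = ["MARS", "DSI", "XTOT"]


def blocked_distances(synth: pd.DataFrame, train: pd.DataFrame) -> pd.DataFrame:
    train_blocks = dict(tuple(train.groupby(BLOCK_COLS)))
    pieces = []
    for key, synth_block in synth.groupby(BLOCK_COLS):
        train_block = train_blocks.get(key)
        if train_block is None:
            continue  # no training records share this MARS/DSI/XTOT cell
        pieces.append(nearest_record_distances(
            synth_block.drop(columns=BLOCK_COLS),
            train_block.drop(columns=BLOCK_COLS)))
    return pd.concat(pieces, ignore_index=True)
```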