alan-turing-institute / synthetic-data-general

Central repository to keep track of all the synthetic data projects ongoing at REG and the Turing
MIT License

Weekly Standup 2021/11/15 #3

Open github-actions[bot] opened 2 years ago

github-actions[bot] commented 2 years ago

Please post any useful updates from your project

callummole commented 2 years ago

In the ONS project we are recreating datasets from publicly available tables to assess the privacy risk of releasing those tables.

More detail: we produce synthetic data by reconstructing distributions from a set of two-way marginal tables. Each table counts how many individuals in a dataset fall into each combined category of two variables; for marital status and sex, for example, a single cell could be single + male. We use a method called iterative proportional fitting, which iteratively adjusts a distribution until its marginal counts match the targets (e.g. it matches how many single and married people there are, and also how many males and females). We have begun analysing the extent to which individuals present in the 1% census data teaching file are also present in the recreated dataset (built from two-way tables of the 1% census data), thereby assessing the privacy risk of releasing those tables.
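For anyone unfamiliar with iterative proportional fitting, here is a minimal sketch of the idea in Python with NumPy. It is not the project's code: the function name `ipf_two_way`, the toy table shape, and the randomly generated counts are all assumptions made for illustration. It starts from a uniform joint table and repeatedly rescales it so that each two-way margin matches its target.

```python
import numpy as np


def ipf_two_way(margins, shape, n_iter=1000, tol=1e-10):
    """Fit a joint table of `shape` to a set of two-way marginal tables.

    `margins` maps a pair of axis indices (i, j), with i < j, to the
    target 2-D table of counts for those two variables.
    (Hypothetical helper written for this illustration.)
    """
    n_dims = len(shape)
    joint = np.ones(shape, dtype=float)  # uniform starting distribution
    for _ in range(n_iter):
        max_err = 0.0
        for (i, j), target in margins.items():
            # Sum out every axis except i and j, keeping dims for broadcasting.
            other = tuple(k for k in range(n_dims) if k not in (i, j))
            current = joint.sum(axis=other, keepdims=True)
            # Reshape the target so it lines up with axes i and j of the joint.
            bshape = [1] * n_dims
            bshape[i], bshape[j] = shape[i], shape[j]
            target_b = np.asarray(target, dtype=float).reshape(bshape)
            max_err = max(max_err, float(np.abs(current - target_b).max()))
            # Rescale each cell by target / current (0 where current is 0).
            ratio = np.divide(
                target_b, current, out=np.zeros_like(target_b), where=current > 0
            )
            joint = joint * ratio
        if max_err < tol:  # all margins already matched before this sweep
            break
    return joint


# Toy example: made-up counts for three variables with 2, 3 and 4 categories.
rng = np.random.default_rng(0)
true_joint = rng.integers(1, 20, size=(2, 3, 4)).astype(float)
margins = {
    (0, 1): true_joint.sum(axis=2),
    (0, 2): true_joint.sum(axis=1),
    (1, 2): true_joint.sum(axis=0),
}
fitted = ipf_two_way(margins, true_joint.shape)
```

In this sketch `fitted` reproduces the three two-way tables but is generally not identical to `true_joint`, since several joint distributions can share the same two-way margins. That gap is what makes the question of how many real individuals reappear in the recreated data worth measuring.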

code

triangle-man commented 2 years ago

I intend to start work on https://github.com/alan-turing-institute/Hut23/issues/1013 (Synthetic Data, Federated Learning, and Privacy Trade-Offs) by writing a backbrief. I will create a repo and link to it from here.

crangelsmith commented 2 years ago

I've been trying to understand the current state of QUIPP, from conversations with the QUIPP team to attempting to run the pipeline. For the next week or so I'm focusing on finishing the RDS course, but once that is done I plan to write up a report describing what QUIPP is and what it can do.