alan-turing-institute / synthetic-data-general

Central repository to keep track of all the synthetic data projects ongoing at REG and the Turing

Weekly Standup 2021-12-06 #6

github-actions[bot] opened 2 years ago

github-actions[bot] commented 2 years ago

Please post any useful updates from your project

callummole commented 2 years ago

Not much progress on code last week due to being off with the flu. But it is useful to present the ONS project as two emerging sub-projects/work-streams, both currently using the Census 1% teaching data.

1) Privacy Analysis of available tables: Comprising the Turing research effort (Sam, Lukasz, Me, James Jordon, Florimon) and some more research/technically minded folk from ONS (Robin Mitra - Cardiff Uni; Alex N - ONS). This stream of work is the priority and consists of theory-led privacy analysis of synthetic data created from a larger dataset under the constraints given by a number of 2- or 3-way marginal counts.
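
For anyone less familiar with the setup, a minimal sketch of what "constraints given by a number of 2- or 3-way marginal counts" means in practice. The column names and file path are illustrative only, not the actual Census 1% teaching-data schema:

```python
# Illustrative sketch: the released constraints are low-order marginal
# count tables computed from a census-like microdata file.
import pandas as pd

df = pd.read_csv("census_teaching_1pct.csv")  # placeholder path

# A 2-way marginal: counts over every (age_band, region) combination.
two_way = df.groupby(["age_band", "region"]).size().rename("count").reset_index()

# A 3-way marginal adds a third attribute.
three_way = (
    df.groupby(["age_band", "region", "economic_activity"])
    .size()
    .rename("count")
    .reset_index()
)

# The privacy question is what these released tables reveal about the
# underlying microdata, and about synthetic data generated to match them.
```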

2) QUIPP & SynthGauge. ONS have developed SynthGauge, a python package that contains metrics and visualisation tools for general assessment of synthetic data. This does some of the things that QUIPP does (but doesn't do synthetic data generation) and I'm not sure where the overlap stops. I've suggested that SynthGauge be made public so that other REG projects can use it and provide feedback on it. This was met favourably, pending senior approval. I have access to the repository (SynthGauge is actually a folder in a project-wide repository currently) and SynthGauge is well documented and seems to be sensibly structured and easy to use, so could be really useful on projects where assessment of a dataset is the main goal. I'm not completely sure what this work stream is (currently there is discussion that QUIPP generates the data and SynthGauge assesses it, repeat) but it is important to the project because it is an obvious way that Turing & ONS are collaborating.

@triangle-man @crangelsmith the biggest opportunity for synergy across projects is in 2). There is work needed to make QUIPP accessible to an outsider, which I probably won't have time for. Fortunately QUIPP is a focus for you both, and @crangelsmith started last week on a QUIPP tutorial using the Census 1% as a case study. On top of the software development there is work to be done assessing, both theoretically and practically, how to go about creating and evaluating synthetic data. I see potential for something like a Synthetic Data Workflow (e.g. Bayesian Workflow, not as involved but similar in principle) to be a cross-cutting theme the projects work towards, using the individual project data requirements as case studies.
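
One way to picture that cross-cutting theme is as a generate/assess/repeat loop. The sketch below is purely illustrative of the shape of such a workflow; every name in it (generate_synthetic, assess, suggested_params) is hypothetical and not an existing QUIPP or SynthGauge interface:

```python
def synthetic_data_workflow(real_df, generate_synthetic, assess,
                            max_rounds=5, target_utility=0.95):
    """Illustrative loop: synthesise, evaluate, adjust, repeat."""
    params = {}
    synth_df, report = None, None
    for _ in range(max_rounds):
        synth_df = generate_synthetic(real_df, **params)  # e.g. a QUIPP run
        report = assess(real_df, synth_df)                # e.g. a SynthGauge-style report
        if report["utility"] >= target_utility:
            break
        params = report.get("suggested_params", params)   # adjust and go again
    return synth_df, report
```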

crangelsmith commented 2 years ago

Last week I was trying to get my head around QUiPP, both the software and the project. This meant trying to run the code, exploring different branches, and having conversations with @callummole and @gmingas. Some of my understanding is written in this document (still very preliminary text, full of notes and placeholders). Some main points from last week:

  1. There are two branches that diverged from develop and have important contributions. One is the branch where the census data was generated for the ONS project, which contains the most documented example of how QUiPP works. The other is the develop-paper branch, which has some new utility metrics and updated synthesis methods (plus a lot of other developments made specifically for the paper) but not very clear documentation/examples.

  2. QUiPP is in need of some harmonisation. The develop branch is behind the main new developments, but not everything done in the other branches is useful for QUiPP as a data synthesis and evaluation pipeline. Furthermore, there is a lot of legacy code/data/methods added at the beginning for exploration that still runs in the Makefile for no real reason. I think some of the new contributions from develop-paper need to be brought into the develop branch, but the Makefile needs to be simplified and the structure of the pipeline made clearer.

  3. QUiPP is in need of a detailed tutorial that demonstrates the workflow, all the available synthesis methods, and the utility/privacy metrics. The notebooks generated for the census in the ONS project are the closest to this, but they do not include all the new developments from the develop-paper branch and are not very well documented (they weren't meant to be a tutorial). Given that the ONS project is atm our main user (and there are some requests from the ONS team to @callummole about clarity of what was done), I believe we can use this example as the base for writing a QUiPP tutorial on a harmonised, updated branch.

This week I plan to keep working on the document above, focusing more on documenting the available methods and metrics. I'll also start thinking about how to implement points 2 and 3, keeping in mind the point from @callummole about SynthGauge.

triangle-man commented 2 years ago

@crangelsmith Is it possible to point HackMD at a repo? (It would be nice to have your document in the repo, though I understand it's nice to work on things in HackMD.)

martintoreilly commented 2 years ago

I think so: See https://hackmd.io/c/tutorials/%2Fs%2Flink-with-github

triangle-man commented 2 years ago

See my notes on privacy in https://github.com/alan-turing-institute/synthetic-data-federated-learning/tree/draft/meta