alan-turing-institute / QUIPP-collab

Collaboration on the QUIPP project

First attempt at a microsimulation based pipeline #22

Closed ots22 closed 4 years ago

ots22 commented 4 years ago

Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and add any datasets under datasets.

ots22 commented 4 years ago

Suggested dataset: 1% sample ONS microdata teaching file
Suggested method: simulated annealing
Tools: MIPFP, IPF, maybe synthpop, GENSA (simulated annealing)

LouiseABowler commented 4 years ago

@gmingas, if you do end up taking a look at some microsimulation methods next week, the synthPop package seems to be a good place to start. I had a brief read through the paper this afternoon and I'm wishing I'd started with that one now - it has a nice example towards the end that will make a good starting point.

No worries if you'd rather focus on a different method; I can pick this up when I'm back on the 25th instead 😄

Edit: Oops, that should really have said simPop! Greg, sorry for being confusing there - but looks like you've picked up both anyway.

ots22 commented 4 years ago

https://www.ons.gov.uk/census/2011census/2011censusdata/censusmicrodata/microdatateachingfile

gmingas commented 4 years ago

I pushed a first quick synthpop pipeline to QUIPP-pipeline today. It is still quite basic and lacks proper quality controls (tests, error handling, etc.). It uses the dataset embedded in the package. It demonstrates the main functionality and parameters of the package, how privacy can be tuned (the options are limited, but it is a start) and a few utility metrics. Next I will try to test it on the ONS Census dataset and improve the pipeline. Also, I believe synthpop is actually a Multiple Imputation method (although the authors do not clearly adopt this characterisation), so maybe we should move the discussion under the relevant topic.

gmingas commented 4 years ago

Also, I will try to do something similar with simPop, which is tailored for microsimulation (along with other R libraries like sms and humanleague).

Some relevant CRAN Task Views (especially the second one for microsim): https://cran.r-project.org/web/views/MissingData.html https://cran.r-project.org/web/views/OfficialStatistics.html

LouiseABowler commented 4 years ago

I had a search around for other relevant libraries yesterday; looks like there are more in the R ecosystem than elsewhere. The list collates packages mentioned in Spatial Microsimulation with R along with a few others I found elsewhere:

I can take a look at humanleague now, as it has a couple of different methods available. (Edit: It is available in both Python and R - I'll have a go in Python)

niklomax commented 4 years ago

humanleague is the package which currently creates the synthetic population in version 1 of SPENSER (my Turing project). It was written by my former postdoc Andrew Smith. It does synthesis using IPF and quasirandom integer sampling (similar to IPF, but it produces whole numbers rather than fractions). It should work using the code on the GitHub page, but shout if there are any issues.
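For anyone new to IPF, here is a minimal pure-Python sketch of the two-dimensional case. It is not humanleague's API or implementation; the function name `ipf_2d`, its parameters and the numbers are all made up for illustration. It alternately rescales rows and columns of a seed table until both sets of marginal totals are matched, and it shows the fractional cell values mentioned above.

```python
def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Illustrative 2D iterative proportional fitting (hypothetical helper).

    seed        -- list of lists of non-negative numbers (the starting table)
    row_targets -- desired row totals
    col_targets -- desired column totals (must sum to the same grand total)
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Step 1: scale each row so that it sums to its target.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                factor = row_targets[i] / total
                table[i] = [v * factor for v in row]
        # Step 2: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows also match.
        error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if error < tol:
            break
    return table


# Invented example: a 2x2 seed table and invented marginal totals.
seed = [[1, 2], [3, 4]]
row_targets = [40, 60]
col_targets = [30, 70]
fitted = ipf_2d(seed, row_targets, col_targets)

for row in fitted:
    print([round(v, 3) for v in row])
```

The fitted cells are generally not integers, which is the gap that an integer-valued sampling step is meant to close.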

gmingas commented 4 years ago

You can use the following website to create aggregated (cross-tabulated) data from the 2011 UK Census. These can be used as the "target"/"marginal"/"cross-tabulated" input in microsimulation. My understanding is that the data come from the same source as the ONS Census sample we are currently using. http://infuse.ukdataservice.ac.uk/
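To make the roles of the two inputs concrete, here is a small sketch using only the Python standard library. The records, variable names (`sex`, `age_band`) and target totals are invented for illustration and are not taken from the ONS sample or from any downloaded table. The individual-level sample supplies a seed cross-tabulation, while the aggregated data supplies the target marginals.

```python
from collections import Counter

# Invented records standing in for rows of an individual-level sample.
sample = [
    {"sex": "F", "age_band": "16-24"},
    {"sex": "F", "age_band": "25-34"},
    {"sex": "M", "age_band": "16-24"},
    {"sex": "M", "age_band": "25-34"},
    {"sex": "M", "age_band": "25-34"},
]


def crosstab(records, row_var, col_var):
    """Count the records falling in each (row_var, col_var) cell."""
    return Counter((r[row_var], r[col_var]) for r in records)


def marginal(records, variable):
    """Count the records in each category of a single variable."""
    return Counter(r[variable] for r in records)


# Seed table built from the sample.
seed = crosstab(sample, "sex", "age_band")

# Invented area-level totals standing in for a downloaded aggregate table.
target_sex = {"F": 520, "M": 480}
target_age = {"16-24": 300, "25-34": 700}

# Both target marginals must describe the same population size.
assert sum(target_sex.values()) == sum(target_age.values())

print(dict(seed))
print(dict(marginal(sample, "sex")))
```

A fitting or sampling method would then adjust the seed so that its marginals agree with the targets.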

gmingas commented 4 years ago

I uploaded my notes on simPop in the report.