ots22 closed this 4 years ago
Suggested dataset: 1% sample ONS microdata teaching file
Suggested method: simulated annealing
Tools: MIPFP, IPF, maybe synthpop, GENSA (simulated annealing)
@gmingas, if you do end up taking a look at some microsimulation methods next week, the synthPop package seems to be a good place to start. I had a brief read through the paper this afternoon and I'm wishing I'd started with that one now: it has a nice example towards the end that will make a good starting point.
No worries if you'd rather focus on a different method; I can pick this up when I'm back on the 25th instead 😄
Edit: Oops, that should really have said simPop! Greg, sorry for being confusing there, but it looks like you've picked up both anyway.
I pushed a first quick synthpop pipeline to QUIPP-pipeline today. It is still quite basic and lacks proper quality controls (e.g. tests, error handling). It uses the embedded data set. It demonstrates the main functionality and parameters of the package, how privacy can be tuned (the options are limited but it is a start) and a few utility metrics. Next I will try to test it on the ONS Census dataset and improve the pipeline. Also, I believe synthpop is actually a Multiple Imputation method (although the authors do not clearly adopt this characterisation), so maybe we should move the discussion under the relevant topic.
Also, I will try to do something similar with simPop, which is tailored for microsimulation (along with other R libraries like sms and humanleague).
Some relevant CRAN Task Views (especially the second one for microsim): https://cran.r-project.org/web/views/MissingData.html https://cran.r-project.org/web/views/OfficialStatistics.html
I had a search around for other relevant libraries yesterday; looks like there are more in the R ecosystem than elsewhere. The list collates packages mentioned in Spatial Microsimulation with R along with a few others I found elsewhere:
simPop
synthpop
humanleague
I can take a look at humanleague now as it has a couple of different methods available there. (Edit: Available in both Python and R. I'll have a go in Python.)
humanleague is the package which currently creates the synthetic population in version 1 of SPENSER (my Turing project). It was written by my former postdoc Andrew Smith. It does synthesis using IPF and Quasirandom integer sampling (similar to IPF but produces whole numbers rather than fractions). It should work using the code on the GitHub page, but shout if there are any issues.
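For anyone new to IPF, here is a minimal sketch of the idea in plain Python for a two-way table. This is not humanleague's API or implementation; the function name `ipf_2d`, the seed table and the target totals are all made up for illustration. It alternately rescales rows and columns until both sets of totals match the targets, which is why the fitted cells usually come out as fractions rather than whole numbers.

```python
def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Fit a 2-D table to row and column totals by iterative proportional fitting.

    Illustrative helper only (hypothetical name, not part of any package).
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns match after the column step, so only the rows need checking.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical inputs: a seed cross-tabulation from a sample, plus two
# marginals that must share the same grand total (here 100).
seed = [[1, 2], [3, 4]]
row_targets = [40, 60]
col_targets = [55, 45]

fitted = ipf_2d(seed, row_targets, col_targets)
for row in fitted:
    print([round(v, 3) for v in row])
```

The fitted table keeps the association structure of the seed while matching both marginals. Turning those fractional cells into whole individuals needs a separate integerisation or sampling step.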
You can use the following website to create aggregated (cross-tabulated) data from the 2011 UK Census. These can be used as the "target"/"marginal"/"cross-tabulated" input in microsimulation. My understanding is that the data come from the same source as the ONS Census sample we are currently using. http://infuse.ukdataservice.ac.uk/
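To make the "target"/"marginal" terminology concrete, here is a small sketch of how marginals and a cross-tabulation relate to individual-level records. The records and category labels below are invented for illustration; they are not taken from the ONS sample or from the InFuse tables.

```python
from collections import Counter

# Hypothetical microdata: one (age_band, sex) tuple per individual.
records = [
    ("0-15", "F"), ("0-15", "M"), ("16-64", "F"),
    ("16-64", "F"), ("16-64", "M"), ("65+", "F"),
]

# One-way marginals: counts per category of a single variable.
age_marginal = Counter(age for age, _ in records)
sex_marginal = Counter(sex for _, sex in records)

# Two-way cross-tabulation: counts per combination of categories.
crosstab = Counter(records)

print(dict(age_marginal))
print(dict(sex_marginal))
print(dict(crosstab))
```

Aggregated census tables play the role of `age_marginal` and `sex_marginal` here, but for the full population of an area, and the microsimulation method then tries to produce individuals whose counts reproduce them.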
Add code for this pipeline to https://github.com/alan-turing-institute/QUIPP-pipeline under methods/LIBRARY_NAME/, and any datasets in datasets.