Open markotainio opened 6 years ago
Thinking out loud, I identify 3 key questions:
Re 2. I would suggest translating this into "generic domains of (synthetic) population attributes", which should help guide data inputs and processing (even though, as in the example above, attributes across different domains will come from the same sources, such as the census).
The main reason for this is that the tool will eventually have to handle input data at different levels of detail, or variables with slightly different definitions when they come from different surveys. So we need a mapping exercise similar to the one discussed for mode definitions (#14), which will match the "attribute data provided by user" with the "generic attribute definition" used by the model.
For example:
...I imagine the above structure would then be reflected in the input data module, in the synthetic population generation, in the exposure estimation, etc., and eventually also in the flow chart, a bit like for input data here
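To make the mapping idea concrete, here is a minimal sketch of matching "attribute data provided by user" to a "generic attribute definition". Everything in it is an assumption for illustration: the age bands, the sex codes (1/2, M/F), the column names and the `harmonise` helper are hypothetical and do not come from ITHIM-R or METAHIT.

```python
# Hypothetical generic definitions used by the model (assumed, not from ITHIM-R).
GENERIC_AGE_GROUPS = [(0, 15, "0-15"), (16, 44, "16-44"), (45, 64, "45-64"), (65, 120, "65+")]
SEX_CODES = {"1": "male", "m": "male", "male": "male",
             "2": "female", "f": "female", "female": "female"}


def to_generic_age_group(age):
    """Map an age in years onto the generic age bands."""
    for low, high, label in GENERIC_AGE_GROUPS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age!r} is outside the generic age bands")


def to_generic_sex(value):
    """Map a survey-specific sex code onto the generic labels."""
    key = str(value).strip().lower()
    if key not in SEX_CODES:
        raise ValueError(f"unknown sex code {value!r}")
    return SEX_CODES[key]


def harmonise(record, column_map):
    """Translate one user record into generic attributes.

    column_map maps generic attribute name -> (user column, converter).
    """
    return {generic: convert(record[column])
            for generic, (column, convert) in column_map.items()}


# Two hypothetical surveys that code the same attributes differently.
SURVEY_A_MAP = {"age_group": ("AGE", to_generic_age_group),
                "sex": ("SEX", to_generic_sex)}
SURVEY_B_MAP = {"age_group": ("age_years", lambda v: to_generic_age_group(int(v))),
                "sex": ("gender", to_generic_sex)}

survey_a_record = {"AGE": 37, "SEX": 2}
survey_b_record = {"age_years": "37", "gender": "F"}

generic_a = harmonise(survey_a_record, SURVEY_A_MAP)
generic_b = harmonise(survey_b_record, SURVEY_B_MAP)
```

With a per-survey `column_map` like this, the user only describes how their columns relate to the generic attributes, and the downstream modules only ever see the generic names.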
Re 2. What if the user does not have:
Re 3. @AnnaGoodman1's documentation for METAHIT is helpful. Main questions:
Do we have a nice, short write-up for the objective of creating a synthetic population?
"Comprehensive health impact assessment requires rich data on various attributes of the assessed population (i.e. socio-demographics, travel behavior, exposures, health). Because such data is typically not available from a single survey, data from multiple sources are merged. Since these data are not from the same individuals (i.e. different survey participants), they are matched by important attributes of participants, like age and sex. The resulting data set is a best possible representation of the local population including all the attributes collected by different surveys. The data records are synthetic because they combine information from multiple persons and assign it to a single, synthetic individual."
(replace as you see fit)
some notes from call:
Great work @AnnaGoodman1, definitely a challenging problem.
I also think we need to consider alternative methods for the matching itself. The current method, which randomly selects one person from the bin of people who match exactly on the chosen characteristics, is sub-optimal in my opinion for at least 2 reasons:
The matching has the potential to generate misclassification, with some people being well matched and other people being poorly matched on key variables that will be modelled later. This approach also makes it hard to explore and understand the extent of the misclassification between participants.
Given that users will have different data, both in terms of variables and in terms of the categories within the same variable that need to be matched between datasets, creating a generalizable approach using the current matching method will be challenging.
I think exploring propensity score (or other statistical) matching methods could overcome some (hopefully many) of the challenges with this.
as a quick comment on propensity scores (@walkabilly)
Happily, the first case is what I think you are trying to do in ithim, so that sounds promising as something to try and I'll be interested to hear where you get to. good luck!
Discussion on how to create the so-called synthetic population that will be used in the rest of the calculations.
Synthetic refers here to micro (individual) level data where we have individuals with:
Typically, background PA and travel data are collected in different surveys, so to create individuals with both we need to match individuals from the two datasets to create "synthetic" individuals. This is done using stochastic matching of individuals, based on some predefined attributes (e.g. age, sex, socioeconomic status, location).
Attached to this message you can see a description by @AnnaGoodman1 of how she created the synthetic population for the METAHIT project using three English datasets:
Also attached is a sample of how the data looks for ten individuals, and the variable descriptions (note, the sample doesn't include all the variables).
The idea for ITHIM-R would be to include this stochastic sampling inside ITHIM-R, so that the user would need to provide individual-level datasets, and ITHIM-R would then be able to create the synthetic population from these datasets.
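A minimal sketch of this kind of stochastic matching, assuming exact matching on the predefined attributes and a random draw of one donor per person. The function name, the field names and the toy records are hypothetical and are not taken from the METAHIT procedure.

```python
import random
from collections import defaultdict

MATCH_KEYS = ["age_group", "sex"]  # assumed matching attributes


def match_stochastically(recipients, donors, keys, rng):
    """Attach one randomly drawn donor record to each recipient.

    Donors are binned by their values on `keys`; each recipient draws
    one donor at random from the bin with identical values.
    """
    bins = defaultdict(list)
    for donor in donors:
        bins[tuple(donor[k] for k in keys)].append(donor)

    synthetic = []
    for person in recipients:
        candidates = bins.get(tuple(person[k] for k in keys))
        if not candidates:
            # No donor shares these attributes: keep the person, flag the gap.
            synthetic.append({**person, "matched": False})
            continue
        donor = rng.choice(candidates)
        # Copy only the attributes the recipient does not already have.
        extra = {k: v for k, v in donor.items() if k not in person}
        synthetic.append({**person, **extra, "matched": True})
    return synthetic


# Toy records standing in for a travel survey and a PA survey.
travel_survey = [
    {"person_id": 1, "age_group": "16-44", "sex": "F", "cycle_km_week": 12.0},
    {"person_id": 2, "age_group": "45-64", "sex": "M", "cycle_km_week": 0.0},
    {"person_id": 3, "age_group": "65+", "sex": "F", "cycle_km_week": 3.5},
]
pa_survey = [
    {"pa_id": "a", "age_group": "16-44", "sex": "F", "mmet_hours_week": 10.5},
    {"pa_id": "b", "age_group": "16-44", "sex": "F", "mmet_hours_week": 2.0},
    {"pa_id": "c", "age_group": "45-64", "sex": "M", "mmet_hours_week": 6.0},
]

synthetic = match_stochastically(travel_survey, pa_survey, MATCH_KEYS, random.Random(1))
```

In this toy data person 3 has no donor with the same age group and sex, so they stay unmatched. How such cases should be handled (for example, by relaxing the matching attributes) is left open here.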
0.1a_SPpreparation_Overview_180206.docx
180219_Metahit10000_v3_sample.xlsx