Synthetic population (stochastic matching of individuals)

markotainio commented 6 years ago

Discussion on how to create the so-called synthetic population that will be used in the rest of the calculations.

Synthetic refers here to micro (individual) level data where we have individuals with:

Background (non-travel, non-occupation) physical activity behavior (e.g. based on health surveys)
Background travel data on trips of individuals (based e.g. on travel survey, census etc.)

Typically, background PA and travel data are collected in different surveys, so to obtain individuals with both we need to match individuals across the two datasets and create "synthetic" individuals. This is done by stochastic matching of individuals on some predefined attributes (e.g. age, sex, socioeconomic status, location).
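As a rough illustration of the matching idea, here is a minimal Python sketch. It is not the METAHIT procedure: the function name `stochastic_match`, the field names (`age_band`, `sex`, `trips`, `pa_mmet_h_week`) and the toy records are all assumptions made up for the example. Each travel-survey person receives the background PA of one health-survey person drawn at random from those with the same matching attributes.

```python
import random
from collections import defaultdict


def stochastic_match(recipients, donors, keys, donor_fields, seed=0):
    """Copy donor_fields from a randomly drawn donor that shares the same keys."""
    rng = random.Random(seed)

    # Group donors into bins by their combination of matching attributes.
    bins = defaultdict(list)
    for donor in donors:
        bins[tuple(donor[k] for k in keys)].append(donor)

    synthetic = []
    for recipient in recipients:
        person = dict(recipient)
        candidates = bins.get(tuple(recipient[k] for k in keys))
        if candidates:
            donor = rng.choice(candidates)  # the stochastic step
            for field in donor_fields:
                person[field] = donor[field]
        else:
            # No donor in this bin: leave the fields empty rather than guess.
            for field in donor_fields:
                person[field] = None
        synthetic.append(person)
    return synthetic


# Toy records, invented for illustration only.
travel = [
    {"id": 1, "age_band": "16-44", "sex": "F", "trips": ["walk", "bus"]},
    {"id": 2, "age_band": "45-64", "sex": "M", "trips": ["car"]},
    {"id": 3, "age_band": "65+", "sex": "F", "trips": []},
]
health = [
    {"age_band": "16-44", "sex": "F", "pa_mmet_h_week": 12.0},
    {"age_band": "16-44", "sex": "F", "pa_mmet_h_week": 3.5},
    {"age_band": "45-64", "sex": "M", "pa_mmet_h_week": 8.0},
]
population = stochastic_match(travel, health, ["age_band", "sex"], ["pa_mmet_h_week"])
```

In this toy data person 3 has no donor in her bin and stays unmatched, which is the kind of gap a real implementation would have to handle explicitly.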

Attached to this message is a description by @AnnaGoodman1 of how she created a synthetic population for the METAHIT project using three English datasets:

Census
Travel survey
Active People Survey

Also attached is a sample showing how the data look for ten individuals, together with variable descriptions (note that the sample doesn't include all the variables).

The idea for ITHIM-R would be to include this stochastic sampling inside ITHIM-R, so that the user would need to provide individual-level datasets, but ITHIM-R would be able to create the synthetic population from these datasets.

0.1a_SPpreparation_Overview_180206.docx

180219_Metahit10000_v3_sample.xlsx variables_v2

gotom22 commented 6 years ago

Thinking out loud, I identify 3 key questions:

Matching/translating of user input data with generic attribute definitions used by the model
How to handle gaps in input data provided by user
"Hierarchy of priorities in how the attributes define the synth. pop" (I try to explain below)

Re 1. I would suggest translating user input into "generic domains of (synth.) population attributes", which should help guide data inputs (and processing), even though, as in the example above, attributes in different domains will often come from the same source, like the census.

The main reason for this is that eventually the tool will have to handle different levels of detail in the input data, or variables with slightly different definitions when they come from different surveys. So we need a mapping exercise similar to the one discussed for mode definitions (#14), which will match "attribute data provided by user" with the "generic attribute definition" used by the model.

For example:

Basic Population data (basic socio-demographics) -- age (distribution) -- sex (distribution) -- income
Spatial Population data -- home location -- work location -- region, neighborhood attributes
Behavioural data -- travel ---- vehicle ownership -- physical activity ---- by domains
Exposure relevant data -- air pollution ---- city background PM ---- proximity to freeway ---- ... ---- model derived estimates based on home location coordinates ---- model derived estimates based on travel data and home XY -- noise -- traffic (as injury risk)
Health relevant data -- BMI -- ...

...I imagine above structure would then be reflected in input data module, in synth pop. generation, in exposure estimation, etc. And eventually also reflected in flow chart, a bit like for input data here

Re 2. What if the user does not have:

the same definitions for the attribute variables: user interface/input module will have to provide some translation map.
all attributes used by the model: will default distributions be used? Can the user adjust these (e.g. move the average age in a default age distribution)?
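To make the "translation map plus defaults" idea a bit more concrete, here is a small Python sketch. Everything in it is hypothetical: the generic attribute names, the user column names (`AGE`, `GENDER`), the 1/2 sex coding and the income shares are placeholders invented for illustration, not anything ITHIM-R defines. User columns are renamed and recoded to the generic definitions, and any attribute the user did not supply is drawn from a default distribution that the user could override.

```python
import random

# Hypothetical generic attribute names; the real list would come from the model.
GENERIC_ATTRIBUTES = ("age", "sex", "income_band")

# Placeholder default distribution (made-up shares, for illustration only).
DEFAULT_DISTRIBUTIONS = {"income_band": {"low": 0.3, "mid": 0.5, "high": 0.2}}


def translate(records, column_map, value_maps=None, defaults=None, seed=0):
    """Rename and recode user columns, then fill missing attributes from defaults."""
    rng = random.Random(seed)
    value_maps = value_maps or {}
    defaults = DEFAULT_DISTRIBUTIONS if defaults is None else defaults

    translated = []
    for rec in records:
        row = {}
        # Step 1: map user column names and category codes to generic ones.
        for user_col, generic in column_map.items():
            if user_col in rec:
                recode = value_maps.get(generic, {})
                row[generic] = recode.get(rec[user_col], rec[user_col])
        # Step 2: fill gaps by sampling from the (adjustable) default distribution.
        for attr in GENERIC_ATTRIBUTES:
            if attr not in row:
                if attr not in defaults:
                    raise ValueError(f"no data and no default for {attr!r}")
                categories = list(defaults[attr])
                weights = [defaults[attr][c] for c in categories]
                row[attr] = rng.choices(categories, weights=weights)[0]
        translated.append(row)
    return translated


# Hypothetical user survey with its own column names and sex coding.
user_rows = [{"AGE": 34, "GENDER": 1}, {"AGE": 70, "GENDER": 2}]
translated = translate(
    user_rows,
    column_map={"AGE": "age", "GENDER": "sex"},
    value_maps={"sex": {1: "M", 2: "F"}},
)
```

Passing a different `defaults` dictionary is one way the user could adjust a default distribution; whether that is the right interface is an open design question.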

Re 3. @AnnaGoodman1 's documentation for METAHIT is helpful. Main questions:

is the plan to have the same (METAHIT) hard-coded matching/synthesization procedure, or
will the tool first collect "available data" and then identify an "optimal synthesization process", and if so
what is the "synthesization sequence", probably using basic data first, then adding layers of complexity?

gotom22 commented 6 years ago

Do we have a nice, short write-up for the objective of creating a synthetic population?

"Comprehensive health impact assessment requires rich data on various attributes of the assessed population (i.e. socio-demographics, travel behavior, exposures, health). Because such data is typically not available from a single survey, data from multiple sources are merged. Since these data are not from the same individuals (i.e. different survey participants), they are matched by important attributes of participants, like age and sex. The resulting data set is a best possible representation of the local population including all the attributes collected by different surveys. The data records are synthetic because they combine information from multiple persons and assign it to a single, synthetic individual."

(replace as you see fit)

gotom22 commented 6 years ago

some notes from call:

most common use case prob.: matching travel survey and health survey
matching would be part of tool
would matching criteria be flexible, or hard coded?
calculation duration in UK was an issue, since generating 40 million agents (~1h)
@AnnaGoodman1 has STATA code (built from scratch, incl. some STATA functions maybe)
which are the key matching variables?
issues with trying to match on indicators of background physical activities.
area types (based on land use or similar) could make for an interesting matching criterion. Neil will send article. Something like walk score/land use index may be key.
write up the advantages, objectives of matching/synthetic population (i.e. nonlin DRF and background PA, substitution effect,
income!
would modeling a PA distribution be preferable over noisy PA survey data?
matching variable selection should prob. be flexible, prob. in addition to age and sex.
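One possible way to keep the matching variables flexible is to let the user pass an ordered list of key sets and fall back to a coarser set whenever the stricter one finds no donor. The Python sketch below is only an assumption about how that could look: `match_with_fallback`, the field names and the two toy donors are invented here. It scans all donors for each recipient, so it would need pre-built indexes to be usable at the scale mentioned above.

```python
import random


def match_with_fallback(recipient, donors, key_sets, rng):
    """Try each key set in order, from most to least specific.

    Returns the chosen donor and the key set that produced the match,
    or (None, None) if even the loosest key set finds nobody.
    """
    for keys in key_sets:
        candidates = [d for d in donors if all(d[k] == recipient[k] for k in keys)]
        if candidates:
            return rng.choice(candidates), keys
    return None, None


# Toy donors (health survey) and one recipient (travel survey), invented for illustration.
donors = [
    {"age_band": "16-44", "sex": "F", "area_type": "urban", "pa": 10.0},
    {"age_band": "65+", "sex": "M", "area_type": "rural", "pa": 2.0},
]
recipient = {"age_band": "65+", "sex": "M", "area_type": "urban"}

# User-configurable matching variables; the empty tuple means "any donor".
key_sets = [("age_band", "sex", "area_type"), ("age_band", "sex"), ()]
donor, used_keys = match_with_fallback(recipient, donors, key_sets, random.Random(1))
```

Here no donor shares the recipient's age band, sex and area type, so the match falls back to age band and sex only. Keeping `used_keys` with each synthetic individual records how loosely that person was matched.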

walkabilly commented 6 years ago

Great work @AnnaGoodman1, definitely a challenging problem.

I also think we need to think about possible alternative methods for the matching itself. The current method of randomly selecting and matching one person from the bin of people based on the exact match of characteristics is sub-optimal in my opinion for at least 2 reasons:

The matching has the potential to generate misclassification, with some people being well matched and other people being poorly matched on key variables that will be modelled later. This approach also makes it hard to explore and understand the extent of the misclassification between participants.
Given that users will have different data, both in terms of variables and in terms of categories within the same variable, that will need to be matched between datasets, creating a generalizable approach using the current matching method will be challenging.

I think exploring propensity score (or other statistical) matching methods could overcome some (hopefully many) of the challenges with this.
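To be clear, the sketch below is not a propensity score model. It is a simpler stand-in from the "other statistical matching" family: a Gower-style distance over the matching attributes, with the closest donor chosen and the distance kept as a match-quality score. The function names, field names, the age scale of 50 years and the toy records are all assumptions for illustration. The point is only that a continuous score makes the degree of mismatch visible, which exact-bin matching does not.

```python
def mismatch(recipient, donor, numeric, categorical, scales):
    """Gower-style distance: mean of per-attribute mismatches, each in [0, 1]."""
    parts = []
    for k in numeric:
        # Scaled absolute difference, capped at 1.
        parts.append(min(abs(recipient[k] - donor[k]) / scales[k], 1.0))
    for k in categorical:
        parts.append(0.0 if recipient[k] == donor[k] else 1.0)
    return sum(parts) / len(parts)


def nearest_donor(recipient, donors, numeric, categorical, scales):
    """Return the closest donor and its distance (lower means a better match)."""
    scored = [
        (mismatch(recipient, d, numeric, categorical, scales), i)
        for i, d in enumerate(donors)
    ]
    best_score, best_index = min(scored)
    return donors[best_index], best_score


# Toy records, invented for illustration only.
donors = [
    {"age": 30, "sex": "F", "pa": 12.0},
    {"age": 62, "sex": "M", "pa": 4.0},
]
recipient = {"age": 58, "sex": "M"}
donor, score = nearest_donor(recipient, donors, ["age"], ["sex"], {"age": 50})
```

This version is deterministic. To keep the stochastic element one could instead sample among the few closest donors, and the stored scores could be summarised to see which groups are poorly matched.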

AnnaGoodman1 commented 6 years ago

As a quick comment on propensity scores (@walkabilly):

I could see this working particularly well if you were trying to merge in just one variable or a small number of very closely related variables. This is like what we did in the second METAHIT match / in the Impacts of Cycling Tool, namely taking the National Travel Survey and merging in background physical activity from the Active People Survey.
It is less obvious to me what this would look like if one were trying to take individual-level data from one place and give each person a full past-week travel diary, i.e. what we did in the first METAHIT match by taking the census data and then merging in the National Travel Survey. In this case it is unclear to me what you would be modelling the propensity for: amount of travel? Pattern of travel? Purposes of travel? Modal split of travel?

Happily, the first case is what I think you are trying to do in ITHIM, so that sounds promising as something to try, and I'll be interested to hear where you get to. Good luck!

ITHIM / ITHIM-R

Synthetic population (stochastic matching of individuals) #22