ITHIM / ITHIM-R

Development of the ITHIM-R, also known as ITHIM version 3.0. Started in January 2018.
https://ithim.github.io/ITHIM-R/
GNU General Public License v3.0

Synthetic population (stochastic matching of individuals) #22

Open markotainio opened 6 years ago

markotainio commented 6 years ago

Discussion on how to create the so-called synthetic population that will be used in the rest of the calculations.

Synthetic refers here to micro (individual) level data where we have individuals with:

Typically, background physical activity (PA) and travel data are collected in different surveys, so to create individuals with both we need to match individuals across the two datasets into "synthetic" individuals. This is done by stochastic matching of individuals, based on some predefined attributes (e.g. age, sex, socioeconomic status, location).
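The stochastic matching described above can be sketched roughly as follows. This is a minimal illustration in Python (ITHIM-R itself is an R project), not the METAHIT or ITHIM-R implementation. The function name `stochastic_match`, the field names (`age_band`, `sex`, `cycle_min_week`, `background_pa`) and all records are hypothetical. It assumes an exact match on every attribute and draws one donor at random from the matching bin.

```python
import random
from collections import defaultdict


def stochastic_match(recipients, donors, keys, donor_fields, seed=None):
    """Give each recipient the donor_fields of one randomly drawn donor
    that has identical values on all matching keys."""
    rng = random.Random(seed)

    # Group donors into bins, one bin per combination of matching attributes.
    bins = defaultdict(list)
    for donor in donors:
        bins[tuple(donor[k] for k in keys)].append(donor)

    synthetic, unmatched = [], []
    for person in recipients:
        pool = bins.get(tuple(person[k] for k in keys))
        if not pool:
            # No donor shares this combination; report it instead of guessing.
            unmatched.append(person)
            continue
        donor = rng.choice(pool)
        synthetic.append({**person, **{f: donor[f] for f in donor_fields}})
    return synthetic, unmatched


# Hypothetical travel-survey records (recipients).
travel = [
    {"id": 1, "age_band": "16-44", "sex": "F", "cycle_min_week": 40},
    {"id": 2, "age_band": "45-64", "sex": "M", "cycle_min_week": 0},
    {"id": 3, "age_band": "65+", "sex": "F", "cycle_min_week": 10},
]
# Hypothetical physical-activity survey records (donors).
activity = [
    {"age_band": "16-44", "sex": "F", "background_pa": 12.5},
    {"age_band": "16-44", "sex": "F", "background_pa": 3.0},
    {"age_band": "45-64", "sex": "M", "background_pa": 8.0},
]

synthetic, unmatched = stochastic_match(
    travel, activity, keys=("age_band", "sex"),
    donor_fields=("background_pa",), seed=1,
)
```

Returning the unmatched individuals separately makes empty bins visible, which becomes more likely as more matching attributes are added.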

Attached to this message is a description by @AnnaGoodman1 of how she created a synthetic population for the METAHIT project using three English datasets:

Also attached are a sample showing how the data look for ten individuals, and the variable descriptions (note that the sample doesn't include all the variables).

The idea for ITHIM-R would be to include this stochastic sampling inside ITHIM-R itself, so that the user would only need to provide individual-level datasets and ITHIM-R would create the synthetic population from them.

0.1a_SPpreparation_Overview_180206.docx

180219_Metahit10000_v3_sample.xlsx variables_v2

gotom22 commented 6 years ago

Thinking out loud, I identify 3 key questions:

  1. Matching/translating of user input data with generic attribute definitions used by the model
  2. How to handle gaps in input data provided by user
  3. "Hierarchy of priorities in how the attributes define the synth. pop" (I try to explain below)

Re 1. I would suggest translating the input into "generic domains of (synth.) population attributes", which should help guide data inputs (and processing), even though, as in the above example, attributes across different domains will come from the same sources, such as the census.

The main reason for this is that eventually the tool will have to handle input data at different levels of detail, or variables with slightly different definitions when they come from different surveys. So we need a mapping exercise similar to the one discussed for mode definitions (#14), which will match "attribute data provided by user" with the "generic attribute definition" used by the model.

For example:

...I imagine the above structure would then be reflected in the input data module, in synth. pop. generation, in exposure estimation, etc., and eventually also in the flow chart, a bit like for input data here
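The mapping between "attribute data provided by user" and "generic attribute definition" could look something like the sketch below. It is written in Python purely for illustration (ITHIM-R itself is an R project). The generic attribute names (`age_band`, `sex`), the age bands, the user column names (`AGE`, `SEX`) and the codes are all hypothetical assumptions, not definitions agreed in this thread.

```python
def age_to_band(age):
    """Collapse age in years into hypothetical generic age bands."""
    if age < 16:
        return "0-15"
    if age < 45:
        return "16-44"
    if age < 65:
        return "45-64"
    return "65+"


# Hypothetical survey-specific codes translated to generic categories.
SEX_CODES = {"1": "male", "2": "female", "M": "male", "F": "female"}

# Generic attribute name -> (column name in the user's data, converter).
USER_MAPPING = {
    "age_band": ("AGE", age_to_band),
    "sex": ("SEX", lambda code: SEX_CODES[str(code)]),
}


def harmonise(record, mapping):
    """Translate one user record into the generic attribute definitions.
    Attributes missing from the user's data are kept as None so that
    gaps stay visible to later processing steps."""
    generic = {}
    for name, (column, convert) in mapping.items():
        raw = record.get(column)
        generic[name] = convert(raw) if raw is not None else None
    return generic


full_record = harmonise({"AGE": 37, "SEX": "2"}, USER_MAPPING)
gappy_record = harmonise({"AGE": 70}, USER_MAPPING)
```

Keeping the mapping as data rather than code means each user dataset would only need its own mapping table, while the matching step works on the generic attributes.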

Re 2. What if the user does not have:

Re 3. @AnnaGoodman1 's documentation for METAHIT is helpful. Main questions:

gotom22 commented 6 years ago

Do we have a nice, short write-up for the objective of creating a synthetic population?

"Comprehensive health impact assessment requires rich data on various attributes of the assessed population (i.e. socio-demographics, travel behavior, exposures, health). Because such data is typically not available from a single survey, data from multiple sources are merged. Since these data are not from the same individuals (i.e. different survey participants), they are matched by important attributes of participants, like age and sex. The resulting data set is a best possible representation of the local population including all the attributes collected by different surveys. The data records are synthetic because they combine information from multiple persons and assign it to a single, synthetic individual."

(replace as you see fit)

gotom22 commented 6 years ago

some notes from call:

walkabilly commented 6 years ago

Great work @AnnaGoodman1, definitely a challenging problem.

I also think we need to consider possible alternative methods for the matching itself. The current method, which randomly selects and matches one person from the bin of people who match exactly on the chosen characteristics, is sub-optimal in my opinion for at least 2 reasons:

  1. The matching has the potential to generate misclassification, with some people being well matched and other people being poorly matched on key variables that will be modelled later. This approach also makes it hard to explore and understand the extent of the misclassification between participants.

  2. Given that users will have different data, both in terms of variables and in terms of categories within the same variable, all of which will need to be matched between datasets, creating a generalizable approach using the current matching method will be challenging.

I think exploring propensity score (or other statistical) matching methods could overcome some (hopefully many) of the challenges with this.
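One possible reading of propensity score matching in this setting is sketched below, in Python for illustration only (ITHIM-R itself is an R project). It assumes the score is the modelled probability that a record belongs to the recipient survey rather than the donor survey, given the shared attributes, and it matches each recipient to the donor with the nearest score. The tiny gradient-descent logistic regression, the function names, the field names (`age_scaled`, `female`, `background_pa`) and the records are all hypothetical; real use would rely on a tested statistical library.

```python
import math
import random


def predict(weights, x):
    """Logistic model: probability that a record comes from the recipient survey."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit intercept and coefficients by plain batch gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, target in zip(X, y):
            err = predict(weights, x) - target
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def to_rows(records, features):
    """Extract the shared attributes as numeric feature vectors."""
    return [[r[f] for f in features] for r in records]


def propensity_match(recipients, donors, features, donor_field, seed=None):
    rng = random.Random(seed)
    X = to_rows(recipients, features) + to_rows(donors, features)
    y = [1] * len(recipients) + [0] * len(donors)
    weights = fit_logistic(X, y)
    donor_scores = [predict(weights, x) for x in to_rows(donors, features)]

    matched = []
    for person, x in zip(recipients, to_rows(recipients, features)):
        score = predict(weights, x)
        gaps = [abs(score - s) for s in donor_scores]
        best = min(gaps)
        # Break ties between equally close donors at random.
        candidates = [d for d, g in zip(donors, gaps) if g - best < 1e-12]
        donor = rng.choice(candidates)
        matched.append({**person, donor_field: donor[donor_field],
                        "score_gap": best})
    return matched


# Hypothetical records; age is scaled to 0-1 so gradient descent behaves.
recipients = [
    {"id": 1, "age_scaled": 0.25, "female": 1},
    {"id": 2, "age_scaled": 0.60, "female": 0},
    {"id": 3, "age_scaled": 0.80, "female": 1},
]
donors = [
    {"age_scaled": 0.30, "female": 1, "background_pa": 12.5},
    {"age_scaled": 0.55, "female": 0, "background_pa": 8.0},
    {"age_scaled": 0.70, "female": 0, "background_pa": 4.0},
    {"age_scaled": 0.75, "female": 1, "background_pa": 2.5},
]

matched = propensity_match(
    recipients, donors, features=("age_scaled", "female"),
    donor_field="background_pa", seed=1,
)
```

Unlike exact-bin matching, every recipient gets a donor here, and the recorded `score_gap` gives a per-person measure of how close the match was, which could help explore the misclassification concern raised in point 1.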

AnnaGoodman1 commented 6 years ago

As a quick comment on propensity scores (@walkabilly):

  1. I could see this working particularly well if you were trying to merge in just one variable or a small number of very related variables. This is like what we did in the second METAHIT match / in the Impacts of Cycling Tool, namely taking the National Travel Survey and merging in background physical activity from the Active People Survey.
  2. It is less obvious to me what this would look like if one were trying to take individual-level data from one place and give them a full past-week travel diary, i.e. what we did in the first METAHIT match by taking the census data and then merging in the National Travel Survey. In this case it is less clear to me what you would be looking at propensity for: amount of travel? Pattern of travel? Purposes of travel? Modal split of travel?

Happily, the first case is what I think you are trying to do in ITHIM, so that sounds promising as something to try, and I'll be interested to hear where you get to. Good luck!