Improve matching approach

Hussein-Mahfouz commented 3 months ago

The current approach to matching the SPC to the NTS is:

Categorical matching (exact join) to match households
Propensity score matching to match individuals within households

Categorical matching is inflexible, and some households in the SPC don't have any exact matches in the NTS (see here for matching results). It would be better to do propensity score matching at the household level from the beginning. This would ensure that each household in the SPC is matched to at least one household in the NTS.

Tools

The MatchIt R package is very comprehensive. It has different matching algorithms, and also allows you to specify a different caliper for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want the household size to match exactly, but be more forgiving on household income).

I didn't find a Python library that has the same functionality as MatchIt. In psmpy, you can only provide one caliper based on the overall distance.
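To make the idea of per-covariate constraints concrete, here is a minimal hand-rolled sketch in plain Python. It is not MatchIt or psmpy, and it does not estimate a propensity score: it only shows exact matching on some columns combined with a separate caliper (maximum allowed absolute difference, in raw units) on others. The function name `nearest_within_calipers` and the column names `hid`, `size` and `income` are made up for illustration, and each caliper is assumed to be positive.

```python
def nearest_within_calipers(target, candidates, exact, calipers, id_col="hid"):
    """Return the id of the closest candidate that satisfies every constraint, or None.

    exact:    columns that must be identical
    calipers: {column: maximum allowed absolute difference} (widths must be > 0)
    """
    best_id, best_dist = None, float("inf")
    for cand in candidates:
        # Strict covariates: discard the candidate on any mismatch.
        if any(cand[c] != target[c] for c in exact):
            continue
        diffs = {c: abs(cand[c] - target[c]) for c in calipers}
        # Forgiving covariates: discard only if outside that column's caliper.
        if any(diffs[c] > width for c, width in calipers.items()):
            continue
        # Scale each difference by its caliper so columns are comparable.
        dist = sum(diffs[c] / width for c, width in calipers.items())
        if dist < best_dist:
            best_id, best_dist = cand[id_col], dist
    return best_id


# Hypothetical NTS households
nts_households = [
    {"hid": "n1", "size": 2, "income": 34000},
    {"hid": "n2", "size": 3, "income": 30000},
    {"hid": "n3", "size": 2, "income": 45000},
]

spc_household = {"hid": "s1", "size": 2, "income": 30000}
matched = nearest_within_calipers(
    spc_household, nts_households, exact=["size"], calipers={"income": 5000}
)

# No NTS household has size 4, so the exact constraint leaves no candidates.
unmatched = nearest_within_calipers(
    {"hid": "s2", "size": 4, "income": 30000},
    nts_households,
    exact=["size"],
    calipers={"income": 5000},
)
```

Here household size must match exactly while income may differ by up to 5000, so `s1` is paired with `n1`; `n2` fails on size and `n3` falls outside the income caliper.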

Hussein-Mahfouz commented 2 months ago

Another solution is to follow the SPC approach: do categorical matching iteratively, and after each iteration relax the constraints slightly. This should result in better matching at the household level. Example:

Round 1: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification
Round 2: Household income | Number of adults | Number of children | Employment status | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification
Round 3: ~~Household income~~ | Number of adults | Number of children | Employment status | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification
Round 4: ~~Household income~~ | Number of adults | Number of children | ~~Employment status~~ | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification

This should provide better results than the current match_categorical implementation, where we only match once and have to sacrifice some variables to improve matching (as shown here).

In the final round, all households that are yet to be matched can be matched either randomly, or to a household with values close to the mean

Additional arguments to pass to match_categorical:

optional_columns: these are the columns that we can relax. It could be an ordered list, and at each iteration, we remove the last column from the list before matching
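The iterative relaxation described above could be sketched as follows. This is a plain-Python illustration, not the existing match_categorical implementation: the function name `match_iteratively`, the `required_columns` argument, and the column names in the example are assumptions. Households are represented as dictionaries; `optional_columns` is ordered from most to least important, and the last entry is dropped after each round.

```python
from collections import defaultdict


def match_iteratively(spc, nts, required_columns, optional_columns, id_col="hid"):
    """Exact-match SPC households to NTS households, relaxing one optional column per round.

    Returns (matches, unmatched_ids), where matches maps each SPC id to
    (list of matching NTS ids, round in which the match was found).
    """
    matches = {}
    unmatched = list(spc)
    optional = list(optional_columns)
    round_no = 1
    while unmatched:
        keys = list(required_columns) + optional
        # Index NTS households by their values on the current key columns.
        index = defaultdict(list)
        for row in nts:
            index[tuple(row[k] for k in keys)].append(row[id_col])
        still_unmatched = []
        for row in unmatched:
            candidates = index.get(tuple(row[k] for k in keys))
            if candidates:
                matches[row[id_col]] = (candidates, round_no)
            else:
                still_unmatched.append(row)
        unmatched = still_unmatched
        if not optional:
            break  # nothing left to relax
        optional.pop()  # drop the last (least important) optional column
        round_no += 1
    return matches, [row[id_col] for row in unmatched]


# Hypothetical data
spc = [
    {"hid": "s1", "n_adults": 2, "car": 1, "income": "high", "tenancy": "owned"},
    {"hid": "s2", "n_adults": 2, "car": 1, "income": "high", "tenancy": "rented"},
    {"hid": "s3", "n_adults": 1, "car": 0, "income": "low", "tenancy": "rented"},
    {"hid": "s4", "n_adults": 3, "car": 2, "income": "low", "tenancy": "owned"},
]
nts = [
    {"hid": "n1", "n_adults": 2, "car": 1, "income": "high", "tenancy": "owned"},
    {"hid": "n2", "n_adults": 1, "car": 0, "income": "mid", "tenancy": "owned"},
]

matches, leftover = match_iteratively(
    spc, nts, required_columns=["n_adults", "car"], optional_columns=["income", "tenancy"]
)
```

In this example `s1` matches in round 1, `s2` once tenancy is dropped, and `s3` once income is dropped as well. `s4` is returned in `leftover`, which is where the final-round fallback (random match, or a household close to the mean) would apply.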

Hussein-Mahfouz commented 2 months ago

@sgreenbury the statistical matching approach I mentioned is in the ile-de-france project: link. I haven't tried to use the pipeline in the ile de france project, but it's well documented. Maybe it's something we should explore

Hussein-Mahfouz commented 1 month ago

I've just found the following description in this paper:

In the next step, each sampled individual is then matched to an observation from the household travel survey, using hot-deck matching ((27), (17)). The idea is to find all source observations (i.e. all samples from the household travel survey) that match the target observations (i.e. synthetic agents previously sampled from the census) on a list of given matching attributes, and then to sample randomly one of those source observations. To avoid over-fitting, if too few source observations are found for a given target observation, some matching attributes are removed to enhance the set of matching source observations.

I like the step taken to avoid overfitting. They do statistical matching, but the same step can also be applied to categorical matching at the household level: we would set a threshold for the minimum number of matches, and drop matching attributes whenever a household falls below it.
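A rough sketch of that idea applied to household-level categorical matching is below. It is written from the quoted description only and is not taken from the paper's pipeline; `hot_deck_match`, `min_matches` and the column names are hypothetical. For each target household, attributes are dropped from the end of the list until at least `min_matches` source households agree on the remaining ones, and one of them is then sampled at random.

```python
import random


def hot_deck_match(targets, sources, attributes, min_matches, rng, id_col="hid"):
    """Map each target id to (sampled source id or None, attributes actually used)."""
    result = {}
    for target in targets:
        attrs = list(attributes)
        while True:
            candidates = [
                s[id_col] for s in sources if all(s[a] == target[a] for a in attrs)
            ]
            # Stop once the pool is large enough or nothing is left to relax.
            if len(candidates) >= min_matches or not attrs:
                break
            attrs.pop()  # remove the last (least important) matching attribute
        chosen = rng.choice(candidates) if candidates else None
        result[target[id_col]] = (chosen, tuple(attrs))
    return result


# Hypothetical data
sources = [
    {"hid": "n1", "size": 2, "car": 1, "income": "high"},
    {"hid": "n2", "size": 2, "car": 1, "income": "low"},
    {"hid": "n3", "size": 2, "car": 0, "income": "high"},
    {"hid": "n4", "size": 1, "car": 0, "income": "low"},
]
targets = [
    {"hid": "s1", "size": 2, "car": 1, "income": "high"},
    {"hid": "s2", "size": 1, "car": 0, "income": "low"},
]

result = hot_deck_match(
    targets, sources, ["size", "car", "income"], min_matches=2, rng=random.Random(0)
)
```

With `min_matches=2`, `s1` has only one exact match on all three attributes, so income is dropped and it is sampled from `n1` and `n2`. `s2` never reaches two candidates, so every attribute is removed and it is drawn from all source households, which is equivalent to a random match.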

Urban-Analytics-Technology-Platform / acbm

Improve matching approach #13
