Test smaller / different geographic features

ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)

GNU Affero General Public License v3.0


Currently, the residential AVM relies on a combination of township, neighborhood, and lat/lon to determine the value of location. These features tend to be among the most important in the model. However, they are not always well-defined or relevant to price. Neighborhoods can be too large (or too small), and township boundaries are mostly arbitrary. We should test some smaller geographic units as location features:

- Census tracts + PUMA
- Census block groups + PUMA
- Sidwell ID AKA PIN, broken out by block or similar
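
To make the PIN option concrete, here is a minimal Python sketch of breaking a PIN into coarser geographic keys. It assumes the usual 14-digit Cook County PIN layout (2-digit area, 2-digit subarea, 3-digit block, 3-digit parcel, 4-digit unit). The function name `pin_geo_keys` and the sample PIN are made up for illustration and are not taken from the model code.

```python
def pin_geo_keys(pin: str) -> dict[str, str]:
    """Split a 14-digit PIN into nested geographic prefixes.

    Assumed layout: AA-SS-BBB-PPP-UUUU
    (area, subarea, block, parcel, unit).
    """
    digits = pin.replace("-", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError(f"expected a 14-digit PIN, got {pin!r}")
    return {
        "area": digits[:2],     # first 2 digits
        "subarea": digits[:4],  # first 4 digits
        "block": digits[:7],    # first 7 digits
    }


# Hypothetical PIN, used only to show the output shape.
keys = pin_geo_keys("17-32-123-045-0000")
print(keys)
```

Each prefix nests inside the previous one, so any of the three levels could be tried as a candidate location feature.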

One interesting thing to try: currently the model doesn't have a way to measure neighborhood proximity, i.e. it doesn't know that two neighborhoods are close together. If we numerically order the neighborhoods and treat them as numeric, rather than categorical, predictors, then the model might be able to group neighborhoods by their relative proximity.
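
A rough Python sketch of that idea follows. It assumes neighborhood codes are numeric strings whose sort order roughly tracks spatial proximity, which is exactly the hypothesis being tested and may not hold. The codes and helper names (`ordinal_encode`, `split_by_threshold`) are hypothetical.

```python
def ordinal_encode(codes: list[str]) -> dict[str, int]:
    """Map each distinct code to its rank in sorted numeric order."""
    ordered = sorted(set(codes), key=int)
    return {code: rank for rank, code in enumerate(ordered)}


def split_by_threshold(encoding: dict[str, int], threshold: int):
    """Mimic a single tree split on the numeric feature (rank <= threshold)."""
    left = [code for code, rank in encoding.items() if rank <= threshold]
    right = [code for code, rank in encoding.items() if rank > threshold]
    return left, right


# Hypothetical neighborhood codes, with one duplicate.
nbhd_codes = ["10021", "10011", "10030", "10012", "10021"]
encoding = ordinal_encode(nbhd_codes)
left, right = split_by_threshold(encoding, threshold=1)
print(left, right)
```

With a numeric encoding, one split separates a contiguous run of codes from the rest. That only helps if adjacent codes really are adjacent on the map.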

We've tested a few different combinations of geographic features + treatments of geographic variables:

- Census tracts + PUMA (as categoricals)
  - Commit: 3f2742dae5d7334b2361252dea0f478c3207e388
  - Run ID: 2024-01-16-cranky-sam
  - Result: No improvements
- Census tracts + PUMA (tracts as numeric)
  - Commit: c13401b6d7bd51f259e1f01f7d2c7feca6f5908c
  - Run ID: 2024-01-17-nostalgic-christian
  - Result: Very minor improvement in most stats (<1%)
- All high cardinality features as numerics
  - Commit: 6cd5ff51b0d716cc07edbbf3bcc3417f7e37624a
  - Run ID: 2024-01-17-eager-boni
  - Result: No improvements
- Sidwell ID AKA PIN, broken out by block or similar
  - Commit: overrode with force push, oops
  - Run ID: 2024-01-17-practical-billy
  - Result: Worse than standard location predictors

It seems like changing these predictors is mostly tinkering around the margins. Notably, removing or changing one location predictor causes others to "pick up the slack." For instance, removing all location features makes median income (a proxy for location) become more important. As such, I don't think the ROI on these changes is high enough to pursue them further, and I'm closing this issue.
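
The "pick up the slack" effect can be illustrated with a toy example. This is a sketch on purely synthetic data, not output from the model: the coefficients, noise levels, and variable names are all invented, and a single-variable linear fit stands in for the real model.

```python
import random


def r_squared(x: list[float], y: list[float]) -> float:
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: price is driven by a latent location score,
# and income is a noisy proxy for that same score.
location = [rng.uniform(0, 10) for _ in range(500)]
income = [loc + rng.gauss(0, 1) for loc in location]
price = [100 + 50 * loc + rng.gauss(0, 5) for loc in location]

r2_location = r_squared(location, price)
r2_income = r_squared(income, price)
print(f"location R^2: {r2_location:.3f}, income-only R^2: {r2_income:.3f}")
```

In this setup, the income proxy alone recovers most of the variance that the location score explains, which is the same pattern described above: drop the direct location feature and a correlated one absorbs much of its signal.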

Test smaller / different geographic features #166