ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
GNU Affero General Public License v3.0

Create price point / strata model feature #196

Closed dfsnow closed 8 months ago

dfsnow commented 8 months ago

This PR adds a "price point" or "market strata" model feature based on a property's prior year values. The goal of this feature is to roughly capture where a property lies in the distribution of price and to give the model a "hint" or starting point for prediction.

The feature is constructed by first determining the "strata price," the most up-to-date value available for a given property. The strata price is the first available value from the following list, in descending order of preference:

  1. The maximum sale price in the last N years excluding the current sale (in the case of training data)
  2. The prior year's BoR certified market value
  3. The prior year's Assessor certified market value
  4. The 2 year prior BoR certified market value

The resulting strata price is then binned into N-tiles within each township and year (here N is the number of bins, not the N-year sale lookback in item 1). The binned N-tile is passed to the model as a categorical feature.
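
To make the construction concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the field names (`max_prior_sale`, `bor_cert_lag1`, `assessor_cert_lag1`, `bor_cert_lag2`), the helper names, and the rank-based binning rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Value sources in descending order of preference.
# These field names are hypothetical, not the model's real column names.
PREFERENCE = (
    "max_prior_sale",      # max sale price in the last N years, excluding the current sale
    "bor_cert_lag1",       # prior year's BoR certified market value
    "assessor_cert_lag1",  # prior year's Assessor certified market value
    "bor_cert_lag2",       # 2 year prior BoR certified market value
)


def strata_price(prop):
    """Return the first non-missing value in preference order, else None."""
    for key in PREFERENCE:
        value = prop.get(key)
        if value is not None:
            return value
    return None


def assign_strata(props, n_bins):
    """Bin strata prices into n_bins rank-based groups per (township, year)."""
    groups = defaultdict(list)
    for prop in props:
        price = strata_price(prop)
        if price is not None:
            groups[(prop["township"], prop["year"])].append((price, prop["pin"]))

    strata = {}
    for members in groups.values():
        members.sort()  # ties in price are broken by pin, an arbitrary choice
        for rank, (_, pin) in enumerate(members):
            # Bins run from 1 (lowest prices) to n_bins (highest prices).
            strata[pin] = rank * n_bins // len(members) + 1
    return strata


base = {"township": "T1", "year": 2023}
props = [
    {**base, "pin": "a", "max_prior_sale": 300_000, "bor_cert_lag1": 250_000},
    {**base, "pin": "b", "bor_cert_lag1": 200_000, "assessor_cert_lag1": 190_000},
    {**base, "pin": "c", "assessor_cert_lag1": 400_000, "bor_cert_lag2": 390_000},
    {**base, "pin": "d", "bor_cert_lag2": 100_000},
    {**base, "pin": "e", "max_prior_sale": 150_000},
    {**base, "pin": "f"},  # no value available, so no stratum is assigned
]
strata = assign_strata(props, n_bins=2)
```

In this sketch a property with no available value simply gets no stratum; how the real feature handles that case is not stated in the PR.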

Pros:

Cons:

Closes #160.

CC @ccao-jardine

dfsnow commented 8 months ago

Constructing a price strata feature OR a lagged price feature doesn't work in this model.

To explain why, I'll focus on the lag price variant of the feature, since the strata feature is just a binned version of the lag price. The lag price feature is equal to whichever of the following is first available:

  1. The maximum sale price in the last N years, excluding the current sale
  2. The prior year's BoR certified market value
  3. The prior year's Assessor certified market value
  4. The 2 year prior BoR certified market value

This construction is intended to act like an autoregressive feature in a time series model, i.e. we believe the predicted price is (to some extent) dependent on the prior price, so the prior price must be included as a feature.

This would work if all properties had sales. However, things get tricky when we need to construct the feature for the assessment set (the universe of all properties). There are two major problems:

  1. Some properties have sales and others do not. Properties without sales use the prior assessed value to construct the lag price. However, since assessments occur only every 3 years, that value can be up to 3 years stale, so (when prices have risen since the last reassessment) properties without sales will have a mechanically lower lag price than those with sales. The result is that sold and unsold properties are treated differently.
  2. The construction of the lag price feature includes sales from the prior 4 years in both the assessment set and the training set. However, in the assessment set the feature is constructed looking back from the lien date rather than the date of a real sale. In this case, because the lien date is Jan 1, 2024, we look back at all sales from 2020 through 2023. Because assessment set performance is measured using 2023 sales, we are effectively including the outcome as a predictor. That's no good.
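
The second problem can be shown with a toy example. This is a sketch under stated assumptions, not the model's actual code: `lag_sale_price` is a hypothetical helper, and the sale dates and prices are invented. It computes the sale-based lag price for one property twice, once the way the training set does (looking back from the sale date and excluding the current sale) and once the way the assessment set does (looking back from the lien date).

```python
from datetime import date

LOOKBACK_YEARS = 4  # the comment above refers to sales from the prior 4 years


def lag_sale_price(sales, as_of, exclude=None):
    """Max sale price in the LOOKBACK_YEARS before `as_of`, skipping `exclude`.

    Hypothetical helper: `sales` is a list of (sale_date, price) tuples.
    """
    start = as_of.replace(year=as_of.year - LOOKBACK_YEARS)
    prices = [
        price
        for sale_date, price in sales
        if start <= sale_date < as_of and (sale_date, price) != exclude
    ]
    return max(prices, default=None)


# Invented sales history for a single property.
sales = [(date(2021, 6, 1), 250_000), (date(2023, 5, 15), 320_000)]
target_sale = sales[1]  # the 2023 sale used to measure performance

# Training set: look back from the sale itself and exclude it.
train_feature = lag_sale_price(sales, as_of=target_sale[0], exclude=target_sale)

# Assessment set: look back from the lien date, with no sale to exclude.
lien_date = date(2024, 1, 1)
assess_feature = lag_sale_price(sales, as_of=lien_date)

# True when the assessment-set feature equals the price being predicted.
leaks_outcome = assess_feature == target_sale[1]
```

Here the training-set feature is the 2021 price, while the assessment-set feature is the 2023 sale price itself, so the feature carries the outcome that the 2023 evaluation is trying to predict.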

So, scrapping this feature for now. I think in the future we should spend some time considering a better way to construct some kind of autoregressive feature.