ccao-data / model-res-avm

Automated valuation model for all class 200 residential properties in Cook County (except vacant land and condos)
GNU Affero General Public License v3.0

Test dedicated missingness imputation for main model #162

Closed: dfsnow closed 5 months ago

dfsnow commented 5 months ago

Currently, missing values in the primary model's training data are handled by LightGBM's native missing-value support. However, we should try to handle missingness more intelligently, taking the specifics of each feature into account. The recipes package includes a number of imputation steps we can use, some of which are already tested and working in the linear model's recipe.
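
For illustration, here is a minimal sketch of what a per-feature imputation recipe could look like. The toy data frame and its column names (sale_price, bldg_sf, ext_wall) are made up for this example and are not the model's actual features, and the choice of median/mode imputation is an assumption, not something this repo already does. The step functions themselves come from recipes.

```r
library(recipes)

# Toy training data. Column names are hypothetical, NOT the model's features
train <- data.frame(
  sale_price = c(250000, 310000, 180000, 420000, 275000),
  bldg_sf    = c(1200, NA, 950, 2100, 1500),
  ext_wall   = factor(c("brick", "frame", NA, "brick", "brick"))
)

rec <- recipe(sale_price ~ ., data = train) |>
  # Record where bldg_sf was missing before filling it in, so the
  # model can still learn from the missingness pattern itself
  step_indicate_na(bldg_sf) |>
  # Numeric feature: fill with the training-set median
  step_impute_median(bldg_sf) |>
  # Categorical feature: fill with the most common training-set level
  step_impute_mode(ext_wall)

# Estimate the imputation values on the training data, then apply them
baked <- bake(prep(rec, training = train), new_data = NULL)
print(baked)
```

The indicator step comes before the imputation step on purpose: once bldg_sf has been filled in, there is nothing left to flag. Swapping in step_impute_knn() or step_impute_bag() would follow the same pattern, at a much higher compute cost.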

dfsnow commented 5 months ago

I tested this a bunch locally using different flavors of the recipes::step_impute_*() functions. TL;DR: it doesn't make much difference for our outcomes. We don't have much missingness to begin with, and LightGBM seems to do a fine job of handling it natively.

The only things I was unable to test were the more advanced imputation strategies, such as bagging and KNN. Each of them takes absolutely forever to run, even on a beefy m4/m5 AWS instance.

I'd say this is worth revisiting in the future, but it probably won't have a big immediate impact on model outcomes.