Closed: dfsnow closed this 5 months ago
I tested this a bunch locally using the various flavors of the `recipes::step_impute_*` functions. TL;DR, it doesn't make much difference for our outcomes. We don't have much missingness to begin with, and LightGBM seems to do a fine job of handling it natively.
The one thing I was unable to test was the more advanced imputation strategies, such as bagging and KNN. Each of them takes absolutely forever to run, even on a beefy m4/m5 AWS instance.
I'd say this is worth revisiting in the future, but probably won't have a big immediate impact on model outcomes.
Currently, missingness in the training data for the primary model is dealt with via LightGBM's native handling. However, we should try to deal with missingness more intelligently, taking the specifics of each feature into account.
The recipes package includes a number of steps we can use for imputation, some of which are already tested and working in the linear model's recipe.
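For reference, a minimal sketch of what a per-feature imputation recipe could look like. The data frame and column names here are illustrative placeholders, not the actual model pipeline; the step functions themselves (`step_impute_median`, `step_impute_mode`, and the slower `step_impute_knn`/`step_impute_bag`) are the real recipes API.

```r
library(recipes)
library(magrittr)  # for %>%

# Toy training frame with missing values in a numeric and a nominal column
train <- data.frame(
  sale_price = c(100, 200, 300, 400),
  sqft       = c(1000, NA, 1500, 2000),
  class      = factor(c("a", NA, "b", "a"))
)

# Cheap, per-feature-type imputation steps
rec <- recipe(sale_price ~ ., data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%  # numeric: column median
  step_impute_mode(all_nominal_predictors())        # nominal: most frequent level

# The heavier strategies that were too slow to test fully:
#   step_impute_knn(all_predictors(), neighbors = 5)
#   step_impute_bag(all_predictors())

baked <- bake(prep(rec, training = train), new_data = NULL)
```

After `bake()`, the missing `sqft` value is filled with the median of the observed values (1500 here) and the missing `class` level with the modal level, leaving no NAs for downstream steps.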