Closed by @angela97lin 2 years ago
Notes from discussion with @bchen1116 yesterday
Status: in progress, running perf tests. Numeric features are working; categoricals will take more thought. Perf test results indicate that using our Imputer is beneficial on some datasets, while relying on xgboost/catboost/lightgbm's native missing-value handling is beneficial on others. This is good news: if our automl algorithm can try both, we can get a performance boost on some datasets with missing values.
Open questions:
Next steps
Closing with no follow-up needed after discussion with @chukarsten! Performance results suggested this wouldn't yield noticeable benefits.
TIL that XGBoost, CatBoost, and LightGBM estimators can handle NaN values (thanks, @rpeck)! Right now, we automatically add an Imputer to every pipeline in AutoMLSearch. It could be interesting to compare how these estimators perform with and without imputation.
https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing
https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#missing-value-handle