Closed by @angela97lin 2 years ago
Notes from discussion with @bchen1116 yesterday
Status: in progress, running perf tests. Numeric features are working; categoricals will take more thought. Perf test results indicate that using our Imputer is beneficial on some datasets, while relying on xgboost/catboost/lightgbm's native missing-value handling is beneficial on others. This is good news: if our automl algorithm can try both, we can get a performance boost on some datasets with missing values.
Open questions:
Next steps
Closing with no follow-up needed after discussion with @chukarsten! Performance results suggested this wouldn't yield noticeable benefits.
TIL that XGBoost, CatBoost, and LightGBM estimators can handle NaN values (thanks, @rpeck)! Right now, we automatically add an Imputer to every pipeline in AutoMLSearch. It could be interesting to compare how these estimators perform with and without imputation.
https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing
https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#missing-value-handle