alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License
772 stars, 86 forks

Run XGBoost, CatBoost, and LightGBM without imputation #2981

Closed angela97lin closed 2 years ago

angela97lin commented 2 years ago

TIL that XGBoost, CatBoost, and LightGBM estimators can handle NaN values (thanks, @rpeck)! Right now, we automatically add an Imputer to every pipeline in AutoMLSearch. It could be interesting to compare how these estimators perform with and without imputation.

https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing
https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#missing-value-handle
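For anyone wondering how a tree model can split on a feature that contains NaN, here is a toy sketch in the spirit of the learned default direction described in the XGBoost FAQ linked above. It is not how any of the three libraries is actually implemented: it uses a single decision stump with Gini impurity instead of gradient-based gain, and all function names are made up for illustration. The stump tries sending the missing rows left and then right at each threshold and keeps whichever is better, and the result is compared against a stump fit after mean imputation.

```python
import math

NAN = float("nan")


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values."""
    ordered = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]


def stump_with_default_direction(xs, ys):
    """Best (impurity, threshold, direction), routing NaN rows left or right."""
    present = [(x, y) for x, y in zip(xs, ys) if not math.isnan(x)]
    missing = [y for x, y in zip(xs, ys) if math.isnan(x)]
    best = None
    for t in candidate_thresholds([x for x, _ in present]):
        left = [y for x, y in present if x < t]
        right = [y for x, y in present if x >= t]
        for direction in ("left", "right"):
            l = left + missing if direction == "left" else left
            r = right + missing if direction == "right" else right
            score = split_impurity(l, r)
            if best is None or score < best[0]:
                best = (score, t, direction)
    return best


def stump_after_mean_imputation(xs, ys):
    """Best (impurity, threshold) after replacing NaN with the observed mean."""
    observed = [x for x in xs if not math.isnan(x)]
    mean = sum(observed) / len(observed)
    filled = [mean if math.isnan(x) else x for x in xs]
    best = None
    for t in candidate_thresholds(filled):
        left = [y for x, y in zip(filled, ys) if x < t]
        right = [y for x, y in zip(filled, ys) if x >= t]
        score = split_impurity(left, right)
        if best is None or score < best[0]:
            best = (score, t)
    return best


# Constructed toy data: missingness itself is predictive of class 1.
xs = [1.0, 2.0, 3.0, 4.0, 10.0, NAN, NAN]
ys = [0, 0, 0, 0, 1, 1, 1]

native = stump_with_default_direction(xs, ys)
imputed = stump_after_mean_imputation(xs, ys)
print("default direction:", native)
print("mean imputation:  ", imputed)
```

On this constructed data the default-direction stump separates the classes with zero impurity, while the mean-imputed stump cannot, because the mean (4.0) collides with an observed value from the other class. That only shows the mechanism; which approach wins on real datasets is exactly what would need to be measured.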

dsherry commented 2 years ago

Notes from discussion with @bchen1116 yesterday

Status: in progress, running perf tests. Things are working for numeric features; categoricals will take more thought. Perf test results indicate that our imputer is beneficial on some datasets, while relying on XGBoost/CatBoost/LightGBM to handle missing values natively is beneficial on others. This is good news: if we can write our AutoML algorithm to run both, we can get a performance boost on some datasets with missing values.
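A rough sketch of what "run both" could look like, written as plain Python rather than against the EvalML API: every estimator gets a pipeline variant with the imputer, the NaN-tolerant ones also get a variant without it, and the search keeps whichever variant scores best. `build_candidates`, `pick_best`, and the scoring callback are hypothetical names, and the scores in the example are placeholders, not perf test results.

```python
from typing import Callable, List, Tuple

# Estimators named in this issue as able to handle NaN values natively.
NAN_TOLERANT = {"xgboost", "catboost", "lightgbm"}

# A candidate is (estimator name, whether an imputer is prepended).
Candidate = Tuple[str, bool]


def build_candidates(estimators: List[str]) -> List[Candidate]:
    """Always include the imputed variant; add a no-imputer variant
    only for estimators that tolerate NaN."""
    candidates: List[Candidate] = []
    for name in estimators:
        candidates.append((name, True))
        if name in NAN_TOLERANT:
            candidates.append((name, False))
    return candidates


def pick_best(
    candidates: List[Candidate],
    score_fn: Callable[[str, bool], float],
) -> Candidate:
    """Return the candidate with the highest score (higher is better)."""
    return max(candidates, key=lambda c: score_fn(*c))


# Placeholder scores for illustration only; a real search would
# cross-validate each pipeline variant on the dataset instead.
PLACEHOLDER_SCORES = {
    ("elastic_net", True): 0.70,
    ("xgboost", True): 0.80,
    ("xgboost", False): 0.82,
    ("lightgbm", True): 0.81,
    ("lightgbm", False): 0.79,
}


def dummy_score(name: str, use_imputer: bool) -> float:
    return PLACEHOLDER_SCORES[(name, use_imputer)]


candidates = build_candidates(["elastic_net", "xgboost", "lightgbm"])
best = pick_best(candidates, dummy_score)
print(candidates)
print("selected:", best)
```

The cost of this design is up to twice as many pipelines to fit for the NaN-tolerant estimators, so the gain on datasets with missing values has to justify the extra search time.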

Open questions:

Next steps

bchen1116 commented 2 years ago

Closing with no follow-up needed after discussion with @chukarsten! Performance results showed there likely wouldn't be noticeable benefits from this.