AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Optional support by XGBoost for easier GPU integration #77

Closed DavidHarar closed 1 year ago

DavidHarar commented 1 year ago

Hi, First, thank you very much for the great package. Using it lifted my model's KPIs. One issue, indirectly connected with miceforest, is that it uses LightGBM.
It seems that LightGBM has some compatibility issues with GPUs and requires a separate installation for GPU support (see here). I tried multiple times to install its GPU version without any success, so I ended up using only one or two iterations because of my large dataset. Is there an option to add compatibility with other boosters like XGBoost?

Thank you very much!

AnotherSamWilson commented 1 year ago

Unfortunately no. There is another issue to make the underlying imputation model more modular, but it is a deceptively complex problem. Each model has its own way of dealing with categories, SHAP values, how the data is encoded, etc. It would be a big project, but not an impossible one.

The underlying model actually used to be sklearn random forests, but they take up a huge amount of memory and are slower. Switching to lightgbm was a big ordeal.

HOWEVER, I would not give up on the lightgbm GPU training. It can be a pain, but what worked for me is installing the wheel version of the package, and simply passing device_type='gpu' into the parameters. I know that lightgbm has a bunch of crazy compilation steps that they show in the tutorial, but apparently the wheel comes pre-built with gpu abilities.

You can verify you are using the gpu version by running a small model in a separate script with the verbose parameters set to maximum. It will say something like: Using the gpu version of lightgbm! in the output.
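
A minimal sketch of that check, in case it helps. The helper names with_gpu and gpu_training_works are made up for illustration, as are the synthetic data and the small model settings; only device_type and verbosity are actual LightGBM parameters. It assumes lightgbm and numpy are installed, and returns None if they are not.

```python
def with_gpu(params=None, verbosity=2):
    """Return a copy of LightGBM params that requests GPU training and verbose logs.

    Hypothetical helper for illustration; it does not modify the input dict.
    """
    merged = dict(params or {})
    merged["device_type"] = "gpu"    # the setting mentioned in the comment above
    merged["verbosity"] = verbosity  # values above 1 request debug-level logging
    return merged


def gpu_training_works(n_rows=2000, n_cols=10, seed=0):
    """Train a tiny model with device_type='gpu'.

    Returns True if training ran, False if LightGBM raised an error,
    and None if lightgbm or numpy is not importable.
    """
    try:
        import numpy as np
        import lightgbm as lgb
    except ImportError:
        return None

    # Small synthetic regression problem, only used to trigger one training run.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_cols))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n_rows)

    params = with_gpu({"objective": "regression", "num_leaves": 15})
    try:
        lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
    except lgb.basic.LightGBMError as err:
        # Raised, for example, when the installed build has no GPU support.
        print(f"GPU training failed: {err}")
        return False
    return True
```

Calling gpu_training_works() in a separate script and reading the log should show whether the GPU learner was actually picked up. The exact wording of the log lines depends on the LightGBM version, so look for GPU-related messages rather than a specific string.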

AnotherSamWilson commented 1 year ago

@DavidHarar Curious if you got lightgbm up and running with gpu.