AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

GPU support #79

Closed ZSZYoung closed 1 year ago

ZSZYoung commented 1 year ago

I have built the GPU version of LightGBM following the guide here: Guide. The sample test code worked during training, but after installing miceforest from Anaconda, miceforest fails and tells me to rebuild LightGBM with GPU support. It seems that installing miceforest replaced the GPU build I had originally compiled. How can I solve this problem? Thanks.
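One quick way to check whether the LightGBM build currently installed still has GPU support is to train a tiny model with the device parameter set to "gpu"; a CPU-only wheel fails with an error asking you to recompile with GPU enabled. This is only a minimal sketch on random data, not part of the original report:

```python
import numpy as np
import lightgbm as lgb

# Tiny throwaway dataset; the point is only to exercise the GPU code path.
X = np.random.rand(1_000, 10)
y = np.random.rand(1_000)

params = {"objective": "regression", "device": "gpu", "verbose": -1}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)

# Reaching this line means the import resolved to a GPU-enabled build;
# a CPU-only build raises a LightGBMError during training instead.
print("GPU training succeeded, trees built:", booster.num_trees())
```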

AnotherSamWilson commented 1 year ago

Hmmm, I don't know why that would happen. Could you post the logs? miceforest only depends on lightgbm >= 3.3.1; if that requirement is already satisfied, the package manager shouldn't replace the version of lightgbm that is already installed.
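A hedged diagnostic for this kind of situation is to inspect which lightgbm the environment actually imports, since a pip or conda resolver can silently install its own wheel on top of a locally compiled build:

```python
import lightgbm

# If the resolver replaced the custom build, the version and the install
# location will point at a stock wheel rather than the compiled package.
print("lightgbm version :", lightgbm.__version__)
print("imported from    :", lightgbm.__file__)
```

If pip did pull in its own CPU-only wheel, one possible workaround is to install miceforest with pip's `--no-deps` flag and satisfy the lightgbm >= 3.3.1 requirement with the custom GPU build instead.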

ZSZYoung commented 1 year ago

Thank you for your valuable feedback, and I apologize for my delayed response. I have identified the issue, which arose due to a conflict between Mamba (Conda) and Pip.

However, I also noticed an unexpected performance disparity: using a GPU turned out to be slower than using a CPU for my task. The task involves imputing a dataset of 12000 rows and 300 columns; some of the features are dense, while others are sparse. It takes around 200 iterations for the means to converge, which takes a long time.

I'm uncertain whether the sluggish performance is linked to the dataset's relatively small size. I came across a note in the documentation that 'GPUs perform optimally on large, dense datasets. If a dataset is too small, GPU computation might be inefficient due to significant data transfer overhead.'

Do you have any ideas? Thank you sincerely.
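For reference, a sketch of how GPU parameters could be handed to the LightGBM models that miceforest trains, assuming mice() forwards extra keyword arguments to lightgbm as the documentation describes; the random data below is only a stand-in for the 12000 x 300 dataset mentioned above:

```python
import numpy as np
import pandas as pd
import miceforest as mf

# Placeholder data with roughly the shape described above (not the real dataset).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(12_000, 300)),
                    columns=[f"x{i}" for i in range(300)])
amputed = mf.ampute_data(data, perc=0.25, random_state=0)

kernel = mf.ImputationKernel(amputed, random_state=0)

# Extra keyword arguments are passed through to LightGBM, so swapping
# device between "gpu" and "cpu" allows a direct timing comparison.
kernel.mice(iterations=2, device="gpu")
completed = kernel.complete_data()
```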

AnotherSamWilson commented 1 year ago

12000 rows is a small dataset - according to the benchmarks I've seen around the internet, GPU training only becomes faster once you get into millions of rows.
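To find where that crossover lies on a particular machine, one rough approach (a sketch with illustrative sizes, not a claim about this dataset) is to time the same LightGBM fit on CPU and GPU as the row count grows:

```python
import time
import numpy as np
import lightgbm as lgb

# Time the same regression fit on CPU and GPU for increasing row counts.
for n_rows in (12_000, 120_000, 1_200_000):
    X = np.random.rand(n_rows, 300)
    y = np.random.rand(n_rows)
    for device in ("cpu", "gpu"):
        params = {"objective": "regression", "device": device, "verbose": -1}
        start = time.perf_counter()
        lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)
        print(f"{n_rows:>9} rows on {device}: {time.perf_counter() - start:.1f}s")
```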