AnotherSamWilson / miceforest

Multiple Imputation with LightGBM in Python
MIT License

Performance #75

Closed · sandrejev closed this 3 months ago

sandrejev commented 1 year ago

Hi, first of all, great packages, both miceRanger and miceforest, congratulations! I am imputing a large dataset (25000×6000) of proteomic data. miceRanger gives the best performance, and it is also much faster than many other algorithms, but I want to speed up the process since this will be part of a pipeline with more datasets to come. Right now I am using miceRanger in R, but I was wondering whether miceforest is faster (for whatever reason). Have you ever compared the two?

AnotherSamWilson commented 1 year ago

Thanks for looking at both packages. My personal testing matches your observations. I have a few thoughts:

1) lightgbm isn't a random forest library by default. Yes, it has random forest capabilities, but it suffers from one fatal flaw: it does not use true bootstrap resampling (bagging); its row subsampling is done without replacement. This results in worse performance from lightgbm random forests than ranger random forests (see the sketch after this list).

2) lightgbm isn't necessarily faster. lightgbm is extremely fast, but because of the way it is set up, it parallelizes the column search for optimal splits instead of parallelizing the tree building. Ranger parallelizes the tree building, which results in less CPU idle time.

3) sklearn random forests take up a TON of memory. This package originally used sklearn random forests, but I eventually had to switch because they were too slow and memory intensive.
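For reference on point 1, here is a minimal sketch of LightGBM's random forest mode using plain lightgbm (not miceforest's internals). The `rf` boosting mode requires bagging to be configured explicitly, and its bagging subsamples rows without replacement, unlike the bootstrap a classic random forest (or ranger, by default) draws:

```python
import lightgbm as lgb
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

train = lgb.Dataset(X, label=y)

# LightGBM's random-forest mode. 'rf' boosting requires bagging_fraction
# and bagging_freq to be set; rows are subsampled WITHOUT replacement,
# which differs from the with-replacement bootstrap of a classic forest.
params = {
    "objective": "regression",
    "boosting": "rf",
    "bagging_fraction": 0.632,  # fraction of rows drawn per tree
    "bagging_freq": 1,          # resample rows at every iteration
    "feature_fraction": 0.3,    # roughly analogous to ranger's mtry
    "num_leaves": 31,
    "verbosity": -1,
}

model = lgb.train(params, train, num_boost_round=100)  # 100 trees
```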

Overall miceRanger gives better performance, and I think it's faster. If you want to speed it up further, try feeding different parameters to ranger through the dots, such as the subsample fraction and mtry. You can also decrease the max_depth to make it much faster.
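The miceforest analogue of that tuning advice, as a hedged sketch: in a miceforest 5.x-style API, extra keyword arguments to `mice()` are forwarded to lightgbm, so speed-oriented parameters can be set the same way. The exact constructor and argument names have shifted across versions, so treat the details below as assumptions rather than the definitive interface:

```python
import numpy as np
import pandas as pd
import miceforest as mf

# Small synthetic frame with missing values, for illustration only.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(500, 10)),
                  columns=[f"v{i}" for i in range(10)])
df_amp = mf.ampute_data(df, perc=0.2, random_state=42)

kernel = mf.ImputationKernel(df_amp, random_state=42)

# Speed-oriented settings forwarded to lightgbm: shallower trees,
# fewer of them, and aggressive row/column subsampling.
kernel.mice(
    3,                      # MICE iterations
    max_depth=8,
    num_iterations=48,      # trees per model
    bagging_fraction=0.5,
    bagging_freq=1,
    feature_fraction=0.5,
)
```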

sandrejev commented 1 year ago
Hello Sam,

Thank you for your reply. I was digging into the R code, and I think my bottleneck is the number of variables. In miceRanger there is a loop over all columns containing NAs that is not parallelized. Reconstructing one variable doesn't take much time, but with a thousand of them those seconds add up. Sadly my internet doesn't work properly right now, but a solution that parallelizes the list of variables in chunks should increase performance dramatically for large datasets. Ranger on its own uses some parallelization (I believe through num.threads), but this gave me only around a 20% improvement on 60 vs. 4 cores (and maybe 40% on 60 vs. 1).

Best regards,
Sergej
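For anyone landing here later, the chunked parallelization sandrejev proposes could look roughly like the generic Python sketch below (illustrative only, not miceRanger's actual code): parallelize the per-column loop itself with joblib instead of relying on the forest's internal threading. One trade-off to note: columns imputed in the same parallel batch cannot see each other's freshly imputed values within that iteration.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor

def impute_column(df, col):
    """Fit a model on complete rows of `col`, predict its missing rows."""
    miss = df[col].isna()
    # Crude predictor matrix: mean-fill other columns' gaps.
    predictors = df.drop(columns=[col]).fillna(df.mean(numeric_only=True))
    model = RandomForestRegressor(n_estimators=50, n_jobs=1)  # 1 thread per worker
    model.fit(predictors[~miss], df.loc[~miss, col])
    return col, model.predict(predictors[miss])

# Toy data: 1000 rows, 50 columns, ~10% missing.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 50)))
data = data.mask(rng.random(data.shape) < 0.1)
data.columns = [f"v{i}" for i in data.columns]

na_cols = [c for c in data.columns if data[c].isna().any()]

# One imputation pass: the per-column loop itself is spread across
# workers, rather than serialized with only within-model threading.
results = Parallel(n_jobs=8)(delayed(impute_column)(data, c) for c in na_cols)
for col, preds in results:
    data.loc[data[col].isna(), col] = preds
```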