Open ryanurbs opened 6 years ago
Hello, I want to help with this issue, but first I will write some tests that guarantee the code after optimization produces the same results as the code before optimization.
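One way to structure such equivalence tests is to keep a slow, obviously-correct reference implementation and assert that the optimized version matches it exactly. The sketch below is illustrative, not taken from scikit-rebate: the function names are mine, and pairwise Hamming distance stands in for whatever inner computation gets optimized.

```python
import numpy as np

def hamming_matrix_loop(X):
    """Reference implementation: pairwise Hamming distances via explicit
    Python loops. Slow, but easy to verify by hand."""
    n = X.shape[0]
    D = np.zeros((n, n), dtype=np.intp)
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sum(X[i] != X[j])
    return D

def hamming_matrix_vectorized(X):
    """Optimized implementation: one broadcasted comparison, no Python loops."""
    return np.sum(X[:, None, :] != X[None, :, :], axis=2)

def test_optimized_matches_reference():
    # Random binary feature matrix; any mismatch between the two
    # implementations fails the regression test.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(30, 8))
    assert np.array_equal(hamming_matrix_loop(X), hamming_matrix_vectorized(X))
```

A pytest runner would pick up `test_optimized_matches_reference` automatically; the same pattern extends to comparing full feature-importance vectors with `np.allclose` when floating-point scores are involved.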
Hi folks - I took an initial pass at this to see if I could proof-of-concept some changes. I also implemented a benchmarking tool so folks can see how any branch is performing.
See here for my draft PR - it's not quite ready yet, as I need to rerun my performance benchmarks. It provides a pattern for one case (ReliefF, binary features, discrete data) that I believe could be implemented across all cases to provide clearer code and much more performant operations. I'm working on a full benchmark run, but initial results for the current parallel ReliefF test on binary/discrete data show a runtime improvement from ~1.85 seconds down to ~0.6 seconds on the small testing dataset.
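To give a flavor of what vectorizing that case can look like, here is a minimal sketch of a ReliefF-style scorer for binary features and a discrete outcome, simplified to k=1 (single nearest hit and nearest miss) and written with NumPy broadcasting instead of per-feature loops. The function name and this simplification are mine; the draft PR's actual implementation will differ.

```python
import numpy as np

def relieff_binary_discrete(X, y):
    """Simplified ReliefF (k=1) for binary features / discrete outcome.

    Assumes every class has at least two samples so each instance has a
    nearest hit. Weights lie in [-1, 1]; informative features score high.
    """
    n, d = X.shape
    # All pairwise Hamming distances in one broadcasted comparison.
    dist = np.sum(X[:, None, :] != X[None, :, :], axis=2)   # shape (n, n)
    big = d + 1                                             # > any distance
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # a sample is not its own hit
    hits = np.argmin(np.where(same, dist, big), axis=1)     # nearest hit
    misses = np.argmin(np.where(y[:, None] != y[None, :], dist, big), axis=1)
    # Classic Relief update: reward features that differ from the miss,
    # penalize features that differ from the hit.
    w = (X != X[misses]).sum(axis=0) - (X != X[hits]).sum(axis=0)
    return w / n
```

Note the `(n, n, d)` intermediate from the broadcast is the memory cost you trade for eliminating the Python-level loops, which is why this pattern pays off most on modest `n`.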
One of the major challenges of making the Relief-based algorithms of ReBATE flexible enough to handle different dataset types, i.e. (1) continuous, discrete, or mixed feature types, (2) binary, multiclass, or continuous outcomes, and (3) presence of missing data, is to do so in a way that preserves computational efficiency. Presently scikit-rebate is implemented in a fairly compact manner; however, this may not ultimately be the most efficient implementation. This issue posting seeks enhancements to ReBATE and its underlying algorithms (i.e. ReliefF, SURF, SURF*, MultiSURF, MultiSURF*, TuRF) to make the respective algorithms run faster and utilize less memory.
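On the memory side, one generic technique worth considering is chunking: compute pairwise distances a block of rows at a time so the large broadcast intermediate never exists all at once. The sketch below is an assumption about how this could be applied, not code from scikit-rebate; the function name and `chunk` parameter are hypothetical.

```python
import numpy as np

def hamming_rows_chunked(X, chunk=256):
    """Full pairwise Hamming distance matrix, computed in row blocks.

    A single broadcast would materialize an (n, n, d) boolean tensor;
    chunking caps the intermediate at (chunk, n, d), trading a short
    Python loop for a much smaller peak-memory footprint.
    """
    n = X.shape[0]
    D = np.empty((n, n), dtype=np.intp)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Only this block's comparison tensor is live at any moment.
        D[start:stop] = np.sum(X[start:stop, None, :] != X[None, :, :], axis=2)
    return D
```

The same blocking idea should carry over to neighbor searches in SURF/MultiSURF-style scans, where the distance matrix is the dominant memory consumer.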