Open ryanurbs opened 6 years ago
Hello, I want to help with this issue, but first I will write some tests that guarantee the code after optimization produces the same results as the code before optimization.
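One way to structure such equivalence tests is to keep a slow, obviously-correct reference implementation and assert that the optimized version matches it exactly. The sketch below is illustrative, not taken from scikit-rebate: the function names are mine, and pairwise Hamming distance stands in for whatever inner computation gets optimized.

```python
import numpy as np

def hamming_matrix_loop(X):
    """Reference implementation: pairwise Hamming distances via explicit
    Python loops. Slow, but easy to verify by hand."""
    n = X.shape[0]
    D = np.zeros((n, n), dtype=np.intp)
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sum(X[i] != X[j])
    return D

def hamming_matrix_vectorized(X):
    """Optimized implementation: one broadcasted comparison, no Python loops."""
    return np.sum(X[:, None, :] != X[None, :, :], axis=2)

def test_optimized_matches_reference():
    # Random binary feature matrix; any mismatch between the two
    # implementations fails the regression test.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(30, 8))
    assert np.array_equal(hamming_matrix_loop(X), hamming_matrix_vectorized(X))
```

A pytest runner would pick up `test_optimized_matches_reference` automatically; the same pattern extends to comparing full feature-importance vectors with `np.allclose` when floating-point scores are involved.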
Hi folks - I took an initial pass at this to see if I could proof-of-concept some changes. I also implemented a benchmarking tool so folks can see how any branch is performing.
See here for my draft PR - it's not quite ready yet, as I need to rerun my performance benchmarks. It provides a pattern for one case (ReliefF, binary features, discrete data) that I believe could be implemented across all cases to provide clearer code and much more performant operations. I'm working on a full benchmark run, but initial results for the current parallel ReliefF test on binary/discrete data show a runtime improvement from ~1.85 seconds down to ~0.6 seconds on the small testing dataset.
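To give a flavor of what vectorizing that case can look like, here is a minimal sketch of a ReliefF-style scorer for binary features and a discrete outcome, simplified to k=1 (single nearest hit and nearest miss) and written with NumPy broadcasting instead of per-feature loops. The function name and this simplification are mine; the draft PR's actual implementation will differ.

```python
import numpy as np

def relieff_binary_discrete(X, y):
    """Simplified ReliefF (k=1) for binary features / discrete outcome.

    Assumes every class has at least two samples so each instance has a
    nearest hit. Weights lie in [-1, 1]; informative features score high.
    """
    n, d = X.shape
    # All pairwise Hamming distances in one broadcasted comparison.
    dist = np.sum(X[:, None, :] != X[None, :, :], axis=2)   # shape (n, n)
    big = d + 1                                             # > any distance
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # a sample is not its own hit
    hits = np.argmin(np.where(same, dist, big), axis=1)     # nearest hit
    misses = np.argmin(np.where(y[:, None] != y[None, :], dist, big), axis=1)
    # Classic Relief update: reward features that differ from the miss,
    # penalize features that differ from the hit.
    w = (X != X[misses]).sum(axis=0) - (X != X[hits]).sum(axis=0)
    return w / n
```

Note the `(n, n, d)` intermediate from the broadcast is the memory cost you trade for eliminating the Python-level loops, which is why this pattern pays off most on modest `n`.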
One of the major challenges of making the Relief-based algorithms of ReBATE flexible enough to handle different dataset types, i.e. (1) continuous, discrete, or mixed feature types, (2) binary, multiclass, or continuous outcomes, and (3) presence of missing data, is to do so in a way that preserves computational efficiency. Presently scikit-rebate is implemented in a fairly compact manner; however, this may not ultimately be the most efficient implementation. This issue posting seeks enhancements to ReBATE and its underlying algorithms (i.e. ReliefF, SURF, SURF*, MultiSURF, MultiSURF*, TuRF) to make the respective algorithms run faster and utilize less memory.
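On the memory side, one generic technique worth considering is chunking: compute pairwise distances a block of rows at a time so the large broadcast intermediate never exists all at once. The sketch below is an assumption about how this could be applied, not code from scikit-rebate; the function name and `chunk` parameter are hypothetical.

```python
import numpy as np

def hamming_rows_chunked(X, chunk=256):
    """Full pairwise Hamming distance matrix, computed in row blocks.

    A single broadcast would materialize an (n, n, d) boolean tensor;
    chunking caps the intermediate at (chunk, n, d), trading a short
    Python loop for a much smaller peak-memory footprint.
    """
    n = X.shape[0]
    D = np.empty((n, n), dtype=np.intp)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Only this block's comparison tensor is live at any moment.
        D[start:stop] = np.sum(X[start:stop, None, :] != X[None, :, :], axis=2)
    return D
```

The same blocking idea should carry over to neighbor searches in SURF/MultiSURF-style scans, where the distance matrix is the dominant memory consumer.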