DOI-BOR / PyForecast

PyForecast is a statistical modeling tool used by Reclamation water managers and reservoir operators to train and build predictive models for seasonal inflows and streamflows. PyForecast allows users to make current water-year forecasts using models developed with the program.

Brute Force Feature Selection? #10

Closed tjrocha closed 5 years ago

tjrocha commented 5 years ago

Just spitballing here: would it be useful to incorporate a Brute Force Feature Selection method into the software? For relatively simple models (#Predictors = 15), we can just brute-force evaluate all possible combinations of predictors (#Combinations = 32,767) and have the program report the top-performing models. This becomes more practical the fewer predictors there are, since the number of equations to evaluate is (2^n)-1 with n = #Predictors, which roughly doubles with each added predictor.
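To make the (2^n)-1 count concrete, here is a minimal sketch of the enumeration using only the standard library. The function names and the integer predictor IDs are hypothetical, chosen for illustration; none of this is taken from the PyForecast code base.

```python
from itertools import combinations


def all_predictor_subsets(predictor_ids):
    """Yield every non-empty combination of predictors as a tuple."""
    predictor_ids = list(predictor_ids)
    for size in range(1, len(predictor_ids) + 1):
        # combinations() yields each subset of the given size exactly once
        yield from combinations(predictor_ids, size)


def count_subsets(n_predictors):
    """Number of non-empty subsets of n predictors: (2^n) - 1."""
    return 2 ** n_predictors - 1


# 15 predictors give 32,767 candidate models, matching the figure above.
n_models = sum(1 for _ in all_predictor_subsets(range(15)))
```

Because the subsets come from a generator, they never all have to sit in memory at once; each one can be scored and then discarded.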

On a related note, I noticed that the existing Feature Selection algorithm evaluates the same model multiple times, especially if that model is not selected/stored in the list of viable regression models. There might be some performance gains from storing just the salient metrics (Predictor IDs & Selected Metric) in memory for every model run and checking this in-memory object first, so the algorithm doesn't evaluate the same model more than once.
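One way the in-memory lookup could work is a dictionary keyed by the set of predictor IDs. This is only a sketch under assumptions: `ScoreCache` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever routine actually fits a regression and returns the selected metric.

```python
class ScoreCache:
    """Remember the metric for each predictor set so it is computed only once."""

    def __init__(self, score_fn):
        self._score_fn = score_fn  # hypothetical: fits a model, returns its metric
        self._scores = {}          # frozenset of predictor IDs -> metric
        self.evaluations = 0       # how many times score_fn actually ran

    def score(self, predictor_ids):
        # A frozenset makes the key independent of predictor order
        key = frozenset(predictor_ids)
        if key not in self._scores:
            self._scores[key] = self._score_fn(key)
            self.evaluations += 1
        return self._scores[key]


# Stand-in scoring function for demonstration only
cache = ScoreCache(lambda ids: len(ids))
cache.score([1, 2])
cache.score([2, 1])  # same model, served from the cache
```

Only the key and one number are stored per model, so the cache stays small compared with keeping full regression objects around.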

jslanini commented 5 years ago

I think this could be useful, particularly for testing the efficacy of the feature selection algorithm(s). Would you do it automatically for any predictor set smaller than an upper limit, or allow the user to select it as an option and warn them if they are going to overwhelm the computer?

tjrocha commented 5 years ago

Thanks for the feedback. I was thinking of making it a selectable option available in the Regression Tab, right above the Forward and Backward Selection methods. Yeah, warning the user would be a good idea, triggered when the number of equations to evaluate exceeds some arbitrarily large number.
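A rough sketch of what such a check might look like is below. The threshold value and the names `MAX_MODELS_WITHOUT_WARNING` and `check_brute_force_size` are assumptions for illustration, not anything defined in the program.

```python
import warnings

# Hypothetical cutoff; the real value would need tuning against actual run times
MAX_MODELS_WITHOUT_WARNING = 100_000


def check_brute_force_size(n_predictors, limit=MAX_MODELS_WITHOUT_WARNING):
    """Return the number of models to evaluate, warning if it exceeds the limit."""
    n_models = 2 ** n_predictors - 1
    if n_models > limit:
        warnings.warn(
            f"Brute force selection would evaluate {n_models:,} models "
            f"(limit {limit:,}); this may take a long time."
        )
    return n_models
```

In the GUI the warning would presumably be a dialog the user can accept or cancel rather than a Python warning, but the size check itself would be the same.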

The issue would be run time, since the program would have to evaluate all the models. In terms of memory, though, we can have it store only the top X models: when a new, better-performing model is evaluated, it just knocks off the worst-performing stored model, so we don't run out of memory and overwhelm the computer.
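The "keep only the top X" idea maps naturally onto a bounded min-heap. The sketch below assumes a higher metric is better and that `max_models` is at least 1; the `TopModels` class is a hypothetical illustration, not the committed implementation.

```python
import heapq
from itertools import count


class TopModels:
    """Keep only the best `max_models` results seen so far (higher score wins)."""

    def __init__(self, max_models):
        self.max_models = max_models  # assumed to be >= 1
        self._heap = []               # min-heap: worst retained model sits at index 0
        self._tiebreak = count()      # keeps heap entries orderable when scores tie

    def offer(self, score, predictor_ids):
        entry = (score, next(self._tiebreak), tuple(predictor_ids))
        if len(self._heap) < self.max_models:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # The new model beats the worst retained one, so it replaces it
            heapq.heapreplace(self._heap, entry)

    def best(self):
        """Return (score, predictor_ids) pairs, best first."""
        return [(s, ids) for s, _, ids in sorted(self._heap, reverse=True)]


top = TopModels(max_models=2)
for score, ids in [(0.5, (1,)), (0.9, (1, 2)), (0.7, (2,)), (0.1, (3,))]:
    top.offer(score, ids)
```

Memory use is bounded by X no matter how many models are evaluated, and each insertion costs O(log X). For a metric where lower is better (an error measure, say), the score could be negated before it is offered.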

tjrocha commented 5 years ago

Fully working and tested implementation with 8112f27.