DOI-BOR / PyForecast

PyForecast is a statistical modeling tool used by Reclamation water managers and reservoir operators to train and build predictive models of seasonal inflows and streamflows. It allows users to make current water-year forecasts using models developed with the program.

Test Different Regression Schemes in StatsModels Package #24

Open tjrocha opened 5 years ago

tjrocha commented 5 years ago

It would be nice to test whether better models can be generated using the regression schemes available in the StatsModels Python package:

Linear Regression - https://www.statsmodels.org/stable/examples/index.html#regression
Generalized Linear - https://www.statsmodels.org/stable/examples/index.html#glm
Discrete Choice - https://www.statsmodels.org/stable/examples/index.html#discrete
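
For reference, a minimal sketch of the three statsmodels families linked above, on made-up data (the arrays and seed are illustrative, not PyForecast's):

import numpy as np
import statsmodels.api as sm

# Illustrative predictor/predictand pair (not PyForecast data)
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(30, 1)))   # predictor plus intercept column
y = X @ np.array([50.0, 0.4]) + rng.normal(scale=8.0, size=30)

ols = sm.OLS(y, X).fit()                                      # linear regression
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()       # generalized linear
logit = sm.Logit((y > y.mean()).astype(int), X).fit(disp=0)   # discrete choice

print(ols.params)    # OLS and the Gaussian GLM agree
print(glm.params)
print(logit.params)  # logit fits class membership, not the flow value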

tjrocha commented 5 years ago

Swapping out the ordinary least squares (OLS) regression from numpy for the statsmodels implementation of OLS shows virtually no difference in preliminary tests. Using the statsmodels OLS comes with the added benefit of exposing useful diagnostics for model analysis.
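
For context, a minimal sketch of the swap being described, on made-up single-predictor data (illustrative, not the preliminary-test data):

import numpy as np
import statsmodels.api as sm

# Illustrative single-predictor data
rng = np.random.default_rng(0)
x = rng.normal(size=29)
y = 0.36 * x + 51.5 + rng.normal(scale=10.0, size=29)

# numpy OLS: append a ones column for the intercept
A = np.column_stack([x, np.ones_like(x)])
coefs_np, *_ = np.linalg.lstsq(A, y, rcond=None)

# statsmodels OLS on the same design matrix returns the same coefficients
results = sm.OLS(y, A).fit()
print(coefs_np)           # [slope, intercept] from numpy
print(results.params)     # same values from statsmodels
print(results.summary())  # plus diagnostics like the table below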

Statsmodels OLS Sample Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.101
Model:                            OLS   Adj. R-squared:                  0.067
Method:                 Least Squares   F-statistic:                     3.025
Date:                Mon, 15 Jul 2019   Prob (F-statistic):             0.0934
Time:                        13:42:18   Log-Likelihood:                -147.19
No. Observations:                  29   AIC:                             298.4
Df Residuals:                      27   BIC:                             301.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.3562      0.205      1.739      0.093      -0.064       0.776
const         51.5287     16.175      3.186      0.004      18.341      84.717
==============================================================================
Omnibus:                        5.472   Durbin-Watson:                   2.440
Prob(Omnibus):                  0.065   Jarque-Bera (JB):                4.610
Skew:                           0.977   Prob(JB):                       0.0998
Kurtosis:                       2.990   Cond. No.                         171.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Statsmodels OLS output interpretation: https://blog.datarobot.com/ordinary-least-squares-in-python

tjrocha commented 5 years ago

@kevinfol @jslanini - Thoughts? Should we pull the trigger on this?

jslanini commented 5 years ago

AIC and Mallows' Cp are equivalent for OLS, so this is useful from that standpoint. How do they fit the distribution to determine confidence intervals? Also, we should be using prediction intervals rather than confidence intervals.
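
On the last point, statsmodels exposes both interval types through get_prediction(); a minimal sketch on made-up data:

import numpy as np
import statsmodels.api as sm

# Illustrative fit; add_constant prepends the intercept column
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(30, 1)))
y = X @ np.array([50.0, 0.4]) + rng.normal(scale=8.0, size=30)
results = sm.OLS(y, X).fit()

# summary_frame reports both interval types:
#   mean_ci_* : confidence interval on the fitted mean
#   obs_ci_*  : prediction interval on a new observation (wider)
frame = results.get_prediction(X).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]].head())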

kevinfol commented 5 years ago

Is there any benefit to running statsmodels during the selection process? I would think the numpy routine is much faster, and statsmodels might slow things down a bunch. Maybe we should just run statsmodels after the selection routine chooses the best-performing models; then we would still get the CIs without slowing everything down.
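
A minimal sketch of that two-phase idea; the helper names and the selection loop are hypothetical, not PyForecast code:

import numpy as np
import statsmodels.api as sm

def fit_fast(A, y):
    """Cheap numpy fit for use inside the selection loop."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

def diagnose(A, y):
    """Full statsmodels fit, run only on the chosen finalists."""
    return sm.OLS(y, A).fit()

# Hypothetical selection step: score candidates with the fast fit,
# then fit statsmodels only on the winner for summary() / conf_int():
# best = min(candidates, key=lambda A: cv_score(fit_fast(A, y)))
# results = diagnose(best, y)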

tjrocha commented 5 years ago

Ok. The numpy implementation is faster; a benchmark timing test and its results are shown below.

As Kevin suggests, we can use the statsmodels regression for a more in-depth diagnostic of the numpy-selected models, and probably for the random sampling of the regressed coefficients using the CI distribution. If we want to use some of the statsmodels built-in model-performance metrics in the model selection process, the trade-off would be felt in regression-analysis speed.

Attachment: beu.zip

--------------------- RUN INFO ------------------------
beu.fcst 
30-year dataset
No missing data
9 datasets in Jan-01st predictor pool
SFFS selection method
50 models
Leave-one-out CV method
CVAR scoring method
Normal Distribution

RUN-1 NUMPY
---------------------- BENCHMARK ----------------------
Started MLR Feature Selection at 1563291069.1148264
Finished MLR Feature Selection at 1563291139.26445
Time Elapsed: 70.14962363243103
---------------------- BENCHMARK ----------------------
Started PCAR Feature Selection at 1563291284.8808424
Finished PCAR Feature Selection at 1563291339.1873171
Time Elapsed: 54.306474685668945
---------------------- BENCHMARK ----------------------
Started ZSCR Feature Selection at 1563291366.2312653
Finished ZSCR Feature Selection at 1563291385.2428422
Time Elapsed: 19.011576890945435

RUN-2 STATSMODELS
---------------------- BENCHMARK ----------------------
Started MLR Feature Selection at 1563292321.7858927
Finished MLR Feature Selection at 1563292452.5308974
Time Elapsed: 130.74500465393066
---------------------- BENCHMARK ----------------------
Started PCAR Feature Selection at 1563292471.0299952
Finished PCAR Feature Selection at 1563292596.8994033
Time Elapsed: 125.86940813064575
---------------------- BENCHMARK ----------------------
Started ZSCR Feature Selection at 1563292614.8862555
Finished ZSCR Feature Selection at 1563292644.5180862
Time Elapsed: 29.63183069229126

RUN-3 STATSMODELS 
---------------------- BENCHMARK ----------------------
Started MLR Feature Selection at 1563292697.4007733
Finished MLR Feature Selection at 1563292825.3952856
Time Elapsed: 127.99451231956482
Models Analyzed 9040
---------------------- BENCHMARK ----------------------
Started PCAR Feature Selection at 1563292889.596083
Finished PCAR Feature Selection at 1563293016.8857193
Time Elapsed: 127.2896363735199
Models Analyzed 3140
---------------------- BENCHMARK ----------------------
Started ZSCR Feature Selection at 1563293044.44604
Finished ZSCR Feature Selection at 1563293074.5822127
Time Elapsed: 30.136172771453857
Models Analyzed 2985

RUN-4 NUMPY
---------------------- BENCHMARK ----------------------
Started MLR Feature Selection at 1563293159.5574477
Finished MLR Feature Selection at 1563293225.5578628
Time Elapsed: 66.00041508674622
Models Analyzed 9040
---------------------- BENCHMARK ----------------------
Started PCAR Feature Selection at 1563293248.3344646
Finished PCAR Feature Selection at 1563293304.002554
Time Elapsed: 55.668089389801025
Models Analyzed 3140
---------------------- BENCHMARK ----------------------
Started ZSCR Feature Selection at 1563293337.7362983
Finished ZSCR Feature Selection at 1563293355.5186052
Time Elapsed: 17.782306909561157
Models Analyzed 2985
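
For reference, a minimal sketch of the kind of timing wrapper that would produce the log lines above; the benchmark() helper and run_selection callable are hypothetical stand-ins, not the actual PyForecast code:

import time

def benchmark(label, run_selection):
    # Hypothetical wrapper mirroring the log format above;
    # run_selection stands in for the actual feature-selection routine
    print("---------------------- BENCHMARK ----------------------")
    start = time.time()
    print("Started {} Feature Selection at {}".format(label, start))
    run_selection()
    end = time.time()
    print("Finished {} Feature Selection at {}".format(label, end))
    print("Time Elapsed: {}".format(end - start))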

Actual code in the FeatureSelectionV3.py MultipleRegression() method:

        """ Fit the model and get the coefficients """
        model = np.linalg.lstsq(x.T, training_predictand, rcond=None)
        coefs = model[0][:-1]
        intercept = model[0][-1]

        #model = sm.OLS(training_predictand, x.T)
        #results = model.fit()
        #print(results.summary())
        #coefs = np.array([results.params[:-1]]).T
        #intercept = np.array([results.params[-1]])
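
And a minimal sketch of the coefficient-sampling idea mentioned above, drawing parameter sets from the fitted sampling distribution (a normal approximation, on made-up data; not PyForecast code):

import numpy as np
import statsmodels.api as sm

# Illustrative data in place of the training arrays
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(29, 1)))
y = X @ np.array([51.5, 0.36]) + rng.normal(scale=9.0, size=29)
results = sm.OLS(y, X).fit()

# Sample coefficient sets from N(params, cov_params());
# each row is one plausible (intercept, slope) draw
samples = rng.multivariate_normal(results.params, results.cov_params(), size=1000)
print(samples.mean(axis=0))  # close to results.params
print(results.conf_int())    # compare against the analytic 95% CI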