JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)
MIT License

Feature Requests #54

Open azev77 opened 4 years ago

azev77 commented 4 years ago

@tlienart

  1. All Subset (Best K Subset): I've got parsimonious code for All Subset Regression that I'd like to share; it might have a nice home in this package. While it's much faster than R's leaps package, I would include a strong warning to users: this optimization problem is combinatorial and very much not convex. Avoid it for p > 20.

  2. Approximate All Subset (MIO): As you probably know, Bertsimas et al. (2016) show how to use mixed-integer optimization (MIO) to approximately train all-subset regression for p = 1000 in minutes, and Bertsimas et al. (2019) do this for p = 100k in minutes. The corresponding Julia package is SubsetSelection.jl; also see the forthcoming Julia package MIP-Boost. It would be really awesome if these linear models were featured in MLJ.

  3. Relaxed Lasso: Hastie et al. (2017) responded that the relaxed Lasso matches the Lasso in low-SNR scenarios and beats All Subset in high-SNR scenarios, and glmnet now features a "relaxed Lasso" option. Would this be difficult to add to MLJLinearModels? (Maybe we can work on it together? A rough sketch of the idea follows this list.)

  4. Linear Interactions: There are some cool new packages for fitting interactions in linear models. I find these compare in out-of-sample fit with non-linear models (xgboost etc., no joke). R: sprintr & HierNet; Python: HierScale.

  5. Nonconvex penalties (MCP, SCAD): https://github.com/joshday/SparseRegression.jl/blob/master/README.md

  6. Forward selection / stepwise regression: (Stack Overflow, has Julia code); textbook code.

  7. Ensembles of Penalized Linear Models: Christidis et al. (2019): paper; old code: EnsembleEN; new code: SplitReg.
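
For point 3, here is a minimal sketch of the two-stage idea behind the glmnet-style relaxed Lasso. It is a sketch only, not this package's API: relaxed_lasso and γ are just illustrative names, intercept handling and standardization are omitted, and the Lasso solution θ_lasso could come from LassoRegression here or from anywhere else.

using LinearAlgebra

#Sketch: blend the Lasso fit with an unpenalized refit on its active set.
#γ = 1 recovers the Lasso solution, γ = 0 the pure least-squares refit.
function relaxed_lasso(X, y, θ_lasso; γ = 0.5)
    A = findall(!iszero, θ_lasso)        #active set selected by the Lasso
    isempty(A) && return copy(θ_lasso)   #nothing selected: keep the Lasso fit
    θ_refit    = zeros(length(θ_lasso))
    θ_refit[A] = X[:, A] \ y             #OLS restricted to the active set
    return γ .* θ_lasso .+ (1 - γ) .* θ_refit
end

In practice γ (and the Lasso's λ) would be chosen by cross-validation, which is what glmnet does.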

tlienart commented 4 years ago

It would be fantastic to have someone else help me with this package and grow some (possibly experimental) functionality.

I'm committed to maintaining the package, but if you would like to join the team, that would be great! And the things you're mentioning would definitely be very nice additions.

azev77 commented 4 years ago

Here is the simplest Best Subset regression:

using Combinatorics; #to generate all subsets
#g() trains a linear model on every subset and records its out-of-sample SSE
@inline function g(X, y, train, test, n_ops, subs, res)
    @inbounds for m = 1:n_ops
        b  = X[train, subs[m]] \ y[train];
        ê  = X[test,  subs[m]]*b - y[test];   #oos residuals ê = ŷ - y (sign is irrelevant for the SSE)
        res[m] = ê'ê;
    end
    return res
end
#f() trains using all columns of X
@inline function f(X, y, train, test)
    b  = X[train, :] \ y[train];
    ê  = X[test,  :]*b - y[test];   #oos residuals ê = ŷ - y
    sse = ê'ê;              #SSE
    return sse
end

#Test Drive
using MLJ;
X, y =  @load_boston;
Xnames = keys(X);
X = hcat(X...); #convert to array.
train, test = partition(eachindex(y), .7, rng=333);
#
p = size(X, 2);   #p (the number of predictors) is the only hyperparameter
n_ops = 2^p - 1; res = (1e+5)*ones(n_ops);       #one out-of-sample SSE slot per non-empty subset
subs = collect(powerset(1:p, 1));                #all non-empty subsets of 1:p (Combinatorics)
@time f(X, y, train, test)
@time g(X, y, train, test, n_ops, subs, res)
extrema(res)
subs[argmin(res)] #variables of best LM
Xnames[ subs[argmin(res)] ]

#using BenchmarkTools
#@benchmark f(X, y, train, test)
#@benchmark g(X, y, train, test, n_ops, subs, res)

Of course this needs to be cross-validated, etc. It's a valuable learning tool to remind students that the model with the most variables won't necessarily have the best out-of-sample predictive power...
Also, the same exercise is a helluva lot slower in R.
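
As a rough illustration of the "cross-validated" part, here is an untested sketch (kfold_indices and g_cv are made-up helper names; any CV iterator, e.g. MLJ's, could replace the hand-rolled splitter):

using Random: shuffle, MersenneTwister

#Split 1:n into K (train, test) index pairs.
function kfold_indices(n, K; seed = 333)
    idx   = shuffle(MersenneTwister(seed), 1:n)
    folds = [idx[k:K:n] for k in 1:K]              #K roughly equal folds
    return [(setdiff(idx, f), f) for f in folds]
end

#Average the out-of-sample SSE of every subset over K folds, reusing g() above.
function g_cv(X, y, subs, K = 5)
    res = zeros(length(subs))
    for (train, test) in kfold_indices(length(y), K)
        res .+= g(X, y, train, test, length(subs), subs, zeros(length(subs)))
    end
    return res ./ K
end

res_cv = g_cv(X, y, subs)
Xnames[subs[argmin(res_cv)]]   #variables of the best LM by CV error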

The other valuable lesson is that it's impossible to train all models in general. The Boston data is a toy example with p = 12 (> 4,000 models) that can be done in seconds, but the Ames housing data with p = 79 (2^79 models) would take lifetimes.
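
Rough arithmetic behind that last claim (the fit rate is just an assumed ballpark, not a measurement):

n_models = big(2)^79 - 1                #≈ 6.0e23 candidate subsets for p = 79
seconds  = n_models / 1e9               #assuming 1e9 least-squares fits per second
years    = seconds / (365 * 24 * 3600)  #≈ 1.9e7 years, i.e. lifetimes indeed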

tlienart commented 4 years ago

Thanks a lot for this! I'm a bit swamped at the moment but will try to look into this soon. And if anything it can make a nice tutorial too!