Closed: baggepinnen closed this issue 2 years ago
This sort of comparison is completely expected. Any algorithm that searches a small subspace of equation space is going to beat a genetic algorithm that searches the complete equation space. e.g., for typical problem sizes, the linear equation space has on the order of 10^10 fewer combinatorial possibilities than the full equation space.
I do like the idea: allowing the user to put priors on the equation, such as a high frequency of polynomial terms, would be really useful for these sorts of common equations! Maybe a polynomial search could even be part of the internal loop? Although this is very specific to these sorts of equations, whereas SymbolicRegression.jl can work with arbitrary (even non-differentiable) operators.
(The core algorithm has improved a lot over the past year; it should do fine at polynomial searches now, so long as the true equation is a relatively small polynomial.)
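As a rough illustration of the prior idea, one could bias how operators are drawn when generating or mutating population members. This is only a sketch, not SymbolicRegression.jl's actual API; the operator pool and weights below are made up:

```julia
using StatsBase: sample, Weights

# Hypothetical operator pool with user-supplied prior weights
# (illustrative only, not SymbolicRegression.jl's API).
operators = [:+, :-, :*, :/, :exp, :log]
prior     = Weights([4.0, 4.0, 4.0, 1.0, 0.5, 0.5])  # favor polynomial-building ops

# Drawing operators this way skews randomly generated
# equations toward polynomial structure.
pick_operator() = sample(operators, prior)
```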
I noticed that it took quite some time for the symbolic regression to beat a simple variable subset selection using linear regression. To clarify what I mean by this: I consider a large regressor matrix `A` in the problem `y = A*b`, where `b` are the parameters, and the task is to select a subset of the columns of `A` to include in the linear model. This can typically be solved to near optimality with LASSO regression. After running for long enough, the symbolic regression did indeed find a better equation for my toy problem than my subset selection did, at a similar number of estimated parameters, but it makes me wonder whether subset selection could be used as a heuristic to seed some of the population members?

For reference, I include a naive, brute-force algorithm that selects `n` variables from a regressor matrix `A` in an exact way, in case my explanation above didn't make sense.
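A minimal sketch of such a brute-force search, assuming Combinatorics.jl for subset enumeration and an ordinary least-squares fit per candidate subset (the function name and signature are illustrative):

```julia
using Combinatorics: combinations
using LinearAlgebra: norm

# Exhaustively try every size-n column subset of A, fit each by
# least squares, and keep the subset with the smallest residual norm.
function best_subset(A::AbstractMatrix, y::AbstractVector, n::Integer)
    best_inds, best_b, best_err = Int[], Float64[], Inf
    for inds in combinations(1:size(A, 2), n)
        b = A[:, inds] \ y                 # least-squares fit on this subset
        err = norm(A[:, inds] * b - y)     # residual for this candidate
        if err < best_err
            best_inds, best_b, best_err = collect(inds), b, err
        end
    end
    return best_inds, best_b, best_err
end
```

This is O(binomial(m, n)) in the number of columns `m`, which is exactly why the LASSO relaxation mentioned above (e.g. `fit(LassoPath, A, y)` from Lasso.jl, if I recall the API correctly) is the usual practical substitute.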