Closed: baggepinnen closed this issue 2 years ago
This sort of comparison is completely expected. Any algorithm that searches a small subspace of equation space is going to beat a genetic algorithm that searches the complete equation space. e.g., for typical problem sizes, the linear equation space has on the order of 10^10 fewer combinatorial possibilities than the full equation space.
I do like the idea: allowing the user to put priors on the equation, such as a high frequency of polynomial terms, would be really useful for these sorts of common equations! Maybe a polynomial search could even be part of the internal loop? Although this is very specific to these sorts of equations, whereas SymbolicRegression.jl can work with arbitrary (even non-differentiable) operators.
(The core algorithm has improved a lot over the past year; it should do fine at polynomial searches now, so long as the true equation is a relatively small polynomial.)
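As a rough illustration of the prior idea, one could bias how operators are drawn when generating or mutating population members. This is only a sketch, not SymbolicRegression.jl's actual API; the operator pool and weights below are made up:

```julia
using StatsBase: sample, Weights

# Hypothetical operator pool with user-supplied prior weights
# (illustrative only, not SymbolicRegression.jl's API).
operators = [:+, :-, :*, :/, :exp, :log]
prior     = Weights([4.0, 4.0, 4.0, 1.0, 0.5, 0.5])  # favor polynomial-building ops

# Drawing operators this way skews randomly generated
# equations toward polynomial structure.
pick_operator() = sample(operators, prior)
```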
I noticed that it took quite some time for the symbolic regression to beat a simple variable subset selection using linear regression. To clarify what I mean by this: I consider a large regressor matrix `A` in the problem `y = A*b`, where `b` are the parameters, and the task is to select a subset of the columns of `A` to include in the linear model. This can typically be solved to near optimality with LASSO regression. After running for long enough, the symbolic regression did indeed find a better equation for my toy problem than my subset selection did, at a similar number of estimated parameters, but it makes me wonder whether subset selection could be used as a heuristic to seed some of the population members?

For reference, I include a naive, brute-force algorithm that selects `n` variables from a regressor matrix `A` in an exact way, in case my explanation above didn't make sense.
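A minimal sketch of such a brute-force search, assuming Combinatorics.jl for subset enumeration and an ordinary least-squares fit per candidate subset (the function name and signature are illustrative):

```julia
using Combinatorics: combinations
using LinearAlgebra: norm

# Exhaustively try every size-n column subset of A, fit each by
# least squares, and keep the subset with the smallest residual norm.
function best_subset(A::AbstractMatrix, y::AbstractVector, n::Integer)
    best_inds, best_b, best_err = Int[], Float64[], Inf
    for inds in combinations(1:size(A, 2), n)
        b = A[:, inds] \ y                 # least-squares fit on this subset
        err = norm(A[:, inds] * b - y)     # residual for this candidate
        if err < best_err
            best_inds, best_b, best_err = collect(inds), b, err
        end
    end
    return best_inds, best_b, best_err
end
```

This is O(binomial(m, n)) in the number of columns `m`, which is exactly why the LASSO relaxation mentioned above (e.g. `fit(LassoPath, A, y)` from Lasso.jl, if I recall the API correctly) is the usual practical substitute.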