johanbluecreek opened 2 years ago
Interesting question. I see two options:
One option is to use SymbolicRegression.jl like an nsimplify on steroids. For example, let's say we want to find a Ramanujan-like approximation to pi. To do this, you pass integers as features, and set complexity_of_constants=100 - this will make real constants too complex to use.
options = Options(
    # Make constants prohibitively expensive:
    complexity_of_constants=100,
    unary_operators=(sqrt, square),
    binary_operators=(+, *, /, -),
    mutationWeights=[0.0, 0.47, 0.79, 5.1, 1.7, 0.0020, 0.00023, 0.21],
    # ^ Set p(mutate_constant)=0.0
    shouldOptimizeConstants=false,
    # ^ Set constant optimization off (so we don't waste cycles)
    parsimony=0.001,
)
# Integers from 1 to 10:
X = reshape(collect(1.0:10.0), 10, 1)
y = [float(pi)]
EquationSearch(X, y; options=options, multithreading=true, niterations=1000)
The output this gives is actually pretty good, recovering many approximations of pi! https://en.wikipedia.org/wiki/Approximations_of_%CF%80 (I wonder if some of these are even known...?)
You could do something similar for other constants you wish to include - you could also set varMap to make the printout show those constants by name in such a case.
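For instance, a minimal sketch of that (assuming the same old-style API as the example above, and that EquationSearch accepts a varMap of feature names):

# Hypothetical: add e and sqrt(2) alongside the integers as extra "features",
# and name them via varMap so the printed equations are readable.
consts = [collect(1.0:10.0); exp(1.0); sqrt(2.0)]
X = reshape(consts, length(consts), 1)
y = [float(pi)]
names = [string.(1:10); "e"; "sqrt2"]
EquationSearch(X, y; options=options, niterations=1000, varMap=names)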
One option is to basically apply solution 1, but use it for the search itself (i.e., concatenate the constants/integers with the data features at each row); a sketch of what I mean is below.
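A minimal sketch of that idea, with made-up data (the Options from the example above and the same EquationSearch call are assumed):

# Hypothetical data: 3 real features, 100 rows (X is laid out as nfeatures x nrows).
X = randn(3, 100)
y = 2 .* X[1, :] .+ X[2, :] .^ 2 ./ 3
# Append one extra "feature" row per candidate integer constant:
consts = collect(1.0:10.0)
X_aug = vcat(X, repeat(consts, 1, size(X, 2)))  # now (3 + 10) x 100
EquationSearch(X_aug, y; options=options, niterations=100)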
More generally (say you want to include any integer), this is a bit tricky, especially because Optim.jl can only optimize float-like constants. But you could basically rely on the mutate_constant function (here) to explore the space of integers.
Right now, the library assumes that Node{T} and Dataset{T} have the same type T. In other words: expression constants have the same type as the element type of the dataset. However, this isn't really necessary - one could rewrite the backend to allow for Node{T1}, Dataset{T2}, which would allow you to directly store integers or rational numbers in the expressions, while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.
Another option is to implement a function
function val(tree::Node{T})::Int where {T}
    convert(Int, tree.val)
end
and make it so that every time tree.val is called, you would instead call val(tree). You could then set that output type (here, an Int) to some user-specified type (a hypothetical sketch of that is below). I think the Node{T1} / Dataset{T2} solution is more elegant though.
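For example, a hypothetical sketch of that user-specified-type variant (the stored value stays a float; only reads get converted):

# Hypothetical accessor; OUT is whatever type the user asks for,
# e.g. val(tree, Int) or val(tree, Rational{Int}).
val(tree::Node, ::Type{OUT}) where {OUT} = convert(OUT, tree.val)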
comment deleted [sorry for the off topic question]
Hi @qwertyjl - I think this question is unrelated to the use of SymbolicRegression.jl/PySR - apologies but I do not have time to answer general math/science questions. Best, Miles
@MilesCranmer
Right now, the library assumes that Node{T} and Dataset{T} have the same type T. In other words: expression constants have the same type as the element type of the dataset. However, this isn't really necessary - one could rewrite the backend to allow for Node{T1}, Dataset{T2}, which would allow you to directly store integers or rational numbers in the expressions, while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.
That would indeed be more elegant I think!
Nelder-Mead, the simplex optimization, looks like it could be made to work on integers (although I assume it will not be optimal). An alternative would be to implement a really simple optimization for integers ("add/subtract 1 to a constant") together with a warning, and offer the user a way to supply a custom optimizer if they want (i.e. they can write a function that takes the same arguments as the call to Optim.jl); a sketch of the simple version is below. What do you think about that?
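A hypothetical sketch of that simple integer optimization, written against plain vectors (hooking it into the package's constant-optimization call is left out):

# Hill-climb each constant by +/-1 steps and keep any change that improves the
# loss; `loss_of` stands in for evaluating the expression's loss with the
# given constants plugged in.
function optimize_integers!(constants::Vector{Int}, loss_of; max_passes::Int=10)
    best = loss_of(constants)
    for _ in 1:max_passes
        improved = false
        for i in eachindex(constants), step in (-1, 1)
            constants[i] += step
            trial = loss_of(constants)
            if trial < best
                best = trial
                improved = true
            else
                constants[i] -= step  # revert
            end
        end
        improved || break
    end
    return best
end

For example, optimize_integers!([3, 7], c -> (c[1] - 2)^2 + (c[2] - 5)^2) moves the constants to [2, 5] and returns a loss of 0.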
In any case, the splitting into Node{T1}, Dataset{T2} needs to deal with the optimization too I think. Also with the simplification, since stuff like "const1 / const2" might introduce rationals into a formula that should only contain integers.
I just came across Julia and this project, so I hope I do not talk too much nonsense! :)
Thank you for creating this great project!
@johanbluecreek Locally, I made SymbolicRegression.jl generate arbitrary integers by using these adaptations:
(Like Miles hinted) I changed mutate_constant
to basically this:
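# Shift the constant by a random integer in -5:5, so integer-valued constants stay integer-valued: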
newval = node.val + convert(T, rand(rng, -5:5))
node.val = newval
make_random_leaf
now looks like this:
function make_random_leaf(
    nfeatures::Int, ::Type{T}, ::Type{N}, rng::AbstractRNG=default_rng()
) where {T<:DATA_TYPE,N<:AbstractExpressionNode}
    if rand(rng, Bool)
        return constructorof(N)(; val=convert(T, rand(rng, -50:50))) # changed to return integers
    else
        return constructorof(N)(T; feature=rand(rng, 1:nfeatures))
    end
end
I also set should_simplify=false, since otherwise constants might get optimised into non-integers. However, I noticed that one can sometimes leave it on, since the (mostly small) non-integer results of simplification tend to get dominated by newly introduced integer constants.
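For reference, a sketch of how that looks in the Options call (only the relevant keywords shown; other settings omitted):

options = Options(;
    binary_operators=(+, -, *, /),
    should_simplify=false,  # keep simplification from folding integer constants into floats
)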
I have used this package for a work-flow that is basically: solve a differential equation (DE) using DifferentialEquations.jl, and then give that solution to SymbolicRegression.jl to find the analytical expression. From the DE it is quite obvious that the resulting expression should only contain integers or rationals. The optimization that is carried out for constants in the expressions gets very close, but it would be nice if such constants could be turned into exact integers or rationals. Is there such functionality already in SymbolicRegression.jl that I have missed, or would it be useful to have?
Simple examples would be having solutions like 5.0000000002*x1, or whatever, rounded to 5.0 * x1; more advanced being the rationals, with 0.200000002 * x2 fixed to x2 / 5.0 or similar; or more advanced still, sin(x1 * 0.31830988618454) to sin(x1 / π). In SymPy there is a function nsimplify that can handle the numerical part of such functionality, and works quite well, but I don't know if there is a Julia package that does the same thing.
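For the purely rational cases above, something like Base's rationalize would already handle the numeric step (though not special constants like π):

rationalize(5.0000000002; tol=1e-6)      # -> 5//1
rationalize(0.200000002; tol=1e-6)       # -> 1//5
rationalize(0.31830988618454; tol=1e-9)  # -> a large rational approximating 1/π; the symbolic π is not recovered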