MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://ai.damtp.cam.ac.uk/symbolicregression/
Apache License 2.0

Simplifying constants in result or during search #128

Open johanbluecreek opened 2 years ago

johanbluecreek commented 2 years ago

I have used this package for a workflow that is basically: solve a differential equation (DE) using DifferentialEquations.jl and then give that solution to SymbolicRegression.jl to find the analytical expression. From the DE it is quite obvious that the resulting expression should only contain integers or rationals. The optimization that is carried out for constants in the expressions gets very close, but it would be nice if there was either a way to simplify the constants in the resulting expressions, or a way to restrict/simplify them during the search itself.

Is there such functionality already in SymbolicRegression.jl that I have missed, or would it be useful to have?

A simple example would be a solution like 5.0000000002 * x1 being rounded to 5.0 * x1; a more advanced one is the rational case, where 0.200000002 * x2 would be fixed to x2 / 5.0 or similar; more advanced still, sin(x1 * 0.31830988618454) would become sin(x1 / π). In SymPy there is a function nsimplify that can handle the numerical part of such functionality, and it works quite well, e.g.

>>> from sympy import nsimplify, GoldenRatio, pi, E
>>> nsimplify(0.865255979442955, [GoldenRatio, pi, E], full=True)
E/pi

but I don't know if there is a Julia package that does the same thing.
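For the purely rational cases, Julia's built-in Base.rationalize already gets part of the way there (a small illustration; the tolerance is arbitrary, and it will not recognize irrational constants like π on its own):

julia> rationalize(0.200000002; tol=1e-6)
1//5

julia> rationalize(5.0000000002; tol=1e-6)
5//1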

MilesCranmer commented 2 years ago

Interesting question. I see two options:

1. Post-process expressions.

What you could do is use SymbolicRegression.jl like an nsimplify on steroids. For example, let's say we want to find a Ramanujan-like approximation to pi. To do this, you pass integers as features, and set complexity_of_constants=100 - this will make real constants too complex to use.

using SymbolicRegression

options = Options(
    # Make constants prohibitively expensive:
    complexity_of_constants=100, 
    unary_operators=(sqrt, square), 
    binary_operators=(+, *, /, -), 
    mutationWeights=[0.0, 0.47, 0.79, 5.1, 1.7, 0.0020, 0.00023, 0.21],
    # ^ Set p(mutate_constant)=0.0
    shouldOptimizeConstants=false,
    # ^ Set constant optimization off (so we don't waste cycles)
    parsimony=0.001,
)
# Integers from 1 to 10:
X = reshape(collect(1.0:10.0), 10, 1)
y = [float(pi)]

EquationSearch(X, y; options=options, multithreading=true, niterations=1000)

This gives me as output:

[Screenshot of the printed hall of fame, showing several discovered approximations of π]

which is actually pretty good, recovering many approximations of pi! https://en.wikipedia.org/wiki/Approximations_of_%CF%80 (I wonder if some of these are even known...?)

You could do something similar for other constants you wish to include - you could also set varMap so that the printout labels those constants in such a case.
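For instance (a hedged sketch, assuming the older EquationSearch API used above, where varMap is a keyword argument), you could name each feature after the constant it carries so the printed equations show those names directly:

varMap = string.(1:10)  # name feature i as "i", so equations print the integers themselves
EquationSearch(X, y; options=options, multithreading=true, niterations=1000, varMap=varMap)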

2. Search directly for integers/rationals.

One option is basically to apply solution 1, but for the search itself (i.e., concatenate the constants/integers with the data features at each row).
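A rough sketch of that concatenation (assuming X follows the older API's (num_features, num_rows) layout; X_aug and const_features are just illustrative names):

n_rows = size(X, 2)
# Append rows holding the candidate constants 1..10, repeated for every data point:
const_features = repeat(collect(1.0:10.0), 1, n_rows)
X_aug = vcat(X, const_features)
EquationSearch(X_aug, y; options=options, niterations=1000)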

More generally (say you want to include any integer), this is a bit tricky, especially because Optim.jl can only optimize float-like constants. But you could basically rely on the mutate_constant function (here) to explore the space of integers.
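As a purely hypothetical illustration (mutate_integer_constant is not a library function, just a sketch of the idea), an integer constant could be perturbed by a random step whose size shrinks with the annealing temperature:

using Random

function mutate_integer_constant(val::Int, temperature::Real; rng::AbstractRNG=Random.default_rng())
    # Larger temperature => larger allowed jumps; always allow at least ±1.
    max_step = max(1, round(Int, 10 * temperature))
    return val + rand(rng, -max_step:max_step)
end

mutate_integer_constant(5, 0.3)  # some integer in 2:8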

Right now, the library assumes that Node{T} and Dataset{T} have the same type T. In other words: expression constants have the same type as the element type of the dataset. However, this isn't really necessary - one could rewrite the backend to allow for Node{T1}, Dataset{T2}, which would allow you to directly store integers or rational numbers in the expressions, while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.
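At the language level that split is cheap, since Julia promotes mixed numeric types automatically; a tiny illustration of exact Rational constants being evaluated against Float64 data:

x = randn(100)   # Float64 data, as a Dataset{Float64} would hold
c = 1 // 5       # exact Rational constant, as a hypothetical Node{Rational{Int}} could store
y = sin.(c .* x) # promotion to Float64 happens only at evaluation time
typeof(y)        # Vector{Float64}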

Another option is to implement a function

function val(tree::Node{T})::Int where {T}
    return convert(Int, tree.val)
end

and make it so that every time tree.val is accessed, you would instead call val(tree). You could then set that output type (here, an Int) to some user-specified type. I think the Node{T1} / Dataset{T2} solution is more elegant, though.

qwertyjl commented 2 years ago

comment deleted [sorry for the off-topic question]

MilesCranmer commented 2 years ago

Hi @qwertyjl - I think this question is unrelated to the use of SymbolicRegression.jl/PySR - apologies but I do not have time to answer general math/science questions. Best, Miles

dan-blank commented 5 months ago

@MilesCranmer

Right now, the library assumes that Node{T} and Dataset{T} have the same type T. In other words: expression constants have the same type as the element type of the dataset. However, this isn't really necessary - one could rewrite the backend to allow for Node{T1}, Dataset{T2}, which would allow you to directly store integers or rational numbers in the expressions, while still computing with float precision. One could then look at using JuMP.jl in place of Optim.jl for explicit integer/rational optimization, if such a thing is possible.

That would indeed be more elegant I think!

Nelder-Mead, the simplex optimization, looks like it could be made to work on integers (although I assume it will not be optimal). An alternative would be to implement a really simple optimization for integers ("add/subtract 1 to a constant") together with a warning, and to offer the user a way to supply a custom optimizer if they want (i.e., they can write a function that takes the same arguments as the call to Optim.jl). What do you think about that?
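For the "add/subtract 1" idea, here is a minimal sketch of what such an integer optimizer could look like (optimize_integer_constants and its signature are hypothetical, not an existing hook in the package):

# Greedy coordinate descent over a vector of integer constants:
function optimize_integer_constants(loss, consts::Vector{Int}; max_passes::Int=10)
    best, best_loss = copy(consts), loss(consts)
    for _ in 1:max_passes
        improved = false
        for i in eachindex(best), step in (-1, 1)
            candidate = copy(best)
            candidate[i] += step
            l = loss(candidate)
            if l < best_loss
                best, best_loss = candidate, l
                improved = true
            end
        end
        improved || break
    end
    return best, best_loss
end

# Example: fit integer constants a, b in a*x + b to noisy data
x = collect(1.0:10.0)
y = 3 .* x .+ 2 .+ 0.01 .* randn(10)
loss(c) = sum(abs2, c[1] .* x .+ c[2] .- y)
optimize_integer_constants(loss, [1, 0])  # should land close to ([3, 2], small loss)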

In any case, the splitting into Node{T1}, Dataset{T2} needs to deal with the optimization too I think. Also with the simplification, since stuff like "const1 / const2" might introduce rationals into a formula that should only contain integers.

I just came across Julia and this project, so I hope I do not talk too much nonsense! :)

Thank you for creating this great project!


@johanbluecreek Locally, I made SymbolicRegression.jl generate any integer by using these adaptations: