MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://ai.damtp.cam.ac.uk/symbolicregression/
Apache License 2.0
644 stars 86 forks source link

Issues with simplification backend #59

Closed MilesCranmer closed 2 years ago

MilesCranmer commented 2 years ago

cc @AlCap23

I've noticed many times during training that equations are overly complex, so I think simplification isn't working for some reason. I narrowed it down to this: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/561c9d7270073e28b1c82d43d7c328c5af0ebb1a/src/SimplifyEquation.jl#L153-L155

I think this is too general and is hiding the current issues in the simplifier. When I remove it, I see that many times, simplification is skipped for equations that should normally be easy to simplify.

@AlCap23 do you think you could help me fix the convert functions to capture the current API of SymbolicUtils? I think it will greatly improve the results of EquationSearch. Thanks!

MilesCranmer commented 2 years ago

@AlCap23 this might be useful for debugging:

    catch e
        @warn "Error in symbolic_util simplify: $(e)\nFrom initial equation: $(stringTree(init_node, options))"
        return init_node
    end

where the preamble needs:

using Logging
@from "EquationUtils.jl" import countNodes, stringTree

I have this implemented on the dev branch.

MilesCranmer commented 2 years ago

For example, the following tree:

> tree = Node("x1") * Node("x1") + Node("x1") * Node("x1") + 3f0
> printTree(tree, options)

(((x1 * x1) + (x1 * x1)) + 3.0)

cannot be simplified:

> simplifyWithSymbolicUtils(tree, options)
Error in symbolic_util simplify: ErrorException("Operation ^ in expression not found in operations (+, *, /, -)!")

SymbolicUtils normally converts multiplication into a polynomial basis, then collects common terms, which is fine, because at the end I run them through multiply_powers to convert ^ into repeated multiplication. However it looks like multiply_powers is not being called so the ^ is being left in the SymbolicUtils version of the equation, which results in an error.

multiply_powers: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/561c9d7270073e28b1c82d43d7c328c5af0ebb1a/src/CustomSymbolicUtilsSimplification.jl#L16-L78

MilesCranmer commented 2 years ago

I see a problem:

julia> using SymbolicUtils
julia> @syms x
julia> x * x
x ^ 2

This must have become the default behaviour in a new SymbolicUtils. This is why simplification is broken.

MilesCranmer commented 2 years ago

Raised an issue on SymbolicUtils.jl: #428

cobac commented 2 years ago

Just a comment:

I had a similar issue last year implementing my own symbolic regression program. It also created multiple other issues with my algorithm and I ended up deciding against simplifying expressions at all during the search. I expected it to reduce the size of the space of all possible expressions considerably, but it ended up hurting discoverability a lot. Have you benchmarked your algorithm with and without simplifying equations during the search?

ChrisRackauckas commented 2 years ago

Indeed I think it would help discoverability because it collapses the search space. Note that Symbolics.jl is not "simplifying" the expressions, that's what it looks like to the user but that's somewhat of a misnomer. In reality, Symbolics.jl is doing things like storing expressions like

Add(T, coeff, dict::Dict)

Represents `coeff + (key1 * val1) + (key2 * val2) + ...`

That representation will automatically simplify things when you put things into it, because adding a like key to the dictionary and incrementing will do x + x => 2x. So one reason for this is cheap auto-simplify, which gave around a 1000x speedup. But the other thing that happens is that searches in these representations are a lot leaner. x + 2y and 2y + x have the same representation, so the graph form Term(+,[Term(*,[2,y]),x]) has a combinatorically larger search space, with an exponential growth in the number of redundant representations. simplify on the non-automatic operations thus is a lot faster because it does not have to check all permutations of a + (b + c) vs (a + b) + c etc. because that is handled by the Add canonical form.

What @shashi is doing is creating variable types that do no use this canonical form and instead default to always building Term graphs, though I would think that really using the reduced representations should just end up giving better algorithms by encoding implicit constraints to be explicit.

shashi commented 2 years ago

Hope @syms x::LiteralReal works out here!

MilesCranmer commented 2 years ago

Thanks @shashi! Will work on integrating this.

@ChrisRackauckas, thanks for explaining all of this. Indeed this data structure seems to make a lot of sense and it is unfortunate I can't exploit it here for all cases. (It makes me think of a radix sort - the way each specific term is assigned to a specific bin.) I wonder if there is a way to have a similar data structure, but for arbitrary user-defined binary operators (+ set of rules).

@cobac - I haven't done a head to head comparison personally, but apparently some PySR users have been using earlier versions of PySR because they produced better results and faster. I looked into why this is and it was indeed because the simplification was turned on for those previous versions. I note that simplification only occurs 1 in 1000 mutations in SymbolicRegression.jl (controlled by one of the hyperparams) - which allows the evolution to freely explore equation space without being collapsed into a simplified form. Maybe this is why it doesn't hurt my performance but actually helps it.

AlCap23 commented 2 years ago

Hey, sorry for being late to the party! @MilesCranmer to you still need help or did the proposed solution of @shashi worked? Otherwise, I'll start a PR based on the example above within one week.

MilesCranmer commented 2 years ago

Thanks @AlCap23! I haven't started on this yet so if you have time to work on it I would definitely appreciate the assistance.

To get started, I would revert 3b2608da21a0c9c57282cf40feac7649ef9849d9, bump SymbolicUtils to 0.19, then switch the symbol mappings over to LiteralReal.

The unit test here: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/8d286a258229771f4071e4f57559ba26b893d11e/test/test_simplification.jl#L23-L26 should capture the desired behaviour: whether x1 x1 + x1 x1 simplifies to 2 x1 x1, rather than 2 * x1 ^ 2, given that the user does not have ^ in their operator list.

AlCap23 commented 2 years ago

Sorry for the delay, I have been busy with another PR right now and start moving this week 😅. I'll probably have time in two weeks to have a look. Would that still be alright?

MilesCranmer commented 2 years ago

Sure, no problem! Hope your moving goes well :)

MilesCranmer commented 2 years ago

Another potential solution is to remove the simplification stuff. I'm not actually using it by default now, as I think large expressions slow down things quite a bit and caused some crashes. But the search is still quite good without a simplification algorithm; it seems to simplify automatically just from via the complexity penalty.

If the simplification backend is removed then all I need is the export to SymbolicUtils.jl, which is fairly easy to get working again.

MilesCranmer commented 2 years ago

Fixed on v0.9.0+