MilesCranmer commented 2 years ago

PySR should be able to reuse parts of an equation.

In theory, a PySR equation (Node) type could have a single subtree used by multiple parts of the expression, as any two nodes can point to the same node.

# Node defines a symbolic expression stored in a binary tree.
# A single `Node` instance is one "node" of this tree, and
# has references to its children. By tracing through the children
# nodes, you can evaluate or print a given expression.
mutable struct Node
    degree::Int  # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
    constant::Bool  # false if variable
    val::CONST_TYPE  # If is a constant, this stores the actual value
    # ------------------- (possibly undefined below)
    feature::Int  # If is a variable (e.g., x in cos(x)), this stores the feature index.
    op::Int  # If operator, this is the index of the operator in options.binary_operators, or options.unary_operators
    l::Node  # Left child node. Only defined for degree=1 or degree=2.
    r::Node  # Right child node. Only defined for degree=2. 
end

The l and r attributes point to child nodes. These could in principle point to a node that is also pointed to by another node.

However, is that the best way to reuse parts of an expression? Also, how would copying work, with the pointer structure?

Perhaps all that is needed is to add a "reconnection" mutation, and update the copying procedure to pay attention to re-used pointers.

(Any help appreciated, if someone is interested in this)

MilesCranmer commented 1 year ago

Perhaps a better way of doing this is not to try to reconnect nodes, but to change the definition of compute_complexity to reduce the complexity if a particular subtree is repeated? This may be a bit overkill, but seems like an elegant way to both reward the search for using repetitive patterns, while allowing each pattern to be tweaked if needed.

Might be related to Kolmogorov Complexity: https://en.wikipedia.org/wiki/Kolmogorov_complexity. Is there any Julia package that lets us compute Kolmogov complexity of a particular sequence?

MilesCranmer commented 1 year ago

This might be useful: https://github.com/simonschoelly/InformationDistances.jl

>>> using InformationDistances
>>> compressor = LibDeflateCompressor()
>>> compressed_length(compressor, "abababababab")
6
>>> compressed_length(compressor, "ababababababab")
6
>>> compressed_length(compressor, "abababacbababab")
... # Note the "c"
17

MilesCranmer commented 1 year ago

135 will get part way to solving this.

MilesCranmer commented 1 year ago

Evolutionary aspect

I think we need a few genetic operators (in a new PR):

[ ] create_shared_expression Connect an operator to a different child node in the tree, so long as that child node is not a parent or grandparent. (The garbage collector should automatically delete the old child node, if no other nodes reference them.)
- Might think about doing an explicit search here for similar subtrees, instead of doing this completely randomly. I could see benefits of both routes.
[ ] break_shared_expression Break links to multiple parent nodes, by creating a separate copy for each parent.
- This mutation should only be possible if a tree has a shared subexpression.
- Might want to have a function to check this - it should be as simple as making an IdDict{Node{T},Int} to count the number of times a node appears in a tree. If any values in the dict are greater than 1, then there is a shared expression (and the check could immediately return).

Complexity calculations

[x] Need to modify compute_complexity and count_nodes to not double count.
[ ] Also, it might be worth it to re-weight the other mutation operators to account for a node being in multiple subtrees.

Other utils

[x] Need to look through library to assert that no other copy_node is being done inside a function. For example, I think convert(::Type{Node{T1}}, tree::Node{T2}) might be doing a copy?

Crossovers

[ ] I think a crossover should work as-is, since any copied node will have all its children copied if they don't already exist in the copied tree. But would be good to have some tests for this.

Evaluation

[ ] (lower-priority, since not really needed). It would be great to re-use evaluation results when a node is shared, rather than duplicate each time.

Printing

[ ] How should an expression with shared subexpressions be printed? Should it be something like: y = x1 - 3.2 * x2; z=cos(y) + y, or should the entire closed-form expression be printed?

@Jgmedina95 would you be up for implementing any of these?

viktmar commented 1 year ago

Sorry if I haven't seen your respective version of the following code. But I couldn't find it, so I propose the following for the complexity. It counts reused nodes with the complexity of 1 and does not continue traversing

function count_nodes_unique(tree::Node; tree_list=Node[], compl=0)
    compl += 1
    if !(tree in tree_list)
        push!(tree_list, tree)
        if tree.degree == 1
            compl_l, tree_list = count_nodes_unique(tree.l, tree_list=tree_list)
            compl += compl_l
        elseif tree.degree == 2
            compl_l, tree_list = count_nodes_unique(tree.l, tree_list=tree_list)
            compl_r, tree_list = count_nodes_unique(tree.r, tree_list=tree_list)
            compl += compl_l + compl_r
        end
        return compl, tree_list
    end
    return compl, tree_list
end

MilesCranmer commented 1 year ago

Hi @viktmar, I think this may not work because in measures equality == rather than reference equality ===. The current version in the shared-node-utils branch solves this as follows:

function _count_nodes(tree::Node{T}, nodes_seen::ID)::Int where {T,ID}
    !(ID <: Nothing) && haskey(nodes_seen, tree) && return 0
    count = if tree.degree == 0
        1
    elseif tree.degree == 1
        1 + _count_nodes(tree.l, nodes_seen)
    else
        1 + _count_nodes(tree.l, nodes_seen) + _count_nodes(tree.r, nodes_seen)
    end
    !(ID <: Nothing) && (nodes_seen[tree] = true)
    return count
end

Where nodes_seen is an IdDict{Node,Bool}. I did try simply passing in a vector of UInt, and pushing the objectid, but I found IdDict is faster for some reason. Cheers, Miles

MilesCranmer commented 1 year ago

Actually I take that back, it seems like in is actually looking for === rather than ==. Perhaps that's a faster way to check than IdDict...?

MilesCranmer commented 1 year ago

It's working!!!

Here's an example (on the shared-node-utils branch)

using SymbolicRegression

X = rand(Float32, 2, 100)
# We will repeat this part:
y = (X[1, :] .- X[2, :] .+ 1)
# Sum of powers:
y = y .^ 1.5 .+ y .^ 2.1 .+ y .^ 0.8

# Subtract mean of y:
y = y .- mean(y)

options = SymbolicRegression.Options(;
    binary_operators=(+, -, ^),  # npopulations=20
    complexity_of_variables=3,
    maxsize=40,
    constraints=[(^) => (-1, 1)],
    mutationWeights=[0.048, 0.47, 0.79, 5.1, 1.7, 0.0020, 0.00023, 0.05, 0.05, 0.21],
    # The third to last and second to last are make random connection, and break random connection
)

hall_of_fame = EquationSearch(X, y; niterations=4000, options=options, multithreading=true)

This seems to correctly build x1 - x2 + 1 as the subexpression, and then use it in power laws! It does take a while, though, so it's probably best if there is a specific flag to turn on shared expressions or not.

Printed expressions use {} to indicate that a node is shared. Sometimes you will see {} inside {} - that means it is sharing a node which, in that tree, shares another node!

Side note: would be good to refactor the mutation weights so that it's not just a long vector of numbers... Probably best to have it as a separate struct.

MilesCranmer commented 1 year ago

cc @CharFox1 @johanbluecreek

MilesCranmer commented 1 year ago

The current search is so slow with this that I think some functions are breaking hierarchies somehow. I think combine_operators and simplify_equation might(?) be breaking things... Need to investigate.

I think one needs to be careful about moving nodes around if those nodes are referenced by something - since they could end up a parent to their own parent!

dan-blank commented 2 months ago

@MilesCranmer That is a really interesting idea! My only contact so far was during my thesis, where I built a translation from a proof tree in our proof language into a sequential proof in another language (SMTCoq). I simplified first (simple tree traversal, no caching), and then for the more complicated phases, build a table of shared sub expressions (maximally shared, everything that was used more than once got a variable), swapping out the concrete sub trees with a variable pointing to the correct entry in the table. Those variables where variables like any other. That worked well, but it was a much less dynamic situation (proof in, altered proof out) than genetic programming.

My thoughts about this are:

One has to be careful about what one means by "sharing" exactly - one meaning could be "A and B use the same syntax", another could be "C and D refer to the same entity" (like object entity in OOP. Regarding simplification for example, A and B would then always be connected - since they evaluate to the same value - but might look different syntax-wise.) So there would be at least two kinds of bonds that could make sense and that would need to be considered during tree manipulation.
- Also, just because two expressions A and B might be syntactically identical, does not mean that they should be simplified in the same way (to stay at the topic of simplification) - B could be included in a different context than A, enabling a better simplification to be applied.
In Rust, they solve a lot of referential stuff with indexes that then are looked up (the borrow checker can make some situations which involve a lot of references and cross-references tricky).
I find anything other than structural equality ("same data -> same thing") hard to reason about (that might be a personal thing though!). Still, my advice would be to keep the identity equalities to the necessary minimum (that is, providing the links of the tree) and use indexes/hashes if the need arises.

Hope that made sense! By the way, I would be interested in working on simplifications (that would be https://github.com/SymbolicML/DynamicExpressions.jl, right?), if you would be interested :)

MilesCranmer / SymbolicRegression.jl

[Feature] Reusing parts of equation #113

135 will get part way to solving this.

Evolutionary aspect

Complexity calculations

Other utils

Crossovers

Evaluation

Printing