MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://astroautomata.com/SymbolicRegression.jl/dev/
Apache License 2.0
576 stars 71 forks source link

[Feature] Reusing parts of equation #113

Open MilesCranmer opened 2 years ago

MilesCranmer commented 2 years ago

PySR should be able to reuse parts of an equation.

In theory, a PySR equation (Node) type could have a single subtree used by multiple parts of the expression, as any two nodes can point to the same node.

# Node defines a symbolic expression stored in a binary tree.
# A single `Node` instance is one "node" of this tree, and
# has references to its children. By tracing through the children
# nodes, you can evaluate or print a given expression.
mutable struct Node
    degree::Int  # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
    constant::Bool  # false if variable
    val::CONST_TYPE  # If is a constant, this stores the actual value
    # ------------------- (possibly undefined below)
    feature::Int  # If is a variable (e.g., x in cos(x)), this stores the feature index.
    op::Int  # If operator, this is the index of the operator in options.binary_operators, or options.unary_operators
    l::Node  # Left child node. Only defined for degree=1 or degree=2.
    r::Node  # Right child node. Only defined for degree=2. 
end

The l and r attributes point to child nodes. These could in principle point to a node that is also pointed to by another node.

However, is that the best way to reuse parts of an expression? Also, how would copying work, with the pointer structure?

Perhaps all that is needed is to add a "reconnection" mutation, and update the copying procedure to pay attention to re-used pointers.


(Any help appreciated, if someone is interested in this)

MilesCranmer commented 1 year ago

Perhaps a better way of doing this is not to try to reconnect nodes, but to change the definition of compute_complexity to reduce the complexity if a particular subtree is repeated? This may be a bit overkill, but seems like an elegant way to both reward the search for using repetitive patterns, while allowing each pattern to be tweaked if needed.

Might be related to Kolmogorov Complexity: https://en.wikipedia.org/wiki/Kolmogorov_complexity. Is there any Julia package that lets us compute Kolmogov complexity of a particular sequence?

MilesCranmer commented 1 year ago

This might be useful: https://github.com/simonschoelly/InformationDistances.jl

>>> using InformationDistances
>>> compressor = LibDeflateCompressor()
>>> compressed_length(compressor, "abababababab")
6
>>> compressed_length(compressor, "ababababababab")
6
>>> compressed_length(compressor, "abababacbababab")
... # Note the "c"
17
MilesCranmer commented 1 year ago

135 will get part way to solving this.

MilesCranmer commented 1 year ago

Evolutionary aspect

I think we need a few genetic operators (in a new PR):

Complexity calculations

Other utils

Crossovers

Evaluation

Printing

@Jgmedina95 would you be up for implementing any of these?

viktmar commented 1 year ago

Sorry if I haven't seen your respective version of the following code. But I couldn't find it, so I propose the following for the complexity. It counts reused nodes with the complexity of 1 and does not continue traversing

function count_nodes_unique(tree::Node; tree_list=Node[], compl=0)
    compl += 1
    if !(tree in tree_list)
        push!(tree_list, tree)
        if tree.degree == 1
            compl_l, tree_list = count_nodes_unique(tree.l, tree_list=tree_list)
            compl += compl_l
        elseif tree.degree == 2
            compl_l, tree_list = count_nodes_unique(tree.l, tree_list=tree_list)
            compl_r, tree_list = count_nodes_unique(tree.r, tree_list=tree_list)
            compl += compl_l + compl_r
        end
        return compl, tree_list
    end
    return compl, tree_list
end
MilesCranmer commented 1 year ago

Hi @viktmar, I think this may not work because in measures equality == rather than reference equality ===. The current version in the shared-node-utils branch solves this as follows:

function _count_nodes(tree::Node{T}, nodes_seen::ID)::Int where {T,ID}
    !(ID <: Nothing) && haskey(nodes_seen, tree) && return 0
    count = if tree.degree == 0
        1
    elseif tree.degree == 1
        1 + _count_nodes(tree.l, nodes_seen)
    else
        1 + _count_nodes(tree.l, nodes_seen) + _count_nodes(tree.r, nodes_seen)
    end
    !(ID <: Nothing) && (nodes_seen[tree] = true)
    return count
end

Where nodes_seen is an IdDict{Node,Bool}. I did try simply passing in a vector of UInt, and pushing the objectid, but I found IdDict is faster for some reason. Cheers, Miles

MilesCranmer commented 1 year ago

Actually I take that back, it seems like in is actually looking for === rather than ==. Perhaps that's a faster way to check than IdDict...?

MilesCranmer commented 1 year ago

It's working!!!

Here's an example (on the shared-node-utils branch)

using SymbolicRegression

X = rand(Float32, 2, 100)
# We will repeat this part:
y = (X[1, :] .- X[2, :] .+ 1)
# Sum of powers:
y = y .^ 1.5 .+ y .^ 2.1 .+ y .^ 0.8

# Subtract mean of y:
y = y .- mean(y)

options = SymbolicRegression.Options(;
    binary_operators=(+, -, ^),  # npopulations=20
    complexity_of_variables=3,
    maxsize=40,
    constraints=[(^) => (-1, 1)],
    mutationWeights=[0.048, 0.47, 0.79, 5.1, 1.7, 0.0020, 0.00023, 0.05, 0.05, 0.21],
    # The third to last and second to last are make random connection, and break random connection
)

hall_of_fame = EquationSearch(X, y; niterations=4000, options=options, multithreading=true)

This seems to correctly build x1 - x2 + 1 as the subexpression, and then use it in power laws! It does take a while, though, so it's probably best if there is a specific flag to turn on shared expressions or not.

Printed expressions use {} to indicate that a node is shared. Sometimes you will see {} inside {} - that means it is sharing a node which, in that tree, shares another node!


Side note: would be good to refactor the mutation weights so that it's not just a long vector of numbers... Probably best to have it as a separate struct.

MilesCranmer commented 1 year ago

cc @CharFox1 @johanbluecreek

MilesCranmer commented 1 year ago

The current search is so slow with this that I think some functions are breaking hierarchies somehow. I think combine_operators and simplify_equation might(?) be breaking things... Need to investigate.

I think one needs to be careful about moving nodes around if those nodes are referenced by something - since they could end up a parent to their own parent!

dan-blank commented 2 months ago

@MilesCranmer That is a really interesting idea! My only contact so far was during my thesis, where I built a translation from a proof tree in our proof language into a sequential proof in another language (SMTCoq). I simplified first (simple tree traversal, no caching), and then for the more complicated phases, build a table of shared sub expressions (maximally shared, everything that was used more than once got a variable), swapping out the concrete sub trees with a variable pointing to the correct entry in the table. Those variables where variables like any other. That worked well, but it was a much less dynamic situation (proof in, altered proof out) than genetic programming.

My thoughts about this are:

Hope that made sense! By the way, I would be interested in working on simplifications (that would be https://github.com/SymbolicML/DynamicExpressions.jl, right?), if you would be interested :)