MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://astroautomata.com/SymbolicRegression.jl/dev/
Apache License 2.0

Multidimensional equations #262

Open cmhamel opened 11 months ago

cmhamel commented 11 months ago

Is there a way to optimize a multidimensional symbolic equation?

From what I can tell from the documentation and from toying with the package, each output is given its own symbolic form. Is this correct?

MilesCranmer commented 11 months ago

Not currently. However, DynamicExpressions.jl, which forms the expression backend, can indeed handle this: https://github.com/SymbolicML/DynamicExpressions.jl/#tensors. So I just need to find some time to try turning it on and fixing up various type assumptions.

Alternatively, you can implement this manually via a custom loss objective, where the objective splits a single expression into each component of the vector output. See https://astroautomata.com/PySR/examples/#9-custom-objectives for an example. (That shows the Python API, but it's the same on the Julia side; just convert full_objective -> loss_function and pass a function rather than a string.)
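
For instance, the Julia wiring might look something like the following sketch, assuming the loss_function keyword of Options and the equation_search entry point from recent SymbolicRegression.jl versions (my_custom_objective is the function sketched in the next comment):

using SymbolicRegression

# Toy data: 5 genuine features, plus features 6 and 7 holding the
# 2nd and 3rd output components (the packing trick described below):
X = randn(Float64, 7, 100)
y = randn(Float64, 100)  # 1st output component

options = Options(;
    binary_operators=[+, -, *, /],
    loss_function=my_custom_objective,  # sketched in the next comment
)

hall_of_fame = equation_search(X, y; options=options, niterations=40)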

cmhamel commented 11 months ago

Thanks @MilesCranmer!

The custom loss is probably what I need. To make sure I understand: let's say I have a 3D equation I'm trying to fit. Would I just need to ensure it's a binary tree, split the root, and then, say, split the left node of the root again to fill out three expressions?

MilesCranmer commented 11 months ago

Yeah, exactly!!

Another tricky part comes from the fact that Dataset.y is a 1D vector. Thus, you could put the 1st output component into y, and the 2nd and 3rd components into extra features of X (note that dataset.X is stored features × samples, so these are its last rows). Then, in your loss function, you could add a check that those features never show up in the expression, like this:

using SymbolicRegression: Dataset

function my_custom_objective(tree, dataset::Dataset{T,L}, options) where {T,L}
    # Return infinite loss for any violated assumptions:
    tree.degree != 2 && return L(Inf)
    tree.l.degree != 2 && return L(Inf)

    # Say the 2nd output component is feature 6 in X, and the 3rd is feature 7.
    # This checks whether a given node is one of those feature nodes:
    is_feature_6_or_7(node) = node.degree == 0 && !node.constant && (node.feature == 6 || node.feature == 7)

    # Iterate through all nodes in the tree; if any match, return infinite loss:
    any(is_feature_6_or_7, tree) && return L(Inf)

    y1 = dataset.y
    y2 = dataset.X[6, :]
    y3 = dataset.X[7, :]
    # [rest of loss function]
end
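
The elided part might look something like this sketch, assuming the exported eval_tree_array and a plain sum-of-squares loss (the mapping of subtrees to output components is an arbitrary choice for illustration):

    # Hypothetical completion: evaluate each subexpression against its target.
    # Arbitrary mapping: tree.l.l -> y1, tree.l.r -> y2, tree.r -> y3.
    pred1, ok1 = eval_tree_array(tree.l.l, dataset.X, options)
    pred2, ok2 = eval_tree_array(tree.l.r, dataset.X, options)
    pred3, ok3 = eval_tree_array(tree.r, dataset.X, options)
    (ok1 && ok2 && ok3) || return L(Inf)

    # Sum-of-squares error over the three output components:
    loss = sum(abs2, pred1 .- y1) + sum(abs2, pred2 .- y2) + sum(abs2, pred3 .- y3)
    return L(loss / dataset.n)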

Also note that you will have to manually extract the subexpressions at the very end (since the printing does not know about your scheme).
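
For example, the extraction might look like this (a sketch, assuming the calculate_pareto_frontier and string_tree helpers exported by SymbolicRegression.jl, and the subtree-to-output mapping used above):

# Hypothetical: pull the three subexpressions out of the best tree.
dominating = calculate_pareto_frontier(hall_of_fame)
best = dominating[end].tree  # most accurate expression on the Pareto front

println("y1 = ", string_tree(best.l.l, options))
println("y2 = ", string_tree(best.l.r, options))
println("y3 = ", string_tree(best.r, options))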

MilesCranmer commented 11 months ago

Also, one other thought – returning Inf might be too harsh. What you could do instead is return L(10000) if tree.degree != 2, but only L(1000) (i.e., 10x lower) if tree.l.degree != 2, and L(100) for the feature violation. That way you are at least telling the genetic algorithm to go in the right direction; otherwise it might never create a tree with a binary node -> binary node, and just get stuck.
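
In code, that graded scheme would just replace the early returns in the objective above (the magnitudes are illustrative; pick values that dominate your typical data loss):

    # Graded penalties instead of Inf, so the search has a direction to follow:
    tree.degree != 2 && return L(10000)             # worst: root is not binary
    tree.l.degree != 2 && return L(1000)            # closer: root ok, left child not binary
    any(is_feature_6_or_7, tree) && return L(100)   # closest: only a feature violation left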