cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0
203 stars 75 forks

sympy-compatibility of final model strings #58

Open lacava opened 2 years ago

lacava commented 2 years ago

At the moment a lot of preprocessing is done to convert the models returned by different methods into a common, sympy-compatible format in experiment/symbolic_utils.py.

I would like to remove this post-processing step and, in the future, require methods to return sympy compatible strings. Steps:

  1. Move centralized model cleaning to the individual methods
  2. Have method developers update their codebases to support sympy return strings

See updated contribution guide
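As a sketch of what the requirement could look like on the method side (the strict-parse check and the helper name here are assumptions for illustration, not part of the current harness):

```python
import sympy

def is_sympy_compatible(model_str, feature_names=("x0", "x1")):
    """Return True if a model string parses into a sympy expression
    whose free symbols are all declared feature names."""
    local_dict = {name: sympy.Symbol(name) for name in feature_names}
    try:
        expr = sympy.sympify(model_str, locals=local_dict)
    except (sympy.SympifyError, SyntaxError, TypeError):
        return False
    return expr.free_symbols <= set(local_dict.values())

# "x0*sin(x1)" parses cleanly; "x0 $ x1" does not
```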

remiadon commented 2 years ago

Hi there,

I started a new implementation of symbolic regression based on sympy (i.e. using sympy at its core, not only to format outputs). At some point I had to define complexity. Here are my first thoughts.


The basic implementation in sympy would be

from sympy import preorder_traversal
def complexity(expr):
    # one unit per node of the expression tree, operators and leaves alike
    return sum(1 for _ in preorder_traversal(expr))
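As a quick sanity check of that node-count definition (the variables here are just illustrative):

```python
from sympy import preorder_traversal, sin, symbols

def complexity(expr):
    # one unit per node of the expression tree, operators and leaves alike
    return sum(1 for _ in preorder_traversal(expr))

x, y = symbols("x y")
print(complexity(x + y))       # Add, x, y -> 3
print(complexity(x * sin(y)))  # Mul, x, sin, y -> 4
```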

That being said, it does not handle all cases. A good example is protected division. The simplest way I could define it with sympy was

from sympy.abc import x, y
from sympy import Abs, Piecewise, S
pdiv = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))

The problem with this (otherwise proper) implementation is that the complexity, as defined above, artificially blows up. In cases where sympy can apply simplification, the expression reduces to x / y, and we are happy with that. In other cases, though, every internal operator inside the Piecewise expression adds 1 to the complexity.
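To make the blow-up concrete, comparing node counts of the plain division against the Piecewise form, with the same counting as before:

```python
from sympy import Abs, Piecewise, S, preorder_traversal, symbols

def complexity(expr):
    return sum(1 for _ in preorder_traversal(expr))

x, y = symbols("x y")
plain = x / y
protected = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))

# the Piecewise carries the condition and second branch as extra nodes
print(complexity(plain))      # 5: Mul, x, Pow, y, -1
print(complexity(protected))  # strictly larger
```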

A workaround, just to handle piecewise operators, is to replace them, like this

from sympy import Piecewise, preorder_traversal

def complexity(expr, complexity_map={Piecewise: lambda e: e.args[0].args[0]}):
    # replace each mapped operator with the sub-expression its accessor selects
    for op_type, accessor in complexity_map.items():
        founds = expr.find(op_type)
        expr = expr.subs(zip(founds, map(accessor, founds)))
    return sum(1 for _ in preorder_traversal(expr))

In the above example the protected division is literally replaced by the plain division. But this is a bit tricky, and error-prone ...
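Applied to the Piecewise form of the protected division, the substitution does bring the count back down to that of the plain division (self-contained repeat of the pieces, with the accessor picking the first branch's expression):

```python
from sympy import Abs, Piecewise, S, preorder_traversal, symbols

x, y = symbols("x y")
protected = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))

def complexity(expr, complexity_map={Piecewise: lambda e: e.args[0].args[0]}):
    # swap each mapped operator for the sub-expression its accessor selects
    for op_type, accessor in complexity_map.items():
        founds = expr.find(op_type)
        expr = expr.subs(zip(founds, map(accessor, founds)))
    return sum(1 for _ in preorder_traversal(expr))

print(complexity(protected))  # 5, same as for x / y
```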

Another workaround would be to define a new operator, like so

from sympy import Abs, Function, S
class pdiv(Function):
    @classmethod
    def eval(cls, x, y):
        # only evaluate when y is a concrete real number, so the
        # comparison is decidable; otherwise stay symbolic
        if y.is_number and y.is_real:
            if Abs(y) > 0.001:
                return x / y
            else:
                return S.One

This way we get the best of both worlds: the protected division simplifies to x / y when possible, or just accounts for itself (pdiv), with a complexity of 1.
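A quick check of that behaviour (with the Abs import added, and evaluation guarded to numeric y so the comparison can actually be decided):

```python
from sympy import Abs, Function, S, preorder_traversal, symbols

class pdiv(Function):
    @classmethod
    def eval(cls, x, y):
        # only evaluate for concrete real y; stay symbolic otherwise
        if y.is_number and y.is_real:
            return x / y if Abs(y) > 0.001 else S.One

x, y = symbols("x y")
print(pdiv(x, S(2)))    # x/2: simplified when the guard is decidable
print(pdiv(x, S.Zero))  # 1: the protected branch
sym = pdiv(x, y)        # stays as pdiv(x, y)
print(sum(1 for _ in preorder_traversal(sym)))  # 3: pdiv, x, y
```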

I hope this sheds some light on the problem of defining complexity. IMO, letting developers define it themselves makes it likely they will underestimate the complexity of the solutions they provide, which we cannot blame them for.

lacava commented 2 years ago

that's actually exactly how we're defining complexity:

https://github.com/cavalab/srbench/blob/e5ded4715ed5721703353d3500a2fdb99004faf1/postprocessing/symbolic_utils.py#L12-L16

And I'm using new operators for things like protected log

https://github.com/cavalab/srbench/blob/e5ded4715ed5721703353d3500a2fdb99004faf1/experiment/symbolic_utils.py#L42-L49

The protected division might be good to add, but at the moment, divisions are just being converted. It's messy when it comes to idealized synthetic problems.

Good to see someone else come up with the same solutions. But the issue I mean to raise here is that not all methods return sympy-compatible models, so we end up with a bunch of post-processing conversions. I want to get rid of those and push the requirement onto the methods.