lacava opened 2 years ago
Hi there,
I started a new implementation of symbolic regression based on sympy (i.e. using sympy at its core, not only to format outputs). At some point I had to define complexity. Here are my first thoughts.
The basic implementation in sympy would be
from sympy import preorder_traversal
def complexity(expr):
    # Count every node (operator, symbol, constant) in the expression tree.
    return sum(1 for _ in preorder_traversal(expr))
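As a sanity check, the node count behaves as expected on simple expressions (a minimal, self-contained sketch):

```python
from sympy import preorder_traversal
from sympy.abc import x, y

def complexity(expr):
    # Count every node (operators, symbols, constants) in the expression tree.
    return sum(1 for _ in preorder_traversal(expr))

print(complexity(x + y))     # 3 nodes: Add, x, y
print(complexity(x**2 + y))  # 5 nodes: Add, Pow, x, 2, y
```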
That said, it does not handle all cases. A good example is protected division. The simplest way I could define it with sympy was
from sympy.abc import x, y
from sympy import Abs, Piecewise, S
pdiv = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))
The problem with this (otherwise proper) implementation is that the complexity, as defined above, artificially blows up. In cases where sympy can apply simplification, the expression reduces to x / y, and we are happy with that. In other cases, though, every internal operator inside the Piecewise expression adds 1 to the complexity.
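To make the blow-up concrete, here is a self-contained sketch comparing the node count of the Piecewise form against the plain division:

```python
from sympy import Abs, Piecewise, S, preorder_traversal
from sympy.abc import x, y

def complexity(expr):
    return sum(1 for _ in preorder_traversal(expr))

pdiv = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))

# The condition, the fallback branch, and the ExprCondPair wrappers
# all contribute nodes of their own.
print(complexity(x / y))  # 5 nodes: Mul, x, Pow, y, -1
print(complexity(pdiv))   # considerably more nodes than the plain division
```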
A workaround, just to handle piecewise operators, is to replace them, like this
def complexity(expr, complexity_map={Piecewise: lambda e: e.args[0].args[0]}):
    # Replace each mapped operator with the subexpression its accessor extracts.
    for op_type, accessor in complexity_map.items():
        founds = expr.find(op_type)
        expr = expr.subs(list(zip(founds, map(accessor, founds))))
    return sum(1 for _ in preorder_traversal(expr))
In the example above, the protected division is literally replaced by the plain division. But this is a bit tricky and error-prone ...
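Exercising the substitution workaround end to end, the Piecewise collapses to the plain division before counting (a self-contained sketch; the keyword default is the same map as above):

```python
from sympy import Abs, Piecewise, S, preorder_traversal
from sympy.abc import x, y

def complexity(expr, complexity_map={Piecewise: lambda e: e.args[0].args[0]}):
    # Replace each mapped operator with the subexpression its accessor extracts,
    # then count the nodes of the rewritten expression.
    for op_type, accessor in complexity_map.items():
        founds = expr.find(op_type)
        expr = expr.subs(list(zip(founds, map(accessor, founds))))
    return sum(1 for _ in preorder_traversal(expr))

pdiv = Piecewise((x / y, Abs(y) > 0.001), (S.One, True))
print(complexity(pdiv))  # same count as the plain x / y
```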
Another workaround would be to define a new operator, like so
from sympy import Abs, Function, S

class pdiv(Function):
    @classmethod
    def eval(cls, x, y):
        # Only evaluate for a concrete numeric denominator; for a symbolic y
        # the comparison cannot be decided, so the expression stays unevaluated.
        if y.is_number:
            if Abs(y) > 0.001:
                return x / y
            else:
                return S.One
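With eval guarded on concrete numbers (y.is_number rather than y.is_real, since a comparison against a symbolic real y cannot be coerced to a bool), the operator behaves as described (a sketch):

```python
from sympy import Function, S, preorder_traversal
from sympy.abc import x, y

class pdiv(Function):
    @classmethod
    def eval(cls, a, b):
        # Evaluate only when the denominator is a concrete number;
        # otherwise return None so pdiv(a, b) stays unevaluated.
        if b.is_number:
            return a / b if abs(b) > 0.001 else S.One

print(pdiv(x, 2))  # x/2: simplified when safe
print(pdiv(x, 0))  # 1: protected fallback value
expr = pdiv(x, y)  # stays symbolic
print(sum(1 for _ in preorder_traversal(expr)))  # 3 nodes: pdiv, x, y
```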
This way we get the best of both worlds: the protected division can be simplified to x / y when possible, or it just accounts for itself (pdiv), with a complexity of 1.
I hope this sheds some light on the problem of defining complexity. IMO, letting developers define it themselves risks them underestimating the complexity of the solutions they provide, which we cannot blame them for.
That's actually exactly how we're defining complexity.
And I'm using new operators for things like protected log.
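A protected log can be built the same way as the pdiv Function above. This is a hypothetical sketch: the name plog, the threshold, and the fallback value are assumptions, not necessarily what the benchmark actually uses:

```python
from sympy import Function, S, log

class plog(Function):
    # Hypothetical protected log: log(|a|) when |a| is above a small
    # threshold, and 0 otherwise; symbolic arguments stay unevaluated.
    @classmethod
    def eval(cls, a):
        if a.is_number:
            return log(abs(a)) if abs(a) > 0.001 else S.Zero

print(plog(S.One))  # 0, since log(1) = 0
print(plog(0))      # 0, the protected fallback
```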
The protected division might be good to add, but at the moment divisions are just being converted, which gets messy for idealized synthetic problems.
Good to see someone else arrive at the same solutions. But the issue I mean to raise here is that methods don't all return sympy-compatible models, so we end up with a bunch of post-processing conversions. I want to get rid of those and push the requirement onto the methods.
At the moment a lot of post-processing is done to convert the models returned by different methods into a common, sympy-compatible format in experiment/symbolic_utils.py.
I would like to remove this post-processing step and, in the future, require methods to return sympy-compatible strings. Steps:
See updated contribution guide