MilesCranmer / PySR

High-Performance Symbolic Regression in Python and Julia
https://astroautomata.com/PySR
Apache License 2.0

[Feature] Thresholding numerical constants #151

Open iamcalvinlau opened 2 years ago

iamcalvinlau commented 2 years ago

First off, thanks for a really great package! I'm definitely a fan of this for many reasons, but a nice transparent package to suggest an interpretable model instead of a black box? Instant fan!

Is your feature request related to a problem? Please describe. In my current attempts to use PySR, I mostly use inputs which are already dimensionless, so the numerical values in models should probably be near unity (unless there's an interesting factor I've missed, which is always a possibility).

For example, in a particular test case, I am finding a suggested model of

'2.5387976 \cdot 10^{-9} x_{1} + \frac{0.00049017137 x_{1}}{x_{0}}'

whereas the correct model would have been

'\frac{0.0005 x_{1}}{x_{0}}'

I understand that I can always take this suggested model and re-optimize the coefficients after the fact.

Describe the solution you'd like A preferable approach would be the ability to add conditions on constants so that they are dropped by thresholding. More specifically, I'm imagining an option to set a threshold on the magnitude of the constants, so that constants below the threshold are dropped or zeroed out after each iteration (or cycle per iteration?).

In the test above, the term with the constant '2.5387976 \cdot 10^{-9}' would then get dropped, and the resulting suggested model would look like

'\frac{0.00049017137 x_{1}}{x_{0}}'

Additional context I'm thinking about this along the lines of SINDy and its variants, which use something like sequential thresholding (a sketch of that idea is below); it also fits cases where a user has already pre-processed the data, so that a model which is probably correct would have constants near unity.
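For reference, here is a minimal sketch of what SINDy-style sequential thresholding looks like, in plain Julia (the function sequential_threshold and its keyword arguments are illustrative, not part of PySR or SymbolicRegression.jl):

    # Minimal sketch of SINDy-style sequential thresholding.
    # Θ is a library matrix of candidate terms, y the target vector.
    function sequential_threshold(Θ::Matrix, y::Vector; threshold=0.1, iters=10)
        ξ = Θ \ y                         # initial least-squares coefficients
        for _ in 1:iters
            small = abs.(ξ) .< threshold  # terms with near-zero coefficients
            ξ[small] .= 0.0               # drop them
            big = .!small
            any(big) || break             # everything was thresholded away
            ξ[big] = Θ[:, big] \ y        # refit the surviving terms
        end
        return ξ
    end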

Thanks in advance!

MilesCranmer commented 2 years ago

Hi @iamcalvinlau,

Thanks for the interesting question, and happy to hear you are enjoying the package! This is not built-in, but here is one suggestion. This line of the backend code: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/a9e696d513a791f48b18389f5e23830d90d32414/src/EvaluateEquation.jl#L145-L149 evaluates a leaf node and returns an array containing its value. The leaf node is either a constant (if tree.constant is true) or a variable.

You could basically tweak the constant line to constrain the constant value to be close to unity. For example:

    if tree.constant
        # Constant leaf: squash the raw value into (0.9, 1.1) so that
        # every fitted constant stays near unity.
        original_constant = tree.val
        adjusted_constant = 1 + 0.1 * tanh(original_constant)
        return (fill(convert(T, adjusted_constant), n), true)
    else
        # Variable leaf: return the corresponding feature column.
        return (cX[tree.feature, :], true)
    end

So, I have made it so that the constant is always between 0.9 and 1.1, by passing it through the tanh function (which is built-in).

Then, give PySR the argument julia_project="<location of .../SymbolicRegression.jl/>" to point it to your custom backend.

You also need to modify this line: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/a9e696d513a791f48b18389f5e23830d90d32414/src/Equation.jl#L155 with the same transform to tree.val, so that it gives the correct equation back to PySR (and prints it correctly!).
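For concreteness, the matching change might look something like this (a sketch only, under the assumption that the linked line is the branch that stringifies a constant leaf; the exact surrounding code may differ):

    # Apply the same transform before stringifying, so the printed
    # equation matches what was actually evaluated.
    if tree.constant
        adjusted_constant = 1 + 0.1 * tanh(tree.val)
        return string(adjusted_constant)
    end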

Cheers, Miles

iamcalvinlau commented 2 years ago

Hi Miles,

Thanks for the suggestions!

I ended up doing something like this (tied to parsimony because I was lazy rather than efficient; shown here in case anyone wants to see the ugly way I did it):

(EvaluateEquation.jl)

    original_constant = tree.val
    # Number of significant digits to keep, derived from the parsimony setting
    n_sigdigits = round(Int, 1 - log10(options.parsimony))
    adjusted_constant = round(original_constant, sigdigits=n_sigdigits)
    # Zero out constants whose magnitude falls below the threshold
    threshold = options.parsimony * 1e-2
    if abs(original_constant) < threshold
        adjusted_constant = 0.0
    end
    return (fill(convert(T, adjusted_constant), n), true)

(Equation.jl)

    original_constant = tree.val
    n_sigdigits = round(Int, 1 - log10(options.parsimony))
    adjusted_constant = round(original_constant, sigdigits=n_sigdigits)
    threshold = options.parsimony * 1e-2
    if abs(original_constant) < threshold
        adjusted_constant = 0.0
    end
    # Print the thresholded constant so the output matches evaluation
    return string(adjusted_constant)
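To make the rounding rule concrete, assume parsimony = 1e-3 (just an illustrative value):

    julia> parsimony = 1e-3;

    julia> round(Int, 1 - log10(parsimony))  # significant digits kept
    4

    julia> parsimony * 1e-2                  # constants below this are zeroed
    1.0e-5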

In the end, I was getting nicely printed equations of the right form, but I also realized that terms with zero-valued coefficients were still being counted toward the complexity.

More specifically, I was finding equations of the form

'0.0 x_{1} + \frac{0.000490 x_{1}}{x_{0}}'

but it was still counting that first term on the left as part of the overall complexity.

Sorry for the add-on question, but do you have a suggestion on how to make the complexity calculation not take those terms into account?

Thanks! --Calvin

MilesCranmer commented 2 years ago

Hi Calvin,

Great that you got it to work!

Regarding not counting 0's, here are two hacks:

  1. Complexity is computed by this function: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/a9e696d513a791f48b18389f5e23830d90d32414/src/EquationUtils.jl#L76. You want to tweak it so it does not count a constant if it is 0, by changing this line: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/a9e696d513a791f48b18389f5e23830d90d32414/src/EquationUtils.jl#L89 to return 0 if tree.val (or your processed version of tree.val) is 0 (see the sketch after this list).
  2. You could add something in simplify_tree (https://github.com/MilesCranmer/SymbolicRegression.jl/blob/a9e696d513a791f48b18389f5e23830d90d32414/src/SimplifyEquation.jl#L96) to automatically delete part of a tree if it is 0.0 multiplied by something. This requires a bit of work though.
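Here is a rough sketch of what hack 1 could look like (hypothetical; the real structure of the complexity function in EquationUtils.jl may differ, so treat this as the shape of the change rather than a drop-in patch):

    # In the leaf branch of the complexity computation:
    if tree.degree == 0 && tree.constant
        # A zeroed-out constant contributes nothing to complexity;
        # other constants keep their usual cost.
        return tree.val == 0 ? 0 : options.complexity_mapping.constant_complexity
    end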

However, in general, if you run it long enough and with enough populations, it should automatically get rid of terms which don't improve the accuracy of an expression. So you shouldn't in principle need either of these options.

Cheers, Miles

MilesCranmer commented 2 years ago

(options.complexity_mapping.constant_complexity defaults to 1, but you can re-define it by passing complexity_of_constants=... to PySR.)