MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://astroautomata.com/SymbolicRegression.jl/dev/
Apache License 2.0

(Non-)negative custom loss functions #81

Open marknovak opened 2 years ago

marknovak commented 2 years ago

Please forgive me if I'm missing something basic, but is there a reason that custom loss functions aren't allowed to go negative? I'm using PySR 0.7.9.

RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: DomainError with -5.748968e30:
Your loss function must be non-negative.

Is there something specific in the context of PySR that prevents the use of negative loss functions? My question stems from wanting to use a Poisson loss, e.g., the deviance "loss(x, y) = 2 * (x * log(x / y) + y - x)" or the likelihood "loss(x, y) = y - x * log(y)". These and the like seem to be used in other ML contexts too (e.g., https://www.tensorflow.org/api_docs/python/tf/keras/losses/Poisson and https://github.com/JuliaML/LossFunctions.jl/).

(I presume it must have something to do with the choice of the underlying optimizer that's being used. If that choice is inflexible, then perhaps it would be useful to explain the requirement in the docs regarding custom loss functions, which I see from other Issue responses are a work-in-progress.)
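For concreteness, the two Poisson forms above behave differently: the deviance is non-negative for valid positive inputs, while the likelihood form can dip below zero. A quick check in Python (the values here are illustrative, not from any real fit):

```python
import math

def poisson_deviance(x, y):
    """Poisson deviance loss: 2 * (x * log(x / y) + y - x).
    Non-negative for positive observed x and predicted y."""
    return 2 * (x * math.log(x / y) + y - x)

def poisson_nll(x, y):
    """Poisson likelihood loss (negative log-likelihood up to a constant):
    y - x * log(y). Can go negative."""
    return y - x * math.log(y)

# Observed count x = 10, predicted mean y = 8
dev = poisson_deviance(10, 8)   # small positive value
nll = poisson_nll(10, 8)        # negative value
```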

Thanks!

MilesCranmer commented 2 years ago

Loss functions must be non-negative because the logarithm of the loss is taken when computing the accuracy/complexity tradeoff. This tradeoff metric was originally defined assuming mean-square error loss, but it is applied to any other loss function as well, and is equal to -d(log(loss))/d(complexity), i.e., the maximal decrease in log(loss) for an increase in complexity. There is no rigorous theory behind it; it is simply a metric commonly used in SR (I first saw it in Schmidt+Lipson 2009, but it might have been around earlier) for picking out the "true" equation from the list of most accurate equations at each complexity. However, because it involves log(loss), the loss must be positive.
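To sketch how that tradeoff metric works (the Pareto-front numbers and the helper name below are made up for illustration), one can compute a finite-difference version of -d(log(loss))/d(complexity) between consecutive front entries and pick the steepest drop:

```python
import math

# Hypothetical Pareto front of (complexity, loss) pairs -- numbers are invented
front = [(1, 2.5), (3, 1.1), (5, 0.9), (7, 0.02)]

def scores(front):
    """Finite-difference version of -d(log(loss))/d(complexity)
    between consecutive (complexity, loss) entries on the front."""
    return [
        -(math.log(l2) - math.log(l1)) / (c2 - c1)
        for (c1, l1), (c2, l2) in zip(front, front[1:])
    ]

s = scores(front)
best = s.index(max(s))  # jump with the steepest drop in log(loss)
```

Here the large drop in loss from complexity 5 to 7 would be flagged as the most promising jump, which is how the metric surfaces a candidate "true" equation. A negative loss would make `math.log` fail, hence the non-negativity requirement.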

This metric doesn't actually affect the search itself, so if you desire a negative loss, maybe you could wrap it with an exponential? i.e., "loss(x, y) = exp(2 * (x * log(x / y) + y - x))". Unless you are using annealing, any monotonic transform applied to the loss will not affect the search in any way.
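To see why the exp wrap is harmless for the search (assuming no annealing), note that any strictly increasing transform preserves the ordering of candidate losses, so selection among expressions is unchanged. A quick Python check with made-up loss values:

```python
import math

# Hypothetical raw losses from four candidate expressions (can be negative)
raw = [-5.0, -1.2, 0.3, 2.0]

# Wrapping in exp makes every value strictly positive...
wrapped = [math.exp(l) for l in raw]
assert all(w > 0 for w in wrapped)

# ...while preserving the ranking of candidates, since exp is monotonic
rank = lambda xs: sorted(range(len(xs)), key=xs.__getitem__)
assert rank(raw) == rank(wrapped)
```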

Yes, apologies for the lack of documentation on this particular part of the library!

MilesCranmer commented 2 years ago

In the meantime I have added a better error message on this, suggesting that the user wrap their loss function with exp.

marknovak commented 2 years ago

Perfect. Thanks! (And no need for apologies!)

MilesCranmer commented 2 years ago

At a meta level, I just realized how ironic it is that I use the empirical expression -d(log(loss)) / d(complexity) based on it being commonly used, rather than learning such a metric from scratch (using, say, many different example expressions) with a genetic algorithm!

marknovak commented 2 years ago

I've had similar thoughts regarding the loss function itself. For example, it's becoming standard in my subfield to assume a Poisson likelihood for count data (hence my desire to use it as a loss function) rather than dealing with estimating an additional (nuisance) parameter that allows for over/under-dispersion (as in the negative binomial). I've not seen SR applications that enable the learning of the loss function itself -- and perhaps there's too much circularity involved for it to even be possible -- but it would be pretty cool if one could do both by specifying only the desire to describe the conditional mean of the data, for example.

MilesCranmer commented 2 years ago

Technically, simulation-based inference sort of does this: you learn the likelihood such that your model optimally describes the joint distribution p(x, y) over your data in a very flexible way, rather than assuming a likelihood and optimizing p(y|x). You could use a normalizing flow to learn a data-driven likelihood, fit that likelihood with symbolic regression, and then use the resulting analytic likelihood as your loss function...?

Here's a nice example from some particle physics people who use PySR to find analytic versions of a learned compression in a simulation-based inference pipeline: https://arxiv.org/abs/2109.10414. I don't know if they apply PySR to the normalizing flow itself though; maybe just to the conditioning term.