Rewrite `Expression` class?

veni-vidi-vici-dormivi commented 2 days ago

I was thinking why we do not actually pass functions as arguments instead of a strings to the Expression class. I was thinking something like this:

import scipy
import numpy as np
from typing import Callable, Union, List, Tuple
import inspect
import scipy.stats

class Expression:
    def __init__(self, 
                 distribution: Union[scipy.stats.rv_continuous, scipy.stats.rv_discrete], 
                 parameters: List[Tuple[str, Callable]]):

        self.distribution = distribution
        self.parameters = parameters

    def sample(self, **kwargs):
        distribution_args = {}

        for param_name, func in self.parameters:
            # Get the function's signature to see what arguments it requires
            func_signature = inspect.signature(func)
            func_args = {k: v for k, v in kwargs.items() if k in func_signature.parameters}

            # Call the function with the matched arguments
            distribution_args[param_name] = func(**func_args)

        # Generate a sample using the distribution and the computed parameter values
        return self.distribution.rvs(**distribution_args)

# Example functions for loc and scale (and potentially other parameters)
def loc_func(c1: float, c2: float, x: float | np.ndarray):
    return c1 + c2 * x

def shape_func(c3: float):
    return c3

# Create parameter list as an iterable (list of tuples) for loc, scale, and others
param_list = [
    ('loc', loc_func),
    ('scale', shape_func)  # Example additional parameter
]

# Initialize Expression with scipy.stats.norm or any other distribution
myexpression = Expression(scipy.stats.norm, param_list)
x = np.array([0.5, 0.6, 0.7, 0.8, 0.9])

# Sample using the expression with arguments for loc, scale, and shape functions
sample = myexpression.sample(c1=1, c2=2, c3=1, x=x)
print(sample)

And then feeding the functions to the optimization routines like we would after parsing? This is not incredibly thought through yet but it might be a way around the parsing.

mathause commented 2 days ago

Thanks for getting this discussion started. For me it would be important to understand the requirements better & I try to do this by going over the code / writing tests. Some random thoughts and questions:

I had ideas in a similar direction & can imagine that passing functions could be faster and clearer. But we don't know the structure of the function.
We could get rid of the eval and exec calls which would make me happy :smile:
Do we need to know the structure of the functions?
Can it be that we want to have the same coeff for two params (e.g. loc=c1, scale=c1)?
How can we differentiate coeffs from covariates?
- Pass coeffs by position and covariates as kwargs?
Should coeffs be an array? That would align with the structure of the minimize function.
```
def loc_func(c, x):
   return c[0] + c[1] * x

def shape_func(c):
   return c[2]
```
Are discrete distributions actually an option?
Should we create a class for each distribution we support?
- I guess there are 'only' about 5-15 that are actually relevant.
- We could 'manually' pass the bounds https://github.com/MESMER-group/mesmer/blob/f9cd338d106b5975b41bc9ad235dbdd41248a374/mesmer/mesmer_x/train_utils_mesmerx.py#L180-L181
- Maybe this would help with setting the initial values? At least for some of the distributions.
Same comment for the covariates-parameter functions.

veni-vidi-vici-dormivi commented 1 day ago

A problem with going away from the expression as a string could be that it becomes harder to save the expression. I think it is important to save the expression in some way, only saving the parameters won't get you anywhere if you lost the expression, but we could handle that by still setting a name and asking the user to put all relevant information there. Then we could save the name as an attribute with the parameters.

Should coeffs be an array?

I like that idea, then we could get rid of the signature call and saving could also be more flexible.

Can it be that we want to have the same coeff for two params

That is possible for the structure I gave above, I'm just not sure how the optimization would handle that...

yquilcaille commented 1 day ago

I see the appeal of this solution, but I see there a major problem. In summary, it clashes with the objective of Expression, and future applications will be hindered by the need to provide all functions along.

The whole objective of the class Expression was to finally be capable of doing all distributions and all expressions on parameters. To have full flexibility.

With this alternative, the user will have to define its functions and pass them as arguments. We have to make sure that the same functions are used for training AND emulation. Thus pass the functions to the users. Thus have a script with all potential functions. Which is precisely what i wanted to avoid, being constrained to predefined functions. In my experience, there will always be new ideas, new expressions. So, every single time, we would have to add new functions to the script with all predefined functions. I sincerely fear that it will be messy to have them all.

So, we have to choose, do we pass directly the functions and lose the full flexibility, or do we stick to the strings, with the fix on the parsing? My personal opinion is for the full flexibility, because of how much time i've spent in this code, because of future developments & applications. But again, I see the appeal of your solution.

One extra point: the object returned by the alternative class is the .rvs object. It should be a scipy,.stats distribution, so that we can call anything. I've already to call logpdf, logpmf, cdf, sf, isf & support, and for plots, even more.

On the other points raised by @mathause:

exec & eval: yes, required with the original Expression, but unless i'm mistaken, they are used correctly here. So even if these functions are dangerous, it should be fine. Hopefully?
what do you mean by structure of the function?
Same coefficients in multiple parameters? Yes. I suspect that Lorenzo will need that for extreme precipitations given the expressions used in Extreme Event Attribution.
Yes, discrete distributions are already used for the number of days with extreme fire weather and for the length of the fire season.
Creating class for each distribution? I'm strongly against that. Again, the very purpose of this class is full flexibility. It will create codes with lots of redundancies, too long for nothing. And every new distribution will require additional developments.
Setting bounds on parameters, yes, but already done as much as i think we can. There are already some defaults for the scale. Nothing on the location, because not bounded. The shape is less evident, because it depends very much on each distribution. Then setting bounds using the data, not as trivial, because the mean & spread are expressions of all parameters that vary from an expression to another. This is why the first guess is so tricky to get. Then, setting bounds on coefficients isnt straightforward, i investigated that for global optimizations, I couldnt find a proper solution.

Of course, it requires some understanding of the string as input, with the parsing failing with extra commas (but potentially easy fix). It also requires

MESMER-group / mesmer

Rewrite `Expression` class? #527