MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0
23 stars 17 forks source link

Rewrite `Expression` class? #527

Open veni-vidi-vici-dormivi opened 2 days ago

veni-vidi-vici-dormivi commented 2 days ago

I was thinking why we do not actually pass functions as arguments instead of a strings to the Expression class. I was thinking something like this:

import scipy
import numpy as np
from typing import Callable, Union, List, Tuple
import inspect
import scipy.stats

class Expression:
    def __init__(self, 
                 distribution: Union[scipy.stats.rv_continuous, scipy.stats.rv_discrete], 
                 parameters: List[Tuple[str, Callable]]):

        self.distribution = distribution
        self.parameters = parameters

    def sample(self, **kwargs):
        distribution_args = {}

        for param_name, func in self.parameters:
            # Get the function's signature to see what arguments it requires
            func_signature = inspect.signature(func)
            func_args = {k: v for k, v in kwargs.items() if k in func_signature.parameters}

            # Call the function with the matched arguments
            distribution_args[param_name] = func(**func_args)

        # Generate a sample using the distribution and the computed parameter values
        return self.distribution.rvs(**distribution_args)

# Example functions for loc and scale (and potentially other parameters)
def loc_func(c1: float, c2: float, x: float | np.ndarray):
    return c1 + c2 * x

def shape_func(c3: float):
    return c3

# Create parameter list as an iterable (list of tuples) for loc, scale, and others
param_list = [
    ('loc', loc_func),
    ('scale', shape_func)  # Example additional parameter
]

# Initialize Expression with scipy.stats.norm or any other distribution
myexpression = Expression(scipy.stats.norm, param_list)
x = np.array([0.5, 0.6, 0.7, 0.8, 0.9])

# Sample using the expression with arguments for loc, scale, and shape functions
sample = myexpression.sample(c1=1, c2=2, c3=1, x=x)
print(sample)

And then feeding the functions to the optimization routines like we would after parsing? This is not incredibly thought through yet but it might be a way around the parsing.

mathause commented 2 days ago

Thanks for getting this discussion started. For me it would be important to understand the requirements better & I try to do this by going over the code / writing tests. Some random thoughts and questions:

veni-vidi-vici-dormivi commented 1 day ago

A problem with going away from the expression as a string could be that it becomes harder to save the expression. I think it is important to save the expression in some way, only saving the parameters won't get you anywhere if you lost the expression, but we could handle that by still setting a name and asking the user to put all relevant information there. Then we could save the name as an attribute with the parameters.

Should coeffs be an array?

I like that idea, then we could get rid of the signature call and saving could also be more flexible.

Can it be that we want to have the same coeff for two params

That is possible for the structure I gave above, I'm just not sure how the optimization would handle that...

yquilcaille commented 1 day ago

I see the appeal of this solution, but I see there a major problem. In summary, it clashes with the objective of Expression, and future applications will be hindered by the need to provide all functions along.

The whole objective of the class Expression was to finally be capable of doing all distributions and all expressions on parameters. To have full flexibility.

With this alternative, the user will have to define its functions and pass them as arguments. We have to make sure that the same functions are used for training AND emulation. Thus pass the functions to the users. Thus have a script with all potential functions. Which is precisely what i wanted to avoid, being constrained to predefined functions. In my experience, there will always be new ideas, new expressions. So, every single time, we would have to add new functions to the script with all predefined functions. I sincerely fear that it will be messy to have them all.

So, we have to choose, do we pass directly the functions and lose the full flexibility, or do we stick to the strings, with the fix on the parsing? My personal opinion is for the full flexibility, because of how much time i've spent in this code, because of future developments & applications. But again, I see the appeal of your solution.

One extra point: the object returned by the alternative class is the .rvs object. It should be a scipy,.stats distribution, so that we can call anything. I've already to call logpdf, logpmf, cdf, sf, isf & support, and for plots, even more.

On the other points raised by @mathause:

Of course, it requires some understanding of the string as input, with the parsing failing with extra commas (but potentially easy fix). It also requires