brencej / ProGED

equation discovery based on generative models
BSD 3-Clause "New" or "Revised" License
15 stars 5 forks source link

Example for ODE systems #69

Open soerenab opened 1 year ago

soerenab commented 1 year ago

Hi, thanks again for open sourcing your repository. Could you provide a simple example on how to apply your model for discovery of ODE systems (e.g. 2 variables). I had a look at the lorenz.py example but I get the following error:

ImportError: cannot import name 'DE_fit' from 'ProGED.parameter_estimation'

(I am also guessing that the code has changed since the lorenz.py example was added.)

My fundamental question is the following: When I have a system of equations, do I need to run fit() multiple times, once per component of the system (as specified by target_variable_index). If so, how do I deal with the fact that the equation discoverer usually outputs many candidates (=sample_size), i.e., I would get multiple candidates per component. What's the best way to combine the candidate of different components, is it just the cartesian product?

Thanks a lot for you help!

brencej commented 1 year ago

Hi!

We have recently changed the way ProGED is set up and have not yet updated all the examples and documentation. Most importantly, a pandas dataframes is expected, with column names matching variable names. Furthermore, rhs_vars and lhs_vars expect lists with variable names. Time should be called either "t" or "time".

We also built in better support of systems of equations. Now, Model represents a system of equations, represented as a list of expressions. Parameter estimation always fits all the components in the system. The generator generates each component independently by default. You could change this behavior by manually calling ProGED.generate.generate_models with system_size=1 for each component, and then composing the system yourself, for example, through cartesian product. Which approach is better is unclear at the moment.

The most comprehensive source of examples currently is test/test_core.py. (Note that at the moment, some tests will throw assertion errors for "wrong" results, but in fact, the code is running fine). Since there is no test for 2D ODEs with generation and estimation, I wrote an example below.

import numpy as np
import pandas as pd
import ProGED as pg

np.random.seed(2)
# Generate and prepare data
t = np.arange(0, 1, 0.1)
x = 3*np.exp(-2*t)
y = 5.1*np.exp(-1*t)
data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])

# Set up equation discovery
ED = pg.EqDisco(data = data,
                task_type="differential", #either algebraic or differential
                lhs_vars = ["x", "y"], # the variables on the left-hand side, the number should match system_size
                rhs_vars = ["t", "x", "y"], # the variables that can appear on the right-hand side
                sample_size = 3, # how many random candidate systems to generate
                system_size = 2, # how many equations in the system of equations
                generator_template_name = "polynomial", # the grammar to use; you can write your own
                strategy_settings = {"max_repeat": 100}) # gives more attempts to generator, ensuring we get 3

print("Generating models...")
ED.generate_models()
print(ED.models)

print("Estimating parameters...")
ED.fit_models()

print("The best discovered model:")
print(ED.get_results())

I hope this answers your questions. If you have any further issues, please do not hesitate to ask!

soerenab commented 1 year ago

Thanks for your quick answer! In case anyone else stumbles upon this, add these imports, then the script runs:

import numpy as np
import pandas as pd
from ProGED.equation_discoverer import EqDisco

(you will need to change pg.EqDisco to just EqDisco).

brencej commented 1 year ago

Thanks for catching the lack of imports! For the sake of completness, I have edited my answer to include them.

soerenab commented 1 year ago

I have a follow up question / suggestions:

I slightly modified your script (increase sample size, different random seed, other data)

import numpy as np
import pandas as pd
from ProGED.equation_discoverer import EqDisco

np.random.seed(123)
# Generate and prepare data
t = np.linspace(1, 2, 128)
x = 0.1*np.exp(0.1*t)
y = 0.2*np.exp(0.2*t)
data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])

# Set up equation discovery
ED = EqDisco(
    data = data,
    task_type="differential", #either algebraic or differential
    lhs_vars = ["x", "y"], # the variables on the left-hand side, the number should match system_size
    rhs_vars = ["t", "x", "y"], # the variables that can appear on the right-hand side
    sample_size = 5, # how many random candidate systems to generate
    system_size = 2, # how many equations in the system of equations
    generator_template_name = "polynomial", # the grammar to use; you can write your own
    strategy_settings = {"max_repeat": 100}
) # gives more attempts to generator, ensuring we get 3

print("Generating models...")
ED.generate_models()
print(ED.models)

print("Estimating parameters...")
ED.fit_models()

print("The best discovered model:")
print(ED.get_results())

And this now raises and error:

FloatingPointError: overflow encountered in square

If you reduce the sample_size to 4, the error disappears - therefore I guess that if parameter estimation in the second stage fails for even a single candidate, all candidate fail / no valid candidate is returned. It would probably be best to wrap estimation per candidate in try...except to still be able to return other potentially valid candidates.

soerenab commented 1 year ago

And in some cases (but I unfortunately have no way to reproduce this right now), I also get the following error:

ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.

Here is the error trace, starting from the script's output:

Generating models...
[C0*x_1, C1*x_0 + C2]
[C0*exp(C1*t**2*x_0*x_1), C2*t]
Estimating parameters...
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/equation_discoverer.py", line 202, in fit_models
    self.models = fit_models(self.models, self.task.data,
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 43, in fit_models
    fitted = list(pool_map(estimator.fit_one, models.values()))
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 131, in fit_one
    result = optimizer(self, model)
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 278, in local_fit
    output = sp_minimize(f, np.random.uniform(bounds[0], bounds[1], size=len(model.params)))
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_minimize.py", line 691, in minimize
    res = _minimize_bfgs(fun, x0, args, jac, callback, **options)
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_optimize.py", line 1362, in _minimize_bfgs
    sf = _prepare_scalar_function(fun, x0, jac, args=args, epsilon=eps,
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_optimize.py", line 332, in _prepare_scalar_function
    sf = ScalarFunction(fun, x0, args, grad, hess,
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 158, in __init__
    self._update_fun()
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 251, in _update_fun
    self._update_fun_impl()
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 155, in update_fun
    self.f = fun_wrapped(self.x)
  File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in fun_wrapped
    fx = fun(np.copy(x), *args)
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 276, in f
    return estimator.settings["parameter_estimation"]["objective_function"](params, model, estimator)
  File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 303, in objective_algebraic
    raise ValueError("Left and right hand side variables are the same. This should not happen in algebraic models.")
ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.
soerenab commented 1 year ago

Actually, the above error can be reproduced with minimal modifications to your script - just remove the y-variable and adjust the system_size

import numpy as np
import pandas as pd
import ProGED as pg

np.random.seed(2)
# Generate and prepare data
t = np.arange(0, 1, 0.1)
x = 3*np.exp(-2*t)
# y = 5.1*np.exp(-1*t)
# data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])
data = pd.DataFrame(np.vstack((t, x)).T, columns=['t', 'x'])

# Set up equation discovery
ED = pg.EqDisco(data = data,
                task_type="differential", #either algebraic or differential
                lhs_vars = ["x"], # the variables on the left-hand side, the number should match system_size
                rhs_vars = ["t", "x"], # the variables that can appear on the right-hand side
                sample_size = 3, # how many random candidate systems to generate
                system_size = 1, # how many equations in the system of equations
                generator_template_name = "polynomial", # the grammar to use; you can write your own
                strategy_settings = {"max_repeat": 100}) # gives more attempts to generator, ensuring we get 3

print("Generating models...")
ED.generate_models()
print(ED.models)

print("Estimating parameters...")
ED.fit_models()

print("The best discovered model:")
print(ED.get_results())

This raises

ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.
brencej commented 1 year ago

Hi!

Regarding your first issue (FloatingPointError: overflow encountered in square), this is strange. The program already does the wrapping you suggest, and should be outputting only warnings for stuff like overflows or root of a negative number. When I try your code, I get a few warnings, but the code completes. Perhaps this could be due to a different package versions (numpy, sympy, etc.)?

The second issue turned out to be a bug, related to our recent introduction of config files - the program was not registring your setting of task_type="differential", but kept the default "algebraic". The config files require more fixing, but this specific issue should be resolved now. In my tests, the code works for all combinations of x,y on the LHS and x,y,t on the RHS.

Thanks for helping us improve the program!

soerenab commented 1 year ago

Thanks for your help!