Open soerenab opened 1 year ago
Hi!
We have recently changed the way ProGED is set up and have not yet updated all the examples and documentation. Most importantly, a pandas dataframes is expected, with column names matching variable names. Furthermore, rhs_vars and lhs_vars expect lists with variable names. Time should be called either "t" or "time".
We also built in better support of systems of equations. Now, Model represents a system of equations, represented as a list of expressions. Parameter estimation always fits all the components in the system. The generator generates each component independently by default. You could change this behavior by manually calling ProGED.generate.generate_models with system_size=1 for each component, and then composing the system yourself, for example, through cartesian product. Which approach is better is unclear at the moment.
The most comprehensive source of examples currently is test/test_core.py. (Note that at the moment, some tests will throw assertion errors for "wrong" results, but in fact, the code is running fine). Since there is no test for 2D ODEs with generation and estimation, I wrote an example below.
import numpy as np
import pandas as pd
import ProGED as pg
np.random.seed(2)
# Generate and prepare data
t = np.arange(0, 1, 0.1)
x = 3*np.exp(-2*t)
y = 5.1*np.exp(-1*t)
data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])
# Set up equation discovery
ED = pg.EqDisco(data = data,
task_type="differential", #either algebraic or differential
lhs_vars = ["x", "y"], # the variables on the left-hand side, the number should match system_size
rhs_vars = ["t", "x", "y"], # the variables that can appear on the right-hand side
sample_size = 3, # how many random candidate systems to generate
system_size = 2, # how many equations in the system of equations
generator_template_name = "polynomial", # the grammar to use; you can write your own
strategy_settings = {"max_repeat": 100}) # gives more attempts to generator, ensuring we get 3
print("Generating models...")
ED.generate_models()
print(ED.models)
print("Estimating parameters...")
ED.fit_models()
print("The best discovered model:")
print(ED.get_results())
I hope this answers your questions. If you have any further issues, please do not hesitate to ask!
Thanks for your quick answer! In case anyone else stumbles upon this, add these imports, then the script runs:
import numpy as np
import pandas as pd
from ProGED.equation_discoverer import EqDisco
(you will need to change pg.EqDisco
to just EqDisco
).
Thanks for catching the lack of imports! For the sake of completness, I have edited my answer to include them.
I have a follow up question / suggestions:
I slightly modified your script (increase sample size, different random seed, other data)
import numpy as np
import pandas as pd
from ProGED.equation_discoverer import EqDisco
np.random.seed(123)
# Generate and prepare data
t = np.linspace(1, 2, 128)
x = 0.1*np.exp(0.1*t)
y = 0.2*np.exp(0.2*t)
data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])
# Set up equation discovery
ED = EqDisco(
data = data,
task_type="differential", #either algebraic or differential
lhs_vars = ["x", "y"], # the variables on the left-hand side, the number should match system_size
rhs_vars = ["t", "x", "y"], # the variables that can appear on the right-hand side
sample_size = 5, # how many random candidate systems to generate
system_size = 2, # how many equations in the system of equations
generator_template_name = "polynomial", # the grammar to use; you can write your own
strategy_settings = {"max_repeat": 100}
) # gives more attempts to generator, ensuring we get 3
print("Generating models...")
ED.generate_models()
print(ED.models)
print("Estimating parameters...")
ED.fit_models()
print("The best discovered model:")
print(ED.get_results())
And this now raises and error:
FloatingPointError: overflow encountered in square
If you reduce the sample_size
to 4, the error disappears - therefore I guess that if parameter estimation in the second stage fails for even a single candidate, all candidate fail / no valid candidate is returned. It would probably be best to wrap estimation per candidate in try...except to still be able to return other potentially valid candidates.
And in some cases (but I unfortunately have no way to reproduce this right now), I also get the following error:
ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.
Here is the error trace, starting from the script's output:
Generating models...
[C0*x_1, C1*x_0 + C2]
[C0*exp(C1*t**2*x_0*x_1), C2*t]
Estimating parameters...
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/equation_discoverer.py", line 202, in fit_models
self.models = fit_models(self.models, self.task.data,
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 43, in fit_models
fitted = list(pool_map(estimator.fit_one, models.values()))
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 131, in fit_one
result = optimizer(self, model)
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 278, in local_fit
output = sp_minimize(f, np.random.uniform(bounds[0], bounds[1], size=len(model.params)))
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_minimize.py", line 691, in minimize
res = _minimize_bfgs(fun, x0, args, jac, callback, **options)
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_optimize.py", line 1362, in _minimize_bfgs
sf = _prepare_scalar_function(fun, x0, jac, args=args, epsilon=eps,
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_optimize.py", line 332, in _prepare_scalar_function
sf = ScalarFunction(fun, x0, args, grad, hess,
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 158, in __init__
self._update_fun()
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 251, in _update_fun
self._update_fun_impl()
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 155, in update_fun
self.f = fun_wrapped(self.x)
File "/p/project/hai_microbio/sb/miniconda3/envs/symbolicregression39/lib/python3.9/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in fun_wrapped
fx = fun(np.copy(x), *args)
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 276, in f
return estimator.settings["parameter_estimation"]["objective_function"](params, model, estimator)
File "/p/project/hai_microbio/sb/repos/updated/ProGED/ProGED/parameter_estimation.py", line 303, in objective_algebraic
raise ValueError("Left and right hand side variables are the same. This should not happen in algebraic models.")
ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.
Actually, the above error can be reproduced with minimal modifications to your script - just remove the y-variable and adjust the system_size
import numpy as np
import pandas as pd
import ProGED as pg
np.random.seed(2)
# Generate and prepare data
t = np.arange(0, 1, 0.1)
x = 3*np.exp(-2*t)
# y = 5.1*np.exp(-1*t)
# data = pd.DataFrame(np.vstack((t, x, y)).T, columns=['t', 'x', 'y'])
data = pd.DataFrame(np.vstack((t, x)).T, columns=['t', 'x'])
# Set up equation discovery
ED = pg.EqDisco(data = data,
task_type="differential", #either algebraic or differential
lhs_vars = ["x"], # the variables on the left-hand side, the number should match system_size
rhs_vars = ["t", "x"], # the variables that can appear on the right-hand side
sample_size = 3, # how many random candidate systems to generate
system_size = 1, # how many equations in the system of equations
generator_template_name = "polynomial", # the grammar to use; you can write your own
strategy_settings = {"max_repeat": 100}) # gives more attempts to generator, ensuring we get 3
print("Generating models...")
ED.generate_models()
print(ED.models)
print("Estimating parameters...")
ED.fit_models()
print("The best discovered model:")
print(ED.get_results())
This raises
ValueError: Left and right hand side variables are the same. This should not happen in algebraic models.
Hi!
Regarding your first issue (FloatingPointError: overflow encountered in square), this is strange. The program already does the wrapping you suggest, and should be outputting only warnings for stuff like overflows or root of a negative number. When I try your code, I get a few warnings, but the code completes. Perhaps this could be due to a different package versions (numpy, sympy, etc.)?
The second issue turned out to be a bug, related to our recent introduction of config files - the program was not registring your setting of task_type="differential", but kept the default "algebraic". The config files require more fixing, but this specific issue should be resolved now. In my tests, the code works for all combinations of x,y on the LHS and x,y,t on the RHS.
Thanks for helping us improve the program!
On the first issue: I see - in fact, I am not getting the error any longer (but unfortunately don't know what exactly I changed, perhaps you are right and it was some versioning problem.
On the second issue: Thanks for fixing the problem - I re-installed the most recent version and now it seems to work.
Thanks for your help!
Hi, thanks again for open sourcing your repository. Could you provide a simple example on how to apply your model for discovery of ODE systems (e.g. 2 variables). I had a look at the lorenz.py example but I get the following error:
(I am also guessing that the code has changed since the lorenz.py example was added.)
My fundamental question is the following: When I have a system of equations, do I need to run
fit()
multiple times, once per component of the system (as specified bytarget_variable_index
). If so, how do I deal with the fact that the equation discoverer usually outputs many candidates (=sample_size
), i.e., I would get multiple candidates per component. What's the best way to combine the candidate of different components, is it just the cartesian product?Thanks a lot for you help!