dynamicslab / pysindy

A package for the sparse identification of nonlinear dynamical systems from data
https://pysindy.readthedocs.io/en/latest/

[BUG] Cannot provide controls u for cross validation with scikit-learn #89

Open · juliuskittler opened this issue 4 years ago

juliuskittler commented 4 years ago

When you provide a control input u while fitting SINDy with scikit-learn cross-validation, the following error is raised:

TypeError: Model was fit using control variables, so u is required

It seems this error originates in the predict and/or score methods: https://pysindy.readthedocs.io/en/latest/_modules/pysindy/pysindy.html

Possibly, the control input u, which is provided to the fit function for cross-validation, is not passed on to these other functions (predict and/or score) during cross-validation?

Reproducing code example:

This code example is taken from here: https://pysindy.readthedocs.io/en/latest/examples/4_scikit_learn_compatibility.html#cross-validation

However, I have added one constant control variable u to show that the code fails when u is passed as an argument: search.fit(x_train, u = u_train). I have marked the relevant lines with the comment "RELEVANT ROW".

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

import numpy as np
from scipy.integrate import odeint
import pysindy as ps

def lorenz(z, t):
    return [
        10 * (z[1] - z[0]),
        z[0] * (28 - z[2]) - z[1],
        z[0] * z[1] - (8 / 3) * z[2]
    ]

# Generate training data
dt = .002

t_train = np.arange(0, 10, dt)
x0_train = [-8, 8, 27]
x_train = odeint(lorenz, x0_train, t_train)

# Make up control input
u_train = np.ones(x_train.shape[0]) # RELEVANT ROW

# Fit model
model = ps.SINDy(t_default=dt)

param_grid = {
    "optimizer__threshold": [0.001, 0.01, 0.1],
    "optimizer__alpha": [0.01, 0.05, 0.1],
    "feature_library": [ps.PolynomialLibrary(), ps.FourierLibrary()],
    "differentiation_method__order": [1, 2]
}

search = GridSearchCV(
    model,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5)
)
search.fit(x_train, u = u_train) # RELEVANT ROW

print("Best parameters:", search.best_params_)
search.best_estimator_.print()

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-76-6c5d752be13a> in <module>
     38     cv=TimeSeriesSplit(n_splits=5)
     39 )
---> 40 search.fit(x_train, u = u_train) # RELEVANT ROW
     41 
     42 print("Best parameters:", search.best_params_)

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    734                 return results
    735 
--> 736             self._run_search(evaluate_candidates)
    737 
    738         # For multi-metric evaluation, store the best_index_, best_params_ and

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1186     def _run_search(self, evaluate_candidates):
   1187         """Search all candidates in param_grid"""
-> 1188         evaluate_candidates(ParameterGrid(self.param_grid))
   1189 
   1190 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
    706                               n_splits, n_candidates, n_candidates * n_splits))
    707 
--> 708                 out = parallel(delayed(_fit_and_score)(clone(base_estimator),
    709                                                        X, y,
    710                                                        train=train, test=test,

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1027             # remaining jobs.
   1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
   1030                 self._iterating = self._original_iterator is not None
   1031 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    845                 return False
    846             else:
--> 847                 self._dispatch(tasks)
    848                 return True
    849 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    763         with self._lock:
    764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
    766             # A job can complete so quickly than its callback is
    767             # called before we get here, causing self._jobs to

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    250         # change the default number of processes to -1
    251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 252             return [func(*args, **kwargs)
    253                     for func, args, kwargs in self.items]
    254 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    558     else:
    559         fit_time = time.time() - start_time
--> 560         test_scores = _score(estimator, X_test, y_test, scorer)
    561         score_time = time.time() - start_time - fit_time
    562         if return_train_score:

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer)
    603         scorer = _MultimetricScorer(**scorer)
    604     if y_test is None:
--> 605         scores = scorer(estimator, X_test)
    606     else:
    607         scores = scorer(estimator, X_test, y_test)

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, *args, **kwargs)
     88                                       *args, **kwargs)
     89             else:
---> 90                 score = scorer(estimator, *args, **kwargs)
     91             scores[name] = score
     92         return scores

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    370 def _passthrough_scorer(estimator, *args, **kwargs):
    371     """Function that wraps estimator.score"""
--> 372     return estimator.score(*args, **kwargs)
    373 
    374 

~/.cache/pypoetry/virtualenvs/powerup-process-control-VY7sLJ9r-py3.8/lib/python3.8/site-packages/pysindy/pysindy.py in score(self, x, t, x_dot, u, multiple_trajectories, metric, **metric_kws)
    461         if u is None or self.n_control_features_ == 0:
    462             if self.n_control_features_ > 0:
--> 463                 raise TypeError(
    464                     "Model was fit using control variables, so u is required"
    465                 )

TypeError: Model was fit using control variables, so u is required

PySINDy/Python version information:

'1.2.0'

briandesilva commented 4 years ago

Thanks for the detailed bug report. It makes it much easier for us to pin down the issue.

Possibly, the controls u, which are provided to the fit function for cross validation, are not passed on to these other functions (predict and/or score) during cross validation?

You're exactly right. I will have to investigate potential solutions.

slimeth commented 3 years ago

@juliuskittler @briandesilva Did you find a solution? I'm facing the same issue.

juliuskittler commented 3 years ago

@slimeth I don't think there is any solution yet for Grid Search with CV in scikit-learn. I switched to using Bayesian Optimization for hyperparameter tuning: https://optuna.org

briandesilva commented 3 years ago

I have been slow about implementing a solution, partly because it would require nontrivial changes to the API. I'll prioritize getting out a fix, but it may live in a separate branch for the time being.

briandesilva commented 3 years ago

@slimeth I don't think there is any solution yet for Grid Search with CV in scikit-learn. I switched to using Bayesian Optimization for hyperparameter tuning: https://optuna.org

I'd like to second the suggestion to use Bayesian optimization. It's a much more efficient approach than grid or random search. Thanks, @juliuskittler, for suggesting it!

slimeth commented 3 years ago

@juliuskittler Optuna works like a charm! Thank you for this suggestion!

alokwarey commented 3 years ago

@juliuskittler @slimeth: Is it possible to share a code snippet showing how to use Optuna for SINDy hyperparameter tuning with control variables? I have the same issue.
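A minimal sketch of what this could look like, reusing x_train, u_train, and dt from the reproducing example above (the holdout split and the search ranges are illustrative, not a recommendation). Because fit() and score() are called directly inside the objective, u can be passed to both, which sidesteps the GridSearchCV problem entirely:

import optuna
import pysindy as ps

# Hold out the last 20% of the trajectory for validation
n_fit = int(0.8 * x_train.shape[0])
x_fit, x_val = x_train[:n_fit], x_train[n_fit:]
u_fit, u_val = u_train[:n_fit], u_train[n_fit:]

def objective(trial):
    # Sample hyperparameters for the STLSQ optimizer
    threshold = trial.suggest_float("threshold", 1e-3, 1e-1, log=True)
    alpha = trial.suggest_float("alpha", 1e-2, 1e-1, log=True)
    model = ps.SINDy(optimizer=ps.STLSQ(threshold=threshold, alpha=alpha))
    model.fit(x_fit, t=dt, u=u_fit)
    # Calling score() directly means u can be supplied (default metric is R^2)
    return model.score(x_val, t=dt, u=u_val)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best parameters:", study.best_params)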

Jacob-Stevens-Haas commented 2 years ago

I believe this relates to SINDy() not implementing the full Estimator interface. It is not made explicit in the documentation, but score() cannot accept any arguments besides X and y. GridSearchCV.fit bundles up the extra fit_params and passes them to _fit_and_score(), but the latter forwards them only to fit(), not to score().
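One workaround that follows from this observation: since GridSearchCV only ever hands X to score(), the control input can be smuggled into X as an extra column and split back out inside a thin wrapper estimator. A hypothetical sketch (SINDyWithControl is not part of pysindy, and it assumes a single control column appended as the last column of X):

import numpy as np
import pysindy as ps
from sklearn.base import BaseEstimator

class SINDyWithControl(BaseEstimator):
    """Wraps ps.SINDy, treating the last column of X as the control u."""

    def __init__(self, threshold=0.1, alpha=0.05, t_default=0.002):
        self.threshold = threshold
        self.alpha = alpha
        self.t_default = t_default

    def fit(self, X, y=None):
        x, u = X[:, :-1], X[:, -1]
        self.model_ = ps.SINDy(
            optimizer=ps.STLSQ(threshold=self.threshold, alpha=self.alpha),
            t_default=self.t_default,
        )
        self.model_.fit(x, u=u)
        return self

    def score(self, X, y=None):
        x, u = X[:, :-1], X[:, -1]
        return self.model_.score(x, u=u)

# Usage: stack u onto x before handing the data to GridSearchCV, e.g.
# search = GridSearchCV(SINDyWithControl(), {"threshold": [0.01, 0.1]},
#                       cv=TimeSeriesSplit(n_splits=5))
# search.fit(np.column_stack([x_train, u_train]))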

Jacob-Stevens-Haas commented 1 month ago

The design consideration here is that meta-estimators (e.g. Pipeline, GridSearchCV) do not have an effective way to pass anything other than X and y to their constituent estimators. This problem was discussed most extensively in the context of models that need to know sample_weight in both fitting and scoring within a grid search (issue).

The solution is probably metadata routing, which was added to scikit-learn after this issue was created. It allows passing information other than X and y to the steps of meta-estimators. I don't fully understand how to implement it yet, but it touches a number of other issues that deal with pysindy's extra data-dependent parameters (e.g. constraints, controls, t). See SLEP006 for more info.
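For reference, this is roughly the shape metadata routing takes in recent scikit-learn (1.4+), sketched here with sample_weight on a stock Ridge estimator; pysindy would need analogous set_fit_request/set_score_request support for u, which is hypothetical at this point:

import numpy as np
import sklearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.0, 2.0, 3.0])
w = np.ones(20)

# Declare that sample_weight should be routed to both fit() and score()
est = Ridge().set_fit_request(sample_weight=True)
est = est.set_score_request(sample_weight=True)

search = GridSearchCV(est, {"alpha": [0.1, 1.0]})
search.fit(X, y, sample_weight=w)  # routed to each fold's fit() and score()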