**Open** · Jacob-Stevens-Haas opened 1 year ago
@MalachiteWind, FYSA. I'm going to show plots and mention this to you (since I added `predict()` to our experiments), so I wanted to drop you a reference to this issue.

Current thinking: if `score()` is to be used for model selection, we want all bad models to eventually perform worse than good models, given enough data. The fact that the correct model with the ideal differentiation method will report a really bad score on noisy data isn't a problem as long as a worse method performs even worse. But does it? I don't think so. The correct coefficients on a library can produce far worse predictions when calculated on noisy data than a bad model that always predicts zero. Consider the case of $\ddot x = -x^9$ with $x(0) = 0, \dot x(0) = 1$, which looks close to a sawtooth wave. The true $\dot x \in (-1, 1)$, and $\ddot x \in (-4, 4)$. However, the ninth-power term amplifies the noise so much that the true $\ddot x$ is closer to zero than to the predictions.
```python
import matplotlib.pyplot as plt
import numpy as np
from derivative import dxdt  # pysindy's differentiation backend
from scipy.integrate import solve_ivp

t = np.linspace(0, 10, 1000)
rhs = lambda t, x: [x[1], -x[0] ** 9]
y = solve_ivp(rhs, t_span=(0, 10), t_eval=t, y0=[0, 1]).y[0, :]
y_dot = dxdt(y, t, kind="finite_difference", k=1)
y_ddot = dxdt(y_dot, t, kind="finite_difference", k=1)
y_noise = np.random.normal(loc=y, scale=0.1)
y_hat_ddot = -y_noise ** 9

plt.plot(t, y_noise, ".", label="noisy $x$")
plt.plot(t, y_hat_ddot, ".", label=r"predicted $\ddot x$")
plt.plot(t, y, label="$x$")
plt.plot(t, y_ddot, label=r"$\ddot x$")
plt.legend()

print(np.linalg.norm(y_ddot - y_hat_ddot) / np.linalg.norm(y_ddot))
```
This shows that the relative error is greater than 1: the true model scores worse than a model that predicts all zeros, whose relative error is exactly 1.
The opposite phenomenon occurs if we smooth before `predict()`: in that case, a `SINDy` object that over-smooths the data and rejects all terms from the ODE would return a near-perfect score.
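A minimal sketch of that failure mode, with `scipy.signal.savgol_filter` standing in for whatever smoother the differentiation method applies (the window length here is a deliberately extreme, illustrative choice): a window spanning nearly the whole trajectory flattens the dynamics, so the smoothed derivative is close to zero and a model that predicts all zeros matches it almost perfectly.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.signal import savgol_filter

t = np.linspace(0, 10, 1000)
sol = solve_ivp(lambda t, x: [x[1], -x[0] ** 9], (0, 10), [0, 1], t_eval=t)
y, y_dot_true = sol.y  # exact x and x-dot from the integrator
y_noise = y + np.random.default_rng(0).normal(scale=0.1, size=y.shape)

# Deliberately over-smoothed derivative: the window covers the whole signal
y_dot_smooth = savgol_filter(
    y_noise, window_length=999, polyorder=2, deriv=1, delta=t[1] - t[0]
)

# The over-smoothed target is far smaller than the true derivative, so a
# zero-model's residual against it is correspondingly tiny
print(np.linalg.norm(y_dot_smooth), np.linalg.norm(y_dot_true))
```

The zero-model's squared error against `y_dot_smooth` is just `norm(y_dot_smooth)**2`, which shrinks as the smoothing gets heavier, even though that model has learned nothing.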
What about smoothing the data before predicting, but using `FiniteDifference` to evaluate predictions? So long as the FD noise is mean-zero in the long run, the best model should win out.
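One numeric caveat on that idea, sketched with `np.gradient` standing in for `FiniteDifference` (an illustrative substitution, not the pysindy class): finite differences on the raw noisy data are unbiased but extremely noisy, since each differencing step scales the noise by roughly $1/\Delta t$, so the comparison would need a lot of data before the best model reliably wins.

```python
import numpy as np
from scipy.integrate import solve_ivp

t = np.linspace(0, 10, 1000)
y = solve_ivp(lambda t, x: [x[1], -x[0] ** 9], (0, 10), [0, 1], t_eval=t).y[0]
y_ddot_true = -y ** 9  # exact second derivative from the ODE
y_noise = y + np.random.default_rng(0).normal(scale=0.1, size=y.shape)

# Second derivative by plain finite differences on the noisy data:
# each np.gradient call amplifies the noise by roughly 1/dt (dt = 0.01 here)
y_ddot_fd = np.gradient(np.gradient(y_noise, t), t)
print(np.linalg.norm(y_ddot_fd) / np.linalg.norm(y_ddot_true))
```

The FD target's norm is orders of magnitude larger than the true $\ddot x$'s, so while the error it introduces is mean-zero, its variance dominates any score until $n$ is very large.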
I believe it is always safe to evaluate the model on the true `x_dot` and `x`.
**Problem?**

`score()` compares the calculated derivative to the derivative estimated by `predict()`. It currently runs `predict()` before calculating the derivatives. Since pysindy now also has the capability to smooth data as it calculates derivatives, this means that even a trained `SINDy` object with the correct model and the ideal differentiation method would report a really bad score on noisy data. Is that desired?

**Describe the solution you'd like**
Smooth the data before running `predict()` in `score()`.
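A rough sketch of the effect of that change, in plain numpy/scipy rather than pysindy internals (the smoother and its window length are illustrative assumptions, not the library's API): pre-smoothing tames the ninth-power noise amplification from the example above, so the correct model's relative error falls back below the zero-model's value of 1.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.signal import savgol_filter

t = np.linspace(0, 10, 1000)
y = solve_ivp(lambda t, x: [x[1], -x[0] ** 9], (0, 10), [0, 1], t_eval=t).y[0]
y_ddot_true = -y ** 9  # exact second derivative from the ODE
y_noise = y + np.random.default_rng(0).normal(scale=0.1, size=y.shape)

# Smooth first, then apply the (correct) model, as score() would do under
# the proposed change
y_smooth = savgol_filter(y_noise, window_length=51, polyorder=3)
y_hat_ddot = -y_smooth ** 9

rel_err = np.linalg.norm(y_ddot_true - y_hat_ddot) / np.linalg.norm(y_ddot_true)
print(rel_err)  # below 1: the correct model now beats the zero-model
```

The exact threshold depends on the smoother's window relative to the noise level, but the qualitative point stands: the comparison no longer penalizes the correct model for noise the polynomial library amplified.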
**Describe alternatives you've considered**
Smoothing the data in `predict()`? This is less a question of "how do we fix this?" and more a question of "what are the most useful semantics of `predict()` and `score()`, what does the principle of least surprise say, and does anyone see a footgun if we change this?"