Nixtla / statsforecast

Lightning ⚡️ fast forecasting with statistical and econometric models.
https://nixtlaverse.nixtla.io/statsforecast
Apache License 2.0
3.84k stars 267 forks source link

Wrong shape encountered during cross validation #602

Closed yuhuishi-convect closed 8 months ago

yuhuishi-convect commented 1 year ago

What happened + What you expected to happen

Encounter wrong shape when running cross validation using AutoARIMA

In [24]: sf.cross_validation(100, df, fitted=True)
<__array_function__ internals>:200: RuntimeWarning: invalid value encountered in cast
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-cd16f7527591> in <module>
----> 1 sf.cross_validation(100, df, fitted=True)

~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py in cross_validation(self, h, df, n_windows, step_size, test_size, input_size, level, fitted, refit, sort_df)
   1515     ):
   1516         if self._is_native(df=df):
-> 1517             return super().cross_validation(
   1518                 h=h,
   1519                 df=df,

~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py in cross_validation(self, h, df, n_windows, step_size, test_size, input_size, level, fitted, refit, sort_df)
    850         _, level = self._parse_X_level(h=h, X=None, level=level)
    851         if self.n_jobs == 1:
--> 852             res_fcsts = self.ga.cross_validation(
    853                 models=self.models,
    854                 h=h,

~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py in cross_validation(self, models, h, test_size, fallback_model, step_size, input_size, fitted, level, refit, verbose)
    292                     else None
    293                 )
--> 294                 out[i_ts, i_window, :, 0] = y_test[:, 0] if y.ndim == 2 else y_test
    295                 if fitted:
    296                     fitted_vals[self.indptr[i_ts] : self.indptr[i_ts + 1], i_window, 0][

ValueError: could not broadcast input array from shape (41,) into shape (100,)

Versions / Dependencies

Package                Version
---------------------- ---------
adagio                 0.2.4
ansi2html              1.8.0
antlr4-python3-runtime 4.11.1
appdirs                1.4.4
asttokens              2.2.1
backcall               0.2.0
certifi                2023.7.22
charset-normalizer     3.2.0
click                  8.1.6
contourpy              1.1.0
cycler                 0.11.0
dash                   2.11.1
dash-core-components   2.0.0
dash-html-components   2.0.0
dash-table             5.0.0
decorator              5.1.1
executing              1.2.0
Flask                  2.2.5
fonttools              4.42.0
fs                     2.4.16
fsspec                 2023.6.0
fugue                  0.8.6
fugue-sql-antlr        0.1.6
idna                   3.4
importlib-metadata     6.8.0
importlib-resources    6.0.0
ipython                8.12.2
itsdangerous           2.1.2
jedi                   0.19.0
Jinja2                 3.1.2
kiwisolver             1.4.4
llvmlite               0.40.1
MarkupSafe             2.1.3
matplotlib             3.7.2
matplotlib-inline      0.1.6
nest-asyncio           1.5.7
numba                  0.57.1
numpy                  1.24.4
orjson                 3.9.2
packaging              23.1
pandas                 2.0.3
parso                  0.8.3
patsy                  0.5.3
pexpect                4.8.0
pickleshare            0.7.5
Pillow                 10.0.0
pip                    20.1.1
plotly                 5.15.0
plotly-resampler       0.9.1
prompt-toolkit         3.0.39
ptyprocess             0.7.0
pure-eval              0.2.2
pyarrow                12.0.1
Pygments               2.15.1
pyparsing              3.0.9
python-dateutil        2.8.2
pytz                   2023.3
qpd                    0.4.4
requests               2.31.0
retrying               1.3.4
scipy                  1.10.1
setuptools             47.1.0
six                    1.16.0
sqlglot                17.9.1
stack-data             0.6.2
statsforecast          1.5.0
statsmodels            0.14.0
tenacity               8.2.2
tqdm                   4.65.0
trace-updater          0.0.9.1
traitlets              5.9.0
triad                  0.9.1
tsdownsample           0.1.2
typing-extensions      4.7.1
tzdata                 2023.3
urllib3                2.0.4
wcwidth                0.2.6
Werkzeug               2.2.3
zipp                   3.16.2

Reproduction script

from statsforecast import StatsForecast
import pandas as pd
from statsforecast.models import AutoARIMA

df = pd.read_csv("nixtla_debug.csv")
df  = df.drop(["Unnamed: 0"], axis=1)

sf = StatsForecast([AutoARIMA(season_length=1)], freq='D')

sf.cross_validation(100, df, fitted=True)

test data: nixtla_debug.csv

Issue Severity

High: It blocks me from completing my task.

jmoralez commented 1 year ago

Hey @yuhuishi-convect, thanks for the excellent report. The problem here is that some series are too short for the horizon. In this specific case the error is raised for the unique id 'Litchi(Indian)' which has only 41 points, when we need at least 102 (2 train + 100 for validation).

We're going to add a more descriptive error message for this case but it'll still raise an error, the solution would be to either drop the series that are too short or set a shorter horizon. Please let us know if this helps.

jmoralez commented 1 year ago

Hey @chandrevdw31. Is it the same problem that I described (not enough datapoints for the cross validation settings)?

paxcema commented 1 year ago

@jmoralez hi José, the last error above is unrelated to statsforecast at all, thanks for the quick answer though!

jmoralez commented 1 year ago

Thanks for the heads up @paxcema!

For reference, we recently added an earlier validation for the original issue here (in #610), which will be raised sooner and hopefully be clearer.

@yuhuishi-convect Please let us know if you need any more help.

yuhuishi-convect commented 1 year ago

@jmoralez I tried to reduce the periods from 100 to 20. Then I got this error:

Error logs ``` In [1]: from statsforecast import StatsForecast ...: import pandas as pd ...: from statsforecast.models import AutoARIMA ...: ...: ...: df = pd.read_csv("nixtla_debug.csv") ...: df = df.drop(["Unnamed: 0"], axis=1) ...: ...: sf = StatsForecast([AutoARIMA(season_length=1)], freq='D') In [2]: sf.cross_validation(20, df, fitted=True) <__array_function__ internals>:200: RuntimeWarning: invalid value encountered in cast /home/yuhuishi/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1251: RuntimeWarning: divide by zero encountered in scalar divide fit["aicc"] = fit["aic"] + 2 * npar * (npar + 1) / (nstar - npar - 1) /home/yuhuishi/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1251: RuntimeWarning: divide by zero encountered in scalar divide fit["aicc"] = fit["aic"] + 2 * npar * (npar + 1) / (nstar - npar - 1) /home/yuhuishi/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1251: RuntimeWarning: divide by zero encountered in scalar divide fit["aicc"] = fit["aic"] + 2 * npar * (npar + 1) / (nstar - npar - 1) /home/yuhuishi/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1251: RuntimeWarning: divide by zero encountered in scalar divide fit["aicc"] = fit["aic"] + 2 * npar * (npar + 1) / (nstar - npar - 1) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[2], line 1 ----> 1 sf.cross_validation(20, df, fitted=True) File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py:1517, in StatsForecast.cross_validation(self, h, df, n_windows, step_size, test_size, input_size, level, fitted, refit, sort_df) 1503 def cross_validation( 1504 self, 1505 h: int, (...) 1514 sort_df: bool = True, 1515 ): 1516 if self._is_native(df=df): -> 1517 return super().cross_validation( 1518 h=h, 1519 df=df, 1520 n_windows=n_windows, 1521 step_size=step_size, 1522 test_size=test_size, 1523 input_size=input_size, 1524 level=level, 1525 fitted=fitted, 1526 refit=refit, 1527 sort_df=sort_df, 1528 ) 1529 assert df is not None 1530 with fa.engine_context(infer_by=[df]) as e: File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py:852, in _StatsForecast.cross_validation(self, h, df, n_windows, step_size, test_size, input_size, level, fitted, refit, sort_df) 850 _, level = self._parse_X_level(h=h, X=None, level=level) 851 if self.n_jobs == 1: --> 852 res_fcsts = self.ga.cross_validation( 853 models=self.models, 854 h=h, 855 test_size=test_size, 856 fallback_model=self.fallback_model, 857 step_size=step_size, 858 input_size=input_size, 859 fitted=fitted, 860 level=level, 861 verbose=self.verbose, 862 refit=refit, 863 ) 864 else: 865 res_fcsts = self._cross_validation_parallel( 866 h=h, 867 test_size=test_size, (...) 872 refit=refit, 873 ) File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py:332, in GroupedArray.cross_validation(self, models, h, test_size, fallback_model, step_size, input_size, fitted, level, refit, verbose) 323 res_i = fallback_model.forecast( 324 h=h, 325 y=y_train, (...) 329 **kwargs, 330 ) 331 else: --> 332 raise error 333 else: 334 if i_window == 0: 335 # for the first window we have to fit each model File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/core.py:313, in GroupedArray.cross_validation(self, models, h, test_size, fallback_model, step_size, input_size, fitted, level, refit, verbose) 311 if refit: 312 try: --> 313 res_i = model.forecast( 314 h=h, 315 y=y_train, 316 X=X_train, 317 X_future=X_future, 318 fitted=fitted, 319 **kwargs, 320 ) 321 except Exception as error: 322 if fallback_model is not None: File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/models.py:374, in AutoARIMA.forecast(self, y, h, X, X_future, level, fitted) 347 """Memory Efficient AutoARIMA predictions. 348 349 This method avoids memory burden due from object storage. (...) 371 Dictionary with entries `mean` for point predictions and `level_*` for probabilistic predictions. 372 """ 373 with np.errstate(invalid="ignore"): --> 374 mod = auto_arima_f( 375 x=y, 376 d=self.d, 377 D=self.D, 378 max_p=self.max_p, 379 max_q=self.max_q, 380 max_P=self.max_P, 381 max_Q=self.max_Q, 382 max_order=self.max_order, 383 max_d=self.max_d, 384 max_D=self.max_D, 385 start_p=self.start_p, 386 start_q=self.start_q, 387 start_P=self.start_P, 388 start_Q=self.start_Q, 389 stationary=self.stationary, 390 seasonal=self.seasonal, 391 ic=self.ic, 392 stepwise=self.stepwise, 393 nmodels=self.nmodels, 394 trace=self.trace, 395 approximation=self.approximation, 396 method=self.method, 397 truncate=self.truncate, 398 xreg=X, 399 test=self.test, 400 test_kwargs=self.test_kwargs, 401 seasonal_test=self.seasonal_test, 402 seasonal_test_kwargs=self.seasonal_test_kwargs, 403 allowdrift=self.allowdrift, 404 allowmean=self.allowmean, 405 blambda=self.blambda, 406 biasadj=self.biasadj, 407 parallel=self.parallel, 408 num_cores=self.num_cores, 409 period=self.season_length, 410 ) 411 fcst = forecast_arima(mod, h, xreg=X_future, level=level) 412 res = {"mean": fcst["mean"]} File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:2053, in auto_arima_f(x, d, D, max_p, max_q, max_P, max_Q, max_order, max_d, max_D, start_p, start_q, start_P, start_Q, stationary, seasonal, ic, stepwise, nmodels, trace, approximation, method, truncate, xreg, test, test_kwargs, seasonal_test, seasonal_test_kwargs, allowdrift, allowmean, blambda, biasadj, parallel, num_cores, period) 2041 results = np.full((nmodels, 8), np.nan) 2042 p_myarima = partial( 2043 myarima, 2044 x=x, (...) 2051 method=method, 2052 ) -> 2053 bestfit = p_myarima( 2054 order=(p, d, q), 2055 seasonal={"order": (P, D, Q), "period": m}, 2056 ) 2057 results[0] = (p, d, q, P, D, Q, constant, bestfit["ic"]) 2058 fit = p_myarima( 2059 order=(0, d, 0), 2060 seasonal={"order": (0, D, 0), "period": m}, 2061 ) File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1288, in myarima(x, order, seasonal, constant, ic, trace, approximation, offset, xreg, method, **kwargs) 1286 return fit 1287 except ValueError as e: -> 1288 raise e 1289 return {"ic": math.inf} File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:1241, in myarima(x, order, seasonal, constant, ic, trace, approximation, offset, xreg, method, **kwargs) 1237 fit = arima( 1238 x, order, seasonal, include_mean=constant, method=method, xreg=xreg 1239 ) 1240 else: -> 1241 fit = arima(x, order, include_mean=constant, method=method, xreg=xreg) 1242 # nxreg = 0 if xreg is None else xreg.shape[1] 1243 nstar = n - order[1] - seas_order[1] * m File ~/temp/nixtla/.venv/lib/python3.8/site-packages/statsforecast/arima.py:734, in arima(x, order, seasonal, xreg, include_mean, transform_pars, fixed, init, method, SSinit, optim_method, kappa, tol, optim_control) 731 n = len(x) 733 if len(order) != 3 or any(o < 0 or not isinstance(o, int) for o in order): --> 734 raise ValueError(f"order must be 3 non-negative integers, got {order}") 735 if "order" not in seasonal: 736 raise ValueError("order must be a key in seasonal") ValueError: order must be 3 non-negative integers, got (0, 0, 0) ```
jmoralez commented 1 year ago

Ah yes, I recently got that error too. It's coming from here https://github.com/Nixtla/statsforecast/blob/12e2dda56ace711383ccd66528bc1648c9fd8d3a/statsforecast/arima.py#L1816 because the series length is used to define the max terms for the order and np.sum returns an np.int, not a python int. I'll create a PR to fix this. In the meantime, can you try wrapping that in an int to see if it solves your problem? This should be the diff:

diff --git a/statsforecast/arima.py b/statsforecast/arima.py
index ccb67f0..06929dc 100644
--- a/statsforecast/arima.py
+++ b/statsforecast/arima.py
@@ -1813,7 +1813,7 @@ def auto_arima_f(
     nonmissing_idxs = np.where(~missing)[0]
     firstnonmiss = nonmissing_idxs.min()
     lastnonmiss = nonmissing_idxs.max()
-    series_len = np.sum(~missing[firstnonmiss:lastnonmiss])
+    series_len = int(np.sum(~missing[firstnonmiss:lastnonmiss]))
     x = x[firstnonmiss:]
     if xreg is not None:
         if xreg.dtype not in (np.float32, np.float64):

This is with respect to the current main so the line numbers may be different, but it's at the very start of the auto_arima_f function.

Sorry for the troubles.

yuhuishi-convect commented 1 year ago

@jmoralez should I just upgrade to a new version of statsforecast and try it against the dataset?

Since this is not released to pypi, so maybe I need to pip install against the github repo directly to get the fixed version?

jmoralez commented 1 year ago

We'll make a release in the following days, but in the meantime installing from github should work. Please let us know if you get the chance to try the fix and run into any more issues.

jmoralez commented 8 months ago

I believe this has been fixed, feel free to reopen if you run into this issue.