Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

cross_validation of DistributedMLForecast not working when n_windows > 2 #251

Closed wregter closed 10 months ago

wregter commented 10 months ago

What happened + What you expected to happen

Thank you for developing this amazing Python library!

My issue:

Using cross_validation with DistributedMLForecast does not work for me when n_windows > 2; it fails with a schema error. I suspect the problem is in how the per-window result dataframes are merged when there are more than two of them. The dataset in the reproduction script has more than enough observations for the cross-validation.
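As a rough sanity check on the claim that the data is long enough, the number of observations that cross-validation holds out can be estimated from n_windows, h and the step size. This is only a sketch: `required_length` is a hypothetical helper, it assumes the step size defaults to h when not given, and the 50-observation figure is an assumed length for the shortest generated series, not something stated in this report.

```python
from typing import Optional


def required_length(
    n_windows: int, h: int, step_size: Optional[int] = None, max_lag: int = 1
) -> int:
    """Rough lower bound on the observations one series needs.

    Assumes consecutive windows start `step_size` apart (defaulting to `h`)
    and that at least `max_lag + 1` observations must remain for training.
    """
    if step_size is None:
        step_size = h
    # Observations held out across all validation windows.
    held_out = h + step_size * (n_windows - 1)
    return held_out + max_lag + 1


needed = required_length(n_windows=3, h=5, max_lag=1)
shortest_series = 50  # assumption: length of the shortest generated series
print(needed, shortest_series >= needed)  # prints "17 True"
```

Under these assumptions 17 observations per series would be enough, which supports the view that the failure is not caused by insufficient data.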

The error I get is the following:

SchemaError: Schema can't be empty

SchemaError                               Traceback (most recent call last)
File , line 10
      6 series_spark = spark.createDataFrame(series).repartitionByRange(20, "unique_id")
      8 fcst = DistributedMLForecast(models=[SparkLGBMForecast()], freq="D", lags=[1])
---> 10 cv_results = fcst.cross_validation(series_spark, n_windows=3, h=5)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/mlforecast/utils.py:164, in old_kw_to_pos.<locals>.decorator.<locals>.inner(*args, **kwargs)
    162 new_args.append(kwargs.pop(arg_names[i]))
    163 new_args.append(kwargs.pop(old_name))
--> 164 return f(*new_args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/mlforecast/distributed/forecast.py:653, in DistributedMLForecast.cross_validation(self, df, n_windows, h, id_col, time_col, target_col, step_size, static_features, dropna, keep_last_n, refit, before_predict_callback, after_predict_callback, input_size, data, window_size)
    651 if len(results) == 2:
    652     return fa.union(results[0], results[1])
--> 653 return fa.union(results[0], results[1], results[2:])

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/execution/api.py:869, in union(df1, df2, distinct, engine, engine_conf, as_fugue, as_local, *dfs)
    866         res = e.union(res, as_fugue_engine_df(e, odf), distinct=distinct)
    867     return res
--> 869 return run_engine_function(
    870     _union,
    871     engine=engine,
    872     engine_conf=engine_conf,
    873     as_fugue=as_fugue,
    874     as_local=as_local,
    875     infer_by=[df1, df2, dfs],
    876 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/execution/api.py:172, in run_engine_function(func, engine, engine_conf, as_fugue, as_local, infer_by)
    153 """Run a lambda function based on the engine provided
    154
    155 :param engine: an engine like object, defaults to None
   (...)
    169 This function is for deveopment use. Users should not need it.
    170 """
    171 with engine_context(engine, engine_conf=engine_conf, infer_by=infer_by) as e:
--> 172     res = func(e)
    174 if isinstance(res, DataFrame):
    175     res = e.convert_yield_dataframe(res, as_local=as_local)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/execution/api.py:866, in union.<locals>._union(e)
    864 res = e.union(edf1, edf2, distinct=distinct)
    865 for odf in dfs:
--> 866     res = e.union(res, as_fugue_engine_df(e, odf), distinct=distinct)
    867 return res

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:111, in conditional_dispatcher.<locals>._run.<locals>._Dispatcher.__call__(self, *args, **kwds)
    110 def __call__(self, *args: Any, **kwds: Any) -> Any:
--> 111     return self.run_top(*args, **kwds)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:268, in ConditionalDispatcher.run_top(self, *args, **kwargs)
    263 def run_top(self, *args: Any, **kwargs: Any) -> Any:
    264     """Execute the first matching child function
    265
    266     :return: the return of the child function
    267     """
--> 268     return list(itertools.islice(self.run(*args, **kwargs), 1))[0]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:261, in ConditionalDispatcher.run(self, *args, **kwargs)
    259         has_return = True
    260 if not has_return:
--> 261     yield self._func(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/execution/api.py:139, in as_fugue_engine_df(engine, df, schema)
    128 """Convert a dataframe to a Fugue engine dependent DataFrame.
    129 This function is used internally by Fugue. It is not recommended
    130 to use
   (...)
    136 :return: the engine dependent DataFrame
    137 """
    138 if schema is None:
--> 139     fdf = as_fugue_df(df)
    140 else:
    141     fdf = as_fugue_df(df, schema=schema)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/dataframe/dataframe.py:464, in as_fugue_df(df, **kwargs)
    459 def as_fugue_df(df: AnyDataFrame, **kwargs: Any) -> DataFrame:
    460     """Wrap the object as a Fugue DataFrame.
    461
    462     :param df: the object to wrap
    463     """
--> 464     ds = as_fugue_dataset(df, **kwargs)
    465     if isinstance(ds, DataFrame):
    466         return ds

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:111, in conditional_dispatcher.<locals>._run.<locals>._Dispatcher.__call__(self, *args, **kwds)
    110 def __call__(self, *args: Any, **kwds: Any) -> Any:
--> 111     return self.run_top(*args, **kwds)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:268, in ConditionalDispatcher.run_top(self, *args, **kwargs)
    263 def run_top(self, *args: Any, **kwargs: Any) -> Any:
    264     """Execute the first matching child function
    265
    266     :return: the return of the child function
    267     """
--> 268     return list(itertools.islice(self.run(*args, **kwargs), 1))[0]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/utils/dispatcher.py:258, in ConditionalDispatcher.run(self, *args, **kwargs)
    256 for f in self._funcs:
    257     if self._match(f[2], *args, **kwargs):
--> 258         yield f[3](*args, **kwargs)
    259         has_return = True
    260 if not has_return:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/dataframe/array_dataframe.py:128, in _arr_to_fugue(df, **kwargs)
    126 @as_fugue_dataset.candidate(lambda df, **kwargs: isinstance(df, list), priority=0.9)
    127 def _arr_to_fugue(df: List[Any], **kwargs: Any) -> ArrayDataFrame:
--> 128     return ArrayDataFrame(df, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/dataframe/array_dataframe.py:41, in ArrayDataFrame.__init__(self, df, schema)
     39     self._native = df.as_array(schema.names, type_safe=False)
     40 elif isinstance(df, Iterable):
---> 41     super().__init__(schema)
     42     self._native = df if isinstance(df, List) else list(df)
     43 else:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/fugue/dataframe/dataframe.py:44, in DataFrame.__init__(self, schema)
     42 super().__init__()
     43 if not callable(schema):
---> 44     schema = _input_schema(schema).assert_not_empty()
     45     schema.set_readonly()
     46 self._schema: Union[Schema, Callable[[], Schema]] = schema

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c61b7c9a-05b7-4802-a72b-408338297e70/lib/python3.10/site-packages/triad/collections/schema.py:152, in Schema.assert_not_empty(self)
    150 if len(self) > 0:
    151     return self
--> 152 raise SchemaError("Schema can't be empty")

SchemaError: Schema can't be empty
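The frames at forecast.py:653 and fugue/execution/api.py:866 suggest a possible cause: the slice results[2:] seems to reach union as a single positional argument, so the loop over dfs would try to convert a Python list into a dataframe (the _arr_to_fugue frame) and find no schema. The following is a toy sketch of that pattern, not fugue's or mlforecast's actual code; `Frame`, `union` and the row check are hypothetical stand-ins chosen only to make the difference between passing a slice and unpacking it visible.

```python
from typing import List, Tuple

# Hypothetical stand-in for a dataframe: a list of row tuples.
Frame = List[Tuple]


def union(df1: Frame, df2: Frame, *dfs: Frame) -> Frame:
    """Toy variadic union that concatenates the rows of every frame it receives."""
    out = list(df1) + list(df2)
    for extra in dfs:
        for row in extra:
            # A frame's elements must be rows; a nested list means a whole
            # list of frames was passed where a single frame was expected.
            if not isinstance(row, tuple):
                raise TypeError(f"expected a row, got {type(row).__name__}")
        out.extend(extra)
    return out


# One small frame per cross-validation window.
results = [[("a", 1)], [("a", 2)], [("a", 3)]]

try:
    # The slice is passed as ONE extra argument, i.e. a list of frames.
    union(results[0], results[1], results[2:])
except TypeError as exc:
    error = str(exc)

# Unpacking the slice passes each remaining frame as its own argument.
merged = union(results[0], results[1], *results[2:])
print(error)
print(merged)
```

If this reading of the traceback is right, unpacking the remaining results (or folding them pairwise with a two-argument union) would avoid the error, but that is an inference from the frames above rather than a confirmed fix.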

Versions / Dependencies

Package Version
absl-py 1.0.0
accelerate 0.20.3
adagio 0.2.4
aiohttp 3.8.5
aiosignal 1.3.1
ansi2html 1.8.0
antlr4-python3-runtime 4.11.1
anyio 3.5.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
astor 0.8.1
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.3
attrs 21.4.0
audioread 3.0.0
azure-core 1.29.1
azure-cosmos 4.3.1
azure-storage-blob 12.17.0
azure-storage-file-datalake 12.12.0
backcall 0.2.0
bcrypt 3.2.0
beautifulsoup4 4.11.1
black 22.6.0
bleach 4.1.0
blinker 1.4
blis 0.7.10
boto3 1.24.28
botocore 1.27.28
cachetools 4.2.4
catalogue 2.0.9
category-encoders 2.6.1
certifi 2022.9.14
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.4
cloudpickle 2.0.0
cmdstanpy 1.1.0
confection 0.1.1
configparser 5.2.0
convertdate 2.4.0
cryptography 37.0.1
cycler 0.11.0
cymem 2.0.7
Cython 0.29.32
dacite 1.8.1
dash 2.14.0
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table 5.0.0
databricks-automl-runtime 0.2.17
databricks-cli 0.17.7
databricks-feature-store 0.14.1
databricks-sdk 0.1.6
dataclasses-json 0.5.14
datasets 2.13.1
db-dtypes 1.1.1
dbl-tempo 0.1.23
dbus-python 1.2.18
debugpy 1.6.0
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.4
diskcache 5.6.1
distlib 0.3.7
distro 1.7.0
distro-info 1.1+ubuntu0.1
docstring-to-markdown 0.12
entrypoints 0.4
ephem 4.1.4
evaluate 0.4.0
executing 1.2.0
facets-overview 1.0.3
fastapi 0.98.0
fastjsonschema 2.18.0
fasttext 0.9.2
filelock 3.6.0
Flask 1.1.2+db1
flatbuffers 23.5.26
fonttools 4.25.0
frozenlist 1.4.0
fs 2.4.16
fsspec 2022.7.1
fugue 0.8.6
fugue-sql-antlr 0.1.8
future 0.18.2
gast 0.4.0
gitdb 4.0.10
GitPython 3.1.27
google-api-core 2.12.0
google-auth 2.23.3
google-auth-oauthlib 1.1.0
google-cloud-bigquery 3.11.3
google-cloud-bigquery-storage 2.22.0
google-cloud-core 2.3.3
google-cloud-storage 2.10.0
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.5.0
googleapis-common-protos 1.56.4
greenlet 1.1.1
grpcio 1.48.1
grpcio-status 1.48.1
gunicorn 20.1.0
gviz-api 1.10.0
h11 0.14.0
h5py 3.7.0
hierarchicalforecast 0.3.0
holidays 0.27.1
horovod 0.28.1
htmlmin 0.1.12
httplib2 0.20.2
httptools 0.6.0
huggingface-hub 0.16.4
idna 3.3
ImageHash 4.3.1
imbalanced-learn 0.10.1
importlib-metadata 4.11.3
importlib-resources 6.0.1
ipykernel 6.17.1
ipython 8.10.0
ipython-genutils 0.2.0
ipywidgets 7.7.2
isodate 0.6.1
itsdangerous 2.0.1
jedi 0.18.1
jeepney 0.7.1
Jinja2 2.11.3
jmespath 0.10.0
joblib 1.2.0
joblibspark 0.5.1
jsonschema 4.16.0
jupyter-client 7.3.4
jupyter_core 4.11.2
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
keras 2.11.0
keyring 23.5.0
kiwisolver 1.4.2
langchain 0.0.217
langchainplus-sdk 0.0.20
langcodes 3.3.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lazy_loader 0.3
libclang 15.0.6.1
librosa 0.10.0
lightgbm 3.3.5
llvmlite 0.38.0
LunarCalendar 0.0.9
lunardate 0.2.0
Mako 1.2.0
Markdown 3.3.4
MarkupSafe 2.0.1
marshmallow 3.20.1
matplotlib 3.5.2
matplotlib-inline 0.1.6
mccabe 0.7.0
mistune 0.8.4
mleap 0.20.0
mlflow-skinny 2.5.0
mlforecast 0.10.0
more-itertools 8.10.0
msgpack 1.0.5
multidict 6.0.4
multimethod 1.9.1
multiprocess 0.70.12.2
murmurhash 1.0.9
mypy-extensions 0.4.3
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.5.0
nest-asyncio 1.5.5
networkx 2.8.4
ninja 1.11.1
nltk 3.7
nodeenv 1.8.0
notebook 6.4.12
numba 0.55.1
numexpr 2.8.4
numpy 1.21.6
oauthlib 3.2.0
openai 0.27.8
openapi-schema-pydantic 1.2.4
opt-einsum 3.3.0
orjson 3.9.9
packaging 21.3
pandas 1.4.4
pandas-gbq 0.19.2
pandocfilters 1.5.0
paramiko 2.9.2
parso 0.8.3
pathspec 0.9.0
pathy 0.10.2
patsy 0.5.2
petastorm 0.12.1
pexpect 4.8.0
phik 0.12.3
pickleshare 0.7.5
Pillow 9.2.0
pip 22.2.2
platformdirs 2.5.2
plotly 5.9.0
plotly-resampler 0.9.1
pluggy 1.0.0
pmdarima 2.0.3
pooch 1.7.0
preshed 3.0.8
prometheus-client 0.14.1
prompt-toolkit 3.0.36
prophet 1.1.4
proto-plus 1.22.3
protobuf 4.24.4
psutil 5.9.0
psycopg2 2.9.3
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 8.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybind11 2.11.1
pycparser 2.21
pydantic 1.10.6
pydata-google-auth 1.8.2
pyflakes 3.0.1
Pygments 2.11.2
PyGObject 3.42.1
PyJWT 2.3.0
pyluach 2.2.0
PyMeeus 0.5.12
PyNaCl 1.5.0
pyodbc 4.0.32
pyparsing 3.0.9
pyright 1.1.294
pyrsistent 0.18.0
pytesseract 0.3.10
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.2
python-dotenv 1.0.0
python-editor 1.0.4
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.7.1
pytoolconfig 1.2.2
pytz 2022.1
PyWavelets 1.3.0
PyYAML 6.0
pyzmq 23.2.0
qpd 0.4.4
quadprog 0.1.11
regex 2022.7.9
requests 2.28.1
requests-oauthlib 1.3.1
responses 0.18.0
retrying 1.3.4
rope 1.7.0
rsa 4.9
s3transfer 0.6.0
safetensors 0.3.2
scikit-learn 1.1.1
scipy 1.9.1
seaborn 0.11.2
SecretStorage 3.3.1
Send2Trash 1.8.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 63.4.1
shap 0.41.0
simplejson 3.17.6
six 1.16.0
slicer 0.0.7
smart-open 5.2.1
smmap 5.0.0
sniffio 1.2.0
soundfile 0.12.1
soupsieve 2.3.1
soxr 0.3.6
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
spark-tensorflow-distributor 1.0.0
SQLAlchemy 1.4.39
sqlglot 18.15.1
sqlparse 0.4.2
srsly 2.4.7
ssh-import-id 5.11
stack-data 0.6.2
starlette 0.27.0
statsforecast 1.5.0
statsmodels 0.13.2
tabulate 0.8.10
tangled-up-in-unicode 0.2.0
tenacity 8.1.0
tensorboard 2.11.0
tensorboard-data-server 0.6.1
tensorboard-plugin-profile 2.11.2
tensorboard-plugin-wit 1.8.1
tensorflow-cpu 2.11.1
tensorflow-estimator 2.11.0
tensorflow-io-gcs-filesystem 0.33.0
termcolor 2.3.0
terminado 0.13.1
testpath 0.6.0
thinc 8.1.12
threadpoolctl 2.2.0
tiktoken 0.4.0
tokenize-rt 4.2.1
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1+cpu
torchvision 0.14.1+cpu
tornado 6.1
tqdm 4.64.1
trace-updater 0.0.9.1
traitlets 5.1.1
transformers 4.30.2
triad 0.9.1
tsdownsample 0.1.2
typeguard 2.13.3
typer 0.7.0
typing_extensions 4.3.0
typing-inspect 0.9.0
ujson 5.4.0
unattended-upgrades 0.1
urllib3 1.26.11
utilsforecast 0.0.10
uvicorn 0.23.2
uvloop 0.17.0
virtualenv 20.16.3
visions 0.7.5
wadllib 1.3.6
wasabi 1.1.2
watchfiles 0.19.0
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
websockets 11.0.3
Werkzeug 2.0.3
whatthepatch 1.0.2
wheel 0.37.1
widgetsnbextension 3.6.1
window-ops 0.0.14
wordcloud 1.9.2
workalendar 17.0.0
wrapt 1.14.1
xgboost 1.7.6
xxhash 3.3.0
yapf 0.31.0
yarl 1.9.2
ydata-profiling 4.2.0
zipp 3.8.0

Reproduction script

from mlforecast.utils import generate_daily_series
from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast

series = generate_daily_series(10, equal_ends=True)
series_spark = spark.createDataFrame(series).repartitionByRange(20, "unique_id")

fcst = DistributedMLForecast(models=[SparkLGBMForecast()], freq="D", lags=[1])

cv_results = fcst.cross_validation(series_spark, n_windows=3, h=5)

Issue Severity

Medium: It is a significant difficulty but I can work around it.