model training got stuck when running the official tutorial example

hxuaj commented 3 months ago

What happened + What you expected to happen

Hi, I'm new to Nixtla. When I tried to run the example code from the official tutorial on my local machine (Linux, CentOS): https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html, it got stuck at the nf.fit(df=Y_df) step:

2024-03-21 17:00:29,350 INFO worker.py:1724 -- Started a local Ray instance.
2024-03-21 17:00:29,926 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-21 17:00:29,927 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 0. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     _train_tune_2024-03-21_17-00-27   │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator             │
│ Scheduler                        FIFOScheduler                     │
│ Number of trials                 5                                 │
╰────────────────────────────────────────────────────────────────────╯

View detailed results here: /root/ray_results/_train_tune_2024-03-21_17-00-27
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/_train_tune_2024-03-21_17-00-27`
(_train_tune pid=2885517) Seed set to 11
(_train_tune pid=2885517) [rank: 0] Seed set to 11
(_train_tune pid=2885517) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

The steps I followed to set up:

1. Use conda to create a new environment with Python 3.9.
2. Run pip install statsforecast s3fs datasetsforecast, as in the tutorial example.
3. Run pip install git+https://github.com/Nixtla/neuralforecast.git@main, as in the tutorial example.
4. Run pip install matplotlib to get the 3rd step of the tutorial to work.
5. Change freq='H' to freq=1 in the following code:

nf = NeuralForecast(
    models=[
        AutoNHITS(h=48, config=config_nhits, loss=MQLoss(), num_samples=5),
        AutoLSTM(h=48, config=config_lstm, loss=MQLoss(), num_samples=2),
    ],
    freq='H'
)

because it raised: ValueError: Time column contains integers but the specified frequency is not an integer. Please provide a valid integer, e.g. 'freq=1'

I was wondering what could have gone wrong in the steps above and why training gets stuck.

Then I tried the tutorial notebook in Colab. There the fit completes, but the evaluation step, evaluation_df = accuracy(cv_df, [mse, mae, rmse], agg_by=['unique_id']), raises an error:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
    359             elif isinstance(obj, pd.DataFrame):
--> 360                 self._append_pa_schema(PD_UTILS.to_schema(obj))
    361             elif isinstance(obj, Tuple):  # type: ignore

11 frames
ValueError: pandas like datafame index can't have name

During handling of the above exception, another exception occurred:

SchemaError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/triad/collections/schema.py in append(self, obj)
    370             raise
    371         except Exception as e:
--> 372             raise SchemaError(str(e))
    373 
    374     def remove(  # noqa: C901

SchemaError: pandas like datafame index can't have name

Looking forward to your reply.

Versions / Dependencies

OS: Linux CentOS
neuralforecast 1.6.4
python 3.9.18
ray 2.9.3
torch 2.2.1
transformers 4.39.0
pandas 2.2.1

Reproduction script

Official tutorial example: https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started_complete.html

Issue Severity

High: It blocks me from completing my task.

jmoralez commented 3 months ago

Hey @hxuaj, sorry for the troubles.

The first error should be fixed by setting the CUDA_VISIBLE_DEVICES env variable to one of your devices (0 or 1), either through the terminal or in your session with os.environ.
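
A minimal sketch of the os.environ route, assuming GPU index 0 is the one to keep. CUDA reads this variable when it initializes, so the safest place for it is the top of the script, before torch or neuralforecast is imported:

```python
import os

# Assumption: GPU 0 is the device to keep; use "1" to keep the second GPU instead.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Import torch / neuralforecast only after this point, then call nf.fit(df=Y_df) as before.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"GPUs visible to this process: {visible}")
```

With only one device visible, the process should no longer wait for a second distributed member.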

The second error, I'm guessing, refers to the fact that the dataframe has an index, but we're deprecating the datasetsforecast losses, so you should do something like this instead:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

evaluation_df = evaluate(cv_df, [mse, mae, rmse])

hxuaj commented 3 months ago

Hi @jmoralez, thanks for the quick reply. For the first error: my local machine has 2 GPUs, and this seems to be a bug in PyTorch Lightning: https://github.com/Lightning-AI/pytorch-lightning/issues/4612. I didn't find a proper solution for it, but as you suggested, I can now run the model fit with only one GPU visible as a workaround. For the second error, I changed the code to:

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mse, mae, rmse

cv_df.reset_index(inplace=True)
evaluation_df = evaluate(cv_df, [mse, mae, rmse])

I just reset the index of the df before evaluation. Now it works fine.
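
For anyone hitting the same error, here is a toy illustration of what reset_index changes. It assumes cv_df comes back with unique_id as a named index, which is what the "index can't have name" message suggests; the values and the NHITS column are made up:

```python
import pandas as pd

# Hypothetical stand-in for the cross-validation output: unique_id is the index.
cv_df = pd.DataFrame(
    {
        "ds": [1, 2, 1, 2],
        "y": [1.0, 2.0, 3.0, 4.0],
        "NHITS": [1.1, 1.9, 3.2, 3.8],
    },
    index=pd.Index(["H1", "H1", "H2", "H2"], name="unique_id"),
)
print(cv_df.index.name)  # the named index that the schema check rejects

# reset_index moves unique_id into a regular column and leaves an unnamed RangeIndex.
flat_df = cv_df.reset_index()
print(list(flat_df.columns))
print(flat_df.index.name)
```

The non-inplace form is used here so the original frame stays untouched; cv_df.reset_index(inplace=True) has the same effect on cv_df itself.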

Could you update the relevant parts of the official tutorial? It can be frustrating to run into errors like these in the examples. Thank you again.

grant-d commented 2 months ago

Just add index to df before evaluation. Now it works fine.

Just ran into the same issue and your workaround fixed it, thanks @hxuaj @jmoralez. BTW, the error has a typo: datafame vs dataf_r_ame (and may as well fix the grammar too: pandas-like dataframe index can't have name).

jmoralez commented 2 months ago

BTW, the error has a typo

That's not coming from our libs, feel free to open an issue in the corresponding lib.

Nixtla / neuralforecast