ValueError: Trial returned a result which did not include the specified metric(s) `loss` that `tune.TuneConfig()` expects. while

xiao-he commented 1 year ago

Does anybody else get "ValueError: Trial returned a result which did not include the specified metric(s) loss that tune.TuneConfig() expects." when running the LongHorizon_with_NHITS.ipynb notebook?

kdgutier commented 1 year ago

Hi @xiao-he,

Thanks for reporting this. Would you be able to provide a bit more information on the bug, for me to take a closer look?

xiao-he commented 1 year ago

Hi @kdgutier,

Thanks for the reply. Here is the information to reproduce the bug:

OS: CentOS 8.3

Setup:
conda create --name test python=3.8
conda activate test
pip install neuralforecast datasetsforecast matplotlib

Python script (copied from https://colab.research.google.com/github/Nixtla/neuralforecast/blob/main/nbs/examples/LongHorizon_with_NHITS.ipynb):

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ray import tune
from neuralforecast.auto import AutoNHITS
from neuralforecast.core import NeuralForecast
from neuralforecast.losses.pytorch import MAE
from neuralforecast.losses.numpy import mae, mse
from datasetsforecast.long_horizon import LongHorizon

import logging
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)

# Change this to your own data to try the model
Y_df, _, _ = LongHorizon.load(directory='./', group='ETTm2')
Y_df['ds'] = pd.to_datetime(Y_df['ds'])

# For this exercise we are going to take 20% of the DataSet
n_time = len(Y_df.ds.unique())
val_size = int(.2 * n_time)
test_size = int(.2 * n_time)

# We are going to plot the temperature of the transformer
# and mark the validation and train splits
u_id = 'HUFL'
x_plot = pd.to_datetime(Y_df[Y_df.unique_id==u_id].ds)
y_plot = Y_df[Y_df.unique_id==u_id].y.values

x_val = x_plot[n_time - val_size - test_size]
x_test = x_plot[n_time - test_size]

horizon = 96 # 24 hours = 96 steps of 15 min (4 per hour)

# Use your own config or AutoNHITS.default_config
nhits_config = {
    "learning_rate": tune.choice([1e-3]),                                     # Initial Learning rate
    "max_steps": tune.choice([1000]),                                         # Number of SGD steps
    "input_size": tune.choice([5 * horizon]),                                 # input_size = multiplier * horizon
    "batch_size": tune.choice([7]),                                           # Number of series in windows
    "windows_batch_size": tune.choice([256]),                                 # Number of windows in batch
    "n_pool_kernel_size": tune.choice([[2, 2, 2], [16, 8, 1]]),               # MaxPool's Kernelsize
    "n_freq_downsample": tune.choice([[168, 24, 1], [24, 12, 1], [1, 1, 1]]), # Interpolation expressivity ratios
    "activation": tune.choice(['ReLU']),                                      # Type of non-linear activation
    "n_blocks":  tune.choice([[1, 1, 1]]),                                    # Blocks per each 3 stacks
    "mlp_units":  tune.choice([[[512, 512], [512, 512], [512, 512]]]),        # 2 512-Layers per block for each stack
    "interpolation_mode": tune.choice(['linear']),                            # Type of multi-step interpolation
    "random_seed": tune.randint(1, 10),
    }

# Fit and predict
fcst = NeuralForecast(
    models=[AutoNHITS(h=horizon, config=nhits_config,
                      num_samples=1)],  # control of hyperopt samples
    freq='15min')

fcst_df = fcst.cross_validation(df=Y_df, val_size=val_size,
                                test_size=test_size, n_windows=None)

Error:

Traceback (most recent call last):
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tuner.py", line 234, in fit
    return self._local_tuner.fit()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 283, in fit
    analysis = self._fit_internal(trainable, param_space)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 380, in _fit_internal
    analysis = run(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tune.py", line 722, in run
    runner.step()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 879, in step
    self._wait_and_handle_event(next_trial)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 858, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 846, in _wait_and_handle_event
    self._on_training_result(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 971, in _on_training_result
    self._process_trial_results(trial, result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1055, in _process_trial_results
    decision = self._process_trial_result(trial, result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1092, in _process_trial_result
    self._validate_result_metrics(flat_result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1188, in _validate_result_metrics
    raise ValueError(
ValueError: Trial returned a result which did not include the specified metric(s) loss that tune.TuneConfig() expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: {'trial_id': '3999c_00000', 'experiment_id': 'b5365bf9bc8c4630b158e2d2b9234b57', 'date': '2022-11-30_09-21-00', 'timestamp': 1669771260, 'pid': 836053, 'hostname': 'iZbp1c50jxlq7oez1iedf0Z', 'node_ip': '172.20.82.219', 'done': True, 'config/learning_rate': 0.001, 'config/max_steps': 5, 'config/input_size': 480, 'config/batch_size': 7, 'config/windows_batch_size': 256, 'config/n_pool_kernel_size': [16, 8, 1], 'config/n_freq_downsample': [168, 24, 1], 'config/activation': 'ReLU', 'config/n_blocks': [1, 1, 1], 'config/mlp_units': [[512, 512], [512, 512], [512, 512]], 'config/interpolation_mode': 'linear', 'config/random_seed': 5, 'config/loss': MAE(), 'config/h': 96}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "test_nhits.py", line 62, in <module>
    fcst_df = fcst.cross_validation(df=Y_df, val_size=val_size,
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/core.py", line 262, in cross_validation
    model.fit(dataset=self.dataset,
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/common/_base_auto.py", line 152, in fit
    results = tune_model(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/common/_base_auto.py", line 78, in tune_model
    results = tuner.fit()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tuner.py", line 236, in fit
    raise TuneError(
ray.tune.error.TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/xiao/ray_results/train_tune_2022-11-30_09-20-55") to resume.

kdgutier commented 1 year ago

From a quick search I found a similar error in Keras:

https://stackoverflow.com/questions/73512216/ray-tune-valueerror-trial-returned-a-result-which-did-not-include-the-specifi

They point to the TuneReportCallback class, which is meant to be used as part of the PyTorch Lightning trainer.

https://docs.ray.io/en/latest/_modules/ray/tune/integration/pytorch_lightning.html

I can continue to take a closer look.

xiao-he commented 1 year ago

@kdgutier I still cannot make it work. Do you have a solution?

kdgutier commented 1 year ago

Hey @xiao-he,

Would you be able to check your versions of PyTorch Lightning and Ray? The expected versions are pytorch-lightning==1.6.5 and ray[tune]==2.0.1.
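One way to check, sketched with only the standard library. The helper name `installed_versions` is made up for this example, and the package list is just the packages mentioned in this thread:

```python
# Sketch: report installed package versions using only the standard library.
# `installed_versions` is a hypothetical helper, not part of neuralforecast or ray.
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["pytorch-lightning", "ray", "neuralforecast"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same conda environment as the failing script should show whether the installed versions match the ones listed above.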

For the moment, I recommend running the experiments on the Google Colab GPU; they are reasonably fast. https://colab.research.google.com/github/Nixtla/neuralforecast/blob/main/nbs/examples/LongHorizon_with_NHITS.ipynb

I will take a closer look at it. For the moment, our tests only cover the latest Ubuntu release. If you are running the experiment on a cloud instance, you might want to try Ubuntu.

xiao-he commented 1 year ago

@kdgutier Yes, I have pytorch-lightning==1.6.5 ray[tune]==2.0.1.

Thank you, I already use the GPU to run the experiments. I will try Ubuntu.

kdgutier commented 1 year ago

From the logs of your error, it seems that Tune is failing to add 'loss' to the result dictionary. Similar issues have also been reported for Hugging Face, PyTorch Lightning and Keras.
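To make that concrete, here is a small illustration using an abbreviated copy of the Result from the traceback. The function `missing_metrics` is a simplified stand-in written for this example, not Ray's actual validation code:

```python
# Illustration only: a simplified stand-in for the strict metric check,
# not Ray's actual implementation.


def missing_metrics(result, expected):
    """Return the expected metric names that are absent from a flat result dict."""
    return [name for name in expected if name not in result]


# Abbreviated copy of the Result printed in the traceback.
result = {
    "trial_id": "3999c_00000",
    "done": True,
    "config/learning_rate": 0.001,
    "config/max_steps": 5,
    # The loss object passed in through the config (shown here as a string),
    # which is not the same thing as a reported "loss" metric.
    "config/loss": "MAE()",
    "config/h": 96,
}

print(missing_metrics(result, ["loss"]))  # prints ['loss']
```

The result carries 'config/loss' but no top-level 'loss' key, which is what the error message complains about.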

The TuneCallback class in ray.tune.integration.pytorch_lightning is a Ray 2.1.0 addition. https://docs.ray.io/en/latest/_modules/ray/tune/integration/pytorch_lightning.html We blocked that version because we were having problems with some hyperparameter optimization backends.
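The error message itself names one diagnostic option: setting the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. A minimal sketch follows, assuming the variable needs to be set in the driver process before the tuning run starts. It only turns off the check, so the missing metric would presumably still be missing, and it should be treated as a way to investigate rather than a fix:

```python
import os

# Assumption: setting the variable before importing ray (and before the
# tuning run starts) is the conservative choice.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"

# ... then build NeuralForecast / AutoNHITS and call cross_validation
# exactly as in the reproduction script.
```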

xiao-he commented 1 year ago

@kdgutier It works with Ubuntu. Thanks.

kdgutier commented 1 year ago

@xiao-he, glad to hear that.

Nixtla / neuralforecast, issue #340