Closed xiao-he closed 1 year ago
Hi @xiao-he,
Thanks for reporting this. Could you provide a bit more information about the bug so I can take a closer look?
Hi @kdgutier,
Thanks for the reply. Here is the information needed to reproduce the bug:
OS: CentOS 8.3
Setup:
conda create --name test python=3.8
conda activate test
pip install neuralforecast datasetsforecast matplotlib
Python script (copied from https://colab.research.google.com/github/Nixtla/neuralforecast/blob/main/nbs/examples/LongHorizon_with_NHITS.ipynb):
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ray import tune
from neuralforecast.auto import AutoNHITS
from neuralforecast.core import NeuralForecast
from neuralforecast.losses.pytorch import MAE
from neuralforecast.losses.numpy import mae, mse
from datasetsforecast.long_horizon import LongHorizon
import logging
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
# Change this to your own data to try the model
Y_df, _, _ = LongHorizon.load(directory='./', group='ETTm2')
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
# For this exercise we are going to take 20% of the dataset
n_time = len(Y_df.ds.unique())
val_size = int(.2 * n_time)
test_size = int(.2 * n_time)
# We are going to plot the temperature of the transformer
# and mark the validation and train splits
u_id = 'HUFL'
x_plot = pd.to_datetime(Y_df[Y_df.unique_id==u_id].ds)
y_plot = Y_df[Y_df.unique_id==u_id].y.values
x_val = x_plot[n_time - val_size - test_size]
x_test = x_plot[n_time - test_size]
horizon = 96  # 24 hrs = 96 * 15 min
# Use your own config or AutoNHITS.default_config
nhits_config = {
    "learning_rate": tune.choice([1e-3]),  # Initial Learning rate
    "max_steps": tune.choice([1000]),  # Number of SGD steps
    "input_size": tune.choice([5 * horizon]),  # input_size = multiplier * horizon
    "batch_size": tune.choice([7]),  # Number of series in windows
    "windows_batch_size": tune.choice([256]),  # Number of windows in batch
    "n_pool_kernel_size": tune.choice([[2, 2, 2], [16, 8, 1]]),  # MaxPool's Kernelsize
    "n_freq_downsample": tune.choice([[168, 24, 1], [24, 12, 1], [1, 1, 1]]),  # Interpolation expressivity ratios
    "activation": tune.choice(['ReLU']),  # Type of non-linear activation
    "n_blocks": tune.choice([[1, 1, 1]]),  # Blocks per each 3 stacks
    "mlp_units": tune.choice([[[512, 512], [512, 512], [512, 512]]]),  # 2 512-Layers per block for each stack
    "interpolation_mode": tune.choice(['linear']),  # Type of multi-step interpolation
    "random_seed": tune.randint(1, 10),
}
# Fit and predict
fcst = NeuralForecast(
    models=[AutoNHITS(h=horizon, config=nhits_config,
                      num_samples=1)],  # control of hyperopt samples
    freq='15min')
fcst_df = fcst.cross_validation(df=Y_df, val_size=val_size,
                                test_size=test_size, n_windows=None)
Error:
Traceback (most recent call last):
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tuner.py", line 234, in fit
return self._local_tuner.fit()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 283, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 380, in _fit_internal
analysis = run(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tune.py", line 722, in run
runner.step()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 879, in step
self._wait_and_handle_event(next_trial)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 858, in _wait_and_handle_event
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 846, in _wait_and_handle_event
self._on_training_result(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 971, in _on_training_result
self._process_trial_results(trial, result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1055, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1092, in _process_trial_result
self._validate_result_metrics(flat_result)
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1188, in _validate_result_metrics
raise ValueError(
ValueError: Trial returned a result which did not include the specified metric(s) loss that tune.TuneConfig() expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: {'trial_id': '3999c_00000', 'experiment_id': 'b5365bf9bc8c4630b158e2d2b9234b57', 'date': '2022-11-30_09-21-00', 'timestamp': 1669771260, 'pid': 836053, 'hostname': 'iZbp1c50jxlq7oez1iedf0Z', 'node_ip': '172.20.82.219', 'done': True, 'config/learning_rate': 0.001, 'config/max_steps': 5, 'config/input_size': 480, 'config/batch_size': 7, 'config/windows_batch_size': 256, 'config/n_pool_kernel_size': [16, 8, 1], 'config/n_freq_downsample': [168, 24, 1], 'config/activation': 'ReLU', 'config/n_blocks': [1, 1, 1], 'config/mlp_units': [[512, 512], [512, 512], [512, 512]], 'config/interpolation_mode': 'linear', 'config/random_seed': 5, 'config/loss': MAE(), 'config/h': 96}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test_nhits.py", line 62, in <module>
fcst_df = fcst.cross_validation(df=Y_df, val_size=val_size,
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/core.py", line 262, in cross_validation
model.fit(dataset=self.dataset,
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/common/_base_auto.py", line 152, in fit
results = tune_model(
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/neuralforecast/common/_base_auto.py", line 78, in tune_model
results = tuner.fit()
File "/home/xiao/anaconda3/envs/test/lib/python3.8/site-packages/ray/tune/tuner.py", line 236, in fit
raise TuneError(
ray.tune.error.TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/xiao/ray_results/train_tune_2022-11-30_09-20-55") to resume.
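For anyone reading the traceback: the ValueError comes from a check that the reported result contains the metric being optimized. Below is a minimal sketch of that logic, assuming a simplified check. The function name `check_result_has_metric`, its `env` parameter, and the abbreviated `result` dict are illustrative only and are not Ray's actual code; only the environment variable name is taken from the error message.

```python
import os


def check_result_has_metric(result, metric, env=None):
    """Simplified stand-in for a strict metric check (hypothetical helper).

    Raises ValueError when `metric` is missing from `result`, unless the
    TUNE_DISABLE_STRICT_METRIC_CHECKING variable is set to "1".
    """
    env = os.environ if env is None else env
    if env.get("TUNE_DISABLE_STRICT_METRIC_CHECKING", "0") == "1":
        return  # strict checking disabled, accept the result as-is
    if metric not in result:
        raise ValueError(
            f"Trial returned a result which did not include the specified metric(s) {metric}"
        )


# Abbreviated version of the result shown in the error message.
# Note: it carries 'config/loss' (a hyperparameter) but no reported 'loss' value.
result = {
    "trial_id": "3999c_00000",
    "done": True,
    "config/max_steps": 5,
    "config/loss": "MAE()",
    "config/h": 96,
}
```

With this result the check fails for `loss`, because the only loss-related key is `config/loss`. Setting the environment variable would only silence the check; the trial would still not report a loss for the tuner to compare.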
From a quick search I found a similar error in Keras:
They point to the TuneReportCallback class, which should be used as part of the PyTorch Lightning trainer.
I can continue to take a closer look.
@kdgutier I still cannot make it work. Do you have a solution?
Hey @xiao-he,
Would you be able to check the versions of pytorch lightning and ray?
pytorch-lightning==1.6.5 ray[tune]==2.0.1
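One way to check the installed versions from inside the same conda environment is a small script like the one below. It is only a sketch that relies on the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(["pytorch-lightning", "ray", "neuralforecast"])
    for name, version in found.items():
        print(f"{name}=={version}" if version else f"{name} is not installed")
```

The output can then be compared against the versions listed above.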
For now, I recommend running the experiments on the Google Colab GPU, where they are reasonably fast. https://colab.research.google.com/github/Nixtla/neuralforecast/blob/main/nbs/examples/LongHorizon_with_NHITS.ipynb
I will take a closer look at it. For the moment, our tests only cover the latest Ubuntu OS. If you are running the experiment on a cloud instance, you might want to try Ubuntu.
@kdgutier Yes, I have pytorch-lightning==1.6.5 ray[tune]==2.0.1.
Thank you, I already use the GPU to run the experiments. I will try Ubuntu.
From the logs of your error, it seems that Tune is failing to add 'loss' to the result dictionary. A similar issue has also been reported for Hugging Face, PyTorch Lightning, and Keras:
The ray.tune.integration.pytorch_lightning TuneCallback class is a Ray 2.1.0 addition.
https://docs.ray.io/en/latest/_modules/ray/tune/integration/pytorch_lightning.html
We pinned the version because we were having problems with some hyperparameter optimization backends.
@kdgutier It works with Ubuntu. Thanks.
@xiao-he, glad to hear that.
Does anybody else get "ValueError: Trial returned a result which did not include the specified metric(s) loss that tune.TuneConfig() expects." while using the LongHorizon_with_NHITS.ipynb notebook?