Nixtla / neuralforecast

Scalable and user friendly neural :brain: forecasting algorithms.
https://nixtlaverse.nixtla.io/neuralforecast
Apache License 2.0

Getting no best trial found for the error metric for daily granularity #1010

Closed sm-ak-r33 closed 1 month ago

sm-ak-r33 commented 1 month ago

What happened + What you expected to happen

I am trying to run the code below on 240 time series at daily granularity, but I get an error saying that no best trial was found for the error metric and that all trial results are zero. The same dataset worked fine when I aggregated it to monthly and weekly granularity.
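For reference, the weekly and monthly runs mentioned above were presumably produced by aggregating the daily long-format frame first, along these lines (a sketch; the column names `unique_id`, `ds`, `y` and the use of a sum aggregation are assumptions, not taken from the report):

```python
import pandas as pd

# Hypothetical aggregation of the daily long-format frame to a coarser granularity.
def aggregate(df_daily: pd.DataFrame, freq: str) -> pd.DataFrame:
    return (
        df_daily
        .set_index("ds")
        .groupby("unique_id")["y"]
        .resample(freq)   # "W" for weekly, "MS" for month start
        .sum()
        .reset_index()
    )

df_weekly = aggregate(df_daily, "W")
df_monthly = aggregate(df_daily, "MS")
```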

Versions / Dependencies

%pip install neuralforecast "torch<2.0.0"

Reproduction script

%pip install "flaml[automl]"

from flaml import AutoML
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


class EnsembleModelTrainer:
    def __init__(self, dataframe, unique_id_column, target_column, horizon):
        self.df = dataframe
        self.unique_id_column = unique_id_column
        self.target_column = target_column
        self.horizon = horizon
        self.models = {}

    def fit_models(self):
        unique_ids = self.df[self.unique_id_column].unique()
        estimator_list = ["lgbm", "rf", "xgboost"]  # Define the list of estimators
        for uid in unique_ids:
            df_filtered = self.df[self.df[self.unique_id_column] == uid]
            df_filtered = df_filtered.drop(columns=[self.unique_id_column])  # Assuming no additional columns are needed for modeling

            # Initialize models for each estimator
            for estimator in estimator_list:
                self.models[(uid, estimator)] = AutoML()

            automl_settings = {
                "time_budget": 10,  # in seconds
                "metric": 'mae',
                "task": 'regression',
                "eval_method": 'auto',
                "model_history": True
            }

            # Splitting data using TimeSeriesSplit with rolling window of 5
            custom_cv = self.custom_cv_generator(df_filtered)
            for train_index, test_index in custom_cv.split(df_filtered):
                X_train, X_test = df_filtered.iloc[train_index], df_filtered.iloc[test_index]
                y_train = X_train[self.target_column]
                X_train = X_train.drop(columns=[self.target_column])
                for estimator in estimator_list:
                    self.models[(uid, estimator)].fit(X_train=X_train, y_train=y_train, estimator_list=[estimator], **automl_settings)

    def custom_cv_generator(self, df):
        tscv = TimeSeriesSplit(n_splits=(len(df) - self.horizon) // 5)
        return tscv

    def predict(self, dataframe):
        results = pd.DataFrame(columns=['unique_id'] + [f'forecast_{estimator}' for estimator in ["lgbm", "rf", "xgboost"]])
        for uid in self.df[self.unique_id_column].unique():
            df_filtered = dataframe[dataframe[self.unique_id_column] == uid]
            df_filtered = df_filtered.drop(columns=[self.unique_id_column])  # Exclude ID for prediction
            last_rows = df_filtered.tail(self.horizon)  # Get the last 'horizon' rows for each UID
            forecast_dict = {}
            for estimator in ["lgbm", "rf", "xgboost"]:
                automl = self.models[(uid, estimator)]
                forecast = []
                for i in range(self.horizon):
                    prediction = automl.predict(last_rows.iloc[[i]])[0]
                    forecast.append(prediction)
                forecast_dict[f'forecast_{estimator}'] = forecast
            forecast_dict['unique_id'] = [uid] * self.horizon
            uid_results = pd.DataFrame(forecast_dict)
            results = pd.concat([results, uid_results], ignore_index=True, sort=False)
        return results
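For context, a usage sketch of the class above (the file name, column names, and feature layout here are assumptions, not part of the original report):

```python
# Hypothetical usage of EnsembleModelTrainer defined above.
df = pd.read_csv("daily_sales.csv")   # assumed columns: unique_id, y, plus numeric feature columns
trainer = EnsembleModelTrainer(
    dataframe=df,
    unique_id_column="unique_id",
    target_column="y",
    horizon=365,
)
trainer.fit_models()                  # fits one FLAML AutoML run per (series, estimator) pair
forecasts = trainer.predict(df)       # frame with forecast_lgbm / forecast_rf / forecast_xgboost per unique_id
```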

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoLSTM, AutoRNN

horizon = 365
config = dict(max_steps=100, val_check_steps=1, input_size=-1)

# Configure models
models = [
    AutoLSTM(h=horizon, config=config, num_samples=20),
    AutoRNN(h=horizon, config=config, num_samples=20),
]

# Initialize NeuralForecast
nf = NeuralForecast(models=models, freq='D')

# Fit model
nf.fit(df=df_daily, val_size=365, sort_df=True, verbose=True)
Y_hat_df = nf.predict()
Y_hat_df = Y_hat_df.reset_index()
Y_hat_df.to_csv('./daily_neuralforecast.csv')
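The frame `df_daily` itself is not shown in the script. As a stand-in for illustration only, NeuralForecast expects a long-format frame with `unique_id`, `ds`, and `y` columns; the series names, dates, and values below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_daily: 240 daily series in long format.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=1500, freq="D")
df_daily = pd.concat(
    [
        pd.DataFrame({"unique_id": f"series_{i}", "ds": dates, "y": rng.random(len(dates))})
        for i in range(240)
    ],
    ignore_index=True,
)
```

With `val_size=365` and `h=365`, each series presumably needs well over 730 daily observations for the split to be feasible.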

Issue Severity

High: It blocks me from completing my task.

sm-ak-r33 commented 1 month ago

Full error here:

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'loss' is an instance of nn.Module and is already saved during checkpointing. It is recommended to ignore them using self.save_hyperparameters(ignore=['loss']).
/usr/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = _posixsubprocess.fork_exec(
2024-05-15 18:47:14,081 INFO worker.py:1749 -- Started a local Ray instance.
2024-05-15 18:47:15,647 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call ray.init(...) before Tuner(...).
+--------------------------------------------------------------------+
| Configuration for experiment _train_tune_2024-05-15_18-47-11       |
+--------------------------------------------------------------------+
| Search algorithm                        BasicVariantGenerator      |
| Scheduler                               FIFOScheduler              |
| Number of trials                        20                         |
+--------------------------------------------------------------------+

View detailed results here: /root/ray_results/_train_tune_2024-05-15_18-47-11
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts
(_train_tune pid=1995) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead.
(_train_tune pid=1995) Seed set to 1
(_train_tune pid=1995) GPU available: True (cuda), used: True
(_train_tune pid=1995) TPU available: False, using: 0 TPU cores
(_train_tune pid=1995) IPU available: False, using: 0 IPUs
(_train_tune pid=1995) HPU available: False, using: 0 HPUs
(_train_tune pid=1995) Trainer(val_check_interval=1) was configured so validation will run after every batch.
(_train_tune pid=1995) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00000_0_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=1995) 2024-05-15 18:47:26.258578: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:26.258632: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:26.393991: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:27.873538: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=1995) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=1995)
(_train_tune pid=1995)   | Name            | Type          | Params
(_train_tune pid=1995) --------------------------------------------------
(_train_tune pid=1995) 0 | loss            | MAE           | 0
(_train_tune pid=1995) 1 | padder          | ConstantPad1d | 0
(_train_tune pid=1995) 2 | scaler          | TemporalNorm  | 0
(_train_tune pid=1995) 3 | hist_encoder    | LSTM          | 484 K
(_train_tune pid=1995) 4 | context_adapter | Linear        | 733 K
(_train_tune pid=1995) 5 | mlp_decoder     | MLP           | 2.4 K
(_train_tune pid=1995) --------------------------------------------------
(_train_tune pid=1995) 1.2 M     Trainable params
(_train_tune pid=1995) 0         Non-trainable params
(_train_tune pid=1995) 1.2 M     Total params
(_train_tune pid=1995) 4.880     Total estimated model params size (MB)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
2024-05-15 18:47:30,505 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00000
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=1995, ip=172.28.0.12, actor_id=5c789dcf4a6d166909d4404d01000000, repr=_train_tune)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
    output = fn()
  File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner
    return trainable(config, **fn_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _train_tune
    _ = self._fit_model(
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model
    model = model.fit(
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit
    return self._fit(
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit
    trainer.fit(model, datamodule=datamodule)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step
    output = self(windows_batch)  # tuple([B, seq_len, H, output])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward
    output = self.mlp_decoder(context)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward
    return self.layers(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward
    return F.relu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu
    result = torch.relu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 20522 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00000 errored after 0 iterations at 2024-05-15 18:47:30. Total running time: 14s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00000_0_2024-05-15_18-47-16/error.txt

[The log repeats for the subsequent trials: _train_tune_4003e_00001 through _train_tune_4003e_00005, and onwards, each error after 0 iterations with an identical torch.cuda.OutOfMemoryError raised during the validation sanity check.]
(_train_tune pid=2555) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2555) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2555) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2555) 4 | context_adapter | Linear | 733 K (_train_tune pid=2555) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2555) -------------------------------------------------- (_train_tune pid=2555) 1.2 M Trainable params (_train_tune pid=2555) 0 Non-trainable params (_train_tune pid=2555) 1.2 M Total params (_train_tune pid=2555) 4.880 Total estimated model params size (MB) Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:48:44,144 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00006 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2555, ip=172.28.0.12, actor_id=7dc8acbdaa622f3d1410253f01000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in 
_run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 27769 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00006 errored after 0 iterations at 2024-05-15 18:48:44. Total running time: 1min 28s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00006_6_2024-05-15_18-47-16/error.txt (_train_tune pid=2641) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=2641) Seed set to 1 (_train_tune pid=2641) GPU available: True (cuda), used: True (_train_tune pid=2641) TPU available: False, using: 0 TPU cores (_train_tune pid=2641) IPU available: False, using: 0 IPUs (_train_tune pid=2641) HPU available: False, using: 0 HPUs (_train_tune pid=2641) Trainer(val_check_interval=1) was configured so validation will run after every batch. (_train_tune pid=2641) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00007_7_2024-05-15_18-47-16/lightning_logs (_train_tune pid=2641) 2024-05-15 18:48:50.935377: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered (_train_tune pid=2641) 2024-05-15 18:48:50.935427: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered (_train_tune pid=2641) 2024-05-15 18:48:50.936953: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered (_train_tune pid=2641) 2024-05-15 18:48:52.507542: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT (_train_tune pid=2641) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] (_train_tune pid=2641) (_train_tune pid=2641) | Name | Type | Params (_train_tune pid=2641) -------------------------------------------------- (_train_tune pid=2641) 0 | loss | MAE | 0
(_train_tune pid=2641) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2641) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2641) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2641) 4 | context_adapter | Linear | 733 K (_train_tune pid=2641) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2641) -------------------------------------------------- (_train_tune pid=2641) 1.2 M Trainable params (_train_tune pid=2641) 0 Non-trainable params (_train_tune pid=2641) 1.2 M Total params (_train_tune pid=2641) 4.880 Total estimated model params size (MB) Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:48:55,324 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00007 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2641, ip=172.28.0.12, actor_id=cd65c6f74b72c63491d398b001000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in 
_run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 28983 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00007 errored after 0 iterations at 2024-05-15 18:48:55. Total running time: 1min 39s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00007_7_2024-05-15_18-47-16/error.txt

(_train_tune pid=2725) Seed set to 1 (_train_tune pid=2725) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=2725) GPU available: True (cuda), used: True (_train_tune pid=2725) TPU available: False, using: 0 TPU cores (_train_tune pid=2725) IPU available: False, using: 0 IPUs (_train_tune pid=2725) HPU available: False, using: 0 HPUs (_train_tune pid=2725) Trainer(val_check_interval=1) was configured so validation will run after every batch. (_train_tune pid=2725) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00008_8_2024-05-15_18-47-16/lightning_logs (_train_tune pid=2725) 2024-05-15 18:49:03.635740: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered (_train_tune pid=2725) 2024-05-15 18:49:03.635794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered (_train_tune pid=2725) 2024-05-15 18:49:03.637208: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered (_train_tune pid=2725) 2024-05-15 18:49:04.955429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT (_train_tune pid=2725) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] (_train_tune pid=2725) (_train_tune pid=2725) | Name | Type | Params (_train_tune pid=2725) -------------------------------------------------- (_train_tune pid=2725) 0 | loss | MAE | 0
(_train_tune pid=2725) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2725) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2725) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2725) 4 | context_adapter | Linear | 733 K (_train_tune pid=2725) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2725) -------------------------------------------------- (_train_tune pid=2725) 1.2 M Trainable params (_train_tune pid=2725) 0 Non-trainable params (_train_tune pid=2725) 1.2 M Total params (_train_tune pid=2725) 4.880 Total estimated model params size (MB) Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:49:06,998 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00008 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2725, ip=172.28.0.12, actor_id=fabdb55dfad9439e37e6f6ee01000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in 
_run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 29606 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00008 errored after 0 iterations at 2024-05-15 18:49:07. Total running time: 1min 51s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00008_8_2024-05-15_18-47-16/error.txt

(_train_tune pid=2812) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=2812) Seed set to 1 (_train_tune pid=2812) GPU available: True (cuda), used: True (_train_tune pid=2812) TPU available: False, using: 0 TPU cores (_train_tune pid=2812) IPU available: False, using: 0 IPUs (_train_tune pid=2812) HPU available: False, using: 0 HPUs (_train_tune pid=2812) Trainer(val_check_interval=1) was configured so validation will run after every batch. (_train_tune pid=2812) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00009_9_2024-05-15_18-47-16/lightning_logs (_train_tune pid=2812) 2024-05-15 18:49:17.170487: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered (_train_tune pid=2812) 2024-05-15 18:49:17.170542: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered (_train_tune pid=2812) 2024-05-15 18:49:17.171997: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered (_train_tune pid=2812) 2024-05-15 18:49:18.485404: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT (_train_tune pid=2812) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] (_train_tune pid=2812) (_train_tune pid=2812) | Name | Type | Params (_train_tune pid=2812) -------------------------------------------------- (_train_tune pid=2812) 0 | loss | MAE | 0
(_train_tune pid=2812) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2812) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2812) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2812) 4 | context_adapter | Linear | 733 K (_train_tune pid=2812) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2812) -------------------------------------------------- (_train_tune pid=2812) 1.2 M Trainable params (_train_tune pid=2812) 0 Non-trainable params (_train_tune pid=2812) 1.2 M Total params (_train_tune pid=2812) 4.880 Total estimated model params size (MB) Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:49:20,526 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00009 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2812, ip=172.28.0.12, actor_id=3c373755967d0aafb25b94ec01000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in 
_run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 30380 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00009 errored after 0 iterations at 2024-05-15 18:49:20. Total running time: 2min 4s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00009_9_2024-05-15_18-47-16/error.txt (_train_tune pid=2908) Seed set to 1 (_train_tune pid=2908) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=2908) GPU available: True (cuda), used: True (_train_tune pid=2908) TPU available: False, using: 0 TPU cores (_train_tune pid=2908) IPU available: False, using: 0 IPUs (_train_tune pid=2908) HPU available: False, using: 0 HPUs (_train_tune pid=2908) Trainer(val_check_interval=1) was configured so validation will run after every batch. (_train_tune pid=2908) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00010_10_2024-05-15_18-47-16/lightning_logs (_train_tune pid=2908) 2024-05-15 18:49:29.842498: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered (_train_tune pid=2908) 2024-05-15 18:49:29.842555: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered (_train_tune pid=2908) 2024-05-15 18:49:29.844072: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered (_train_tune pid=2908) 2024-05-15 18:49:31.184552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT (_train_tune pid=2908) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] (_train_tune pid=2908) (_train_tune pid=2908) | Name | Type | Params (_train_tune pid=2908) -------------------------------------------------- (_train_tune pid=2908) 0 | loss | MAE | 0
(_train_tune pid=2908) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2908) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2908) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2908) 4 | context_adapter | Linear | 733 K (_train_tune pid=2908) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2908) -------------------------------------------------- (_train_tune pid=2908) 1.2 M Trainable params (_train_tune pid=2908) 0 Non-trainable params (_train_tune pid=2908) 1.2 M Total params (_train_tune pid=2908) 4.880 Total estimated model params size (MB) Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:49:33,208 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00010 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2908, ip=172.28.0.12, actor_id=75e6cd94f8d65d7463d4795301000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in 
_run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 31620 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00010 errored after 0 iterations at 2024-05-15 18:49:33. Total running time: 2min 17s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00010_10_2024-05-15_18-47-16/error.txt (_train_tune pid=2993) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback instead. (_train_tune pid=2993) Seed set to 1 (_train_tune pid=2993) GPU available: True (cuda), used: True (_train_tune pid=2993) TPU available: False, using: 0 TPU cores (_train_tune pid=2993) IPU available: False, using: 0 IPUs (_train_tune pid=2993) HPU available: False, using: 0 HPUs (_train_tune pid=2993) Trainer(val_check_interval=1) was configured so validation will run after every batch. (_train_tune pid=2993) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00011_11_2024-05-15_18-47-16/lightning_logs (_train_tune pid=2993) 2024-05-15 18:49:40.330171: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered (_train_tune pid=2993) 2024-05-15 18:49:40.330237: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered (_train_tune pid=2993) 2024-05-15 18:49:40.333261: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered (_train_tune pid=2993) 2024-05-15 18:49:42.299129: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT (_train_tune pid=2993) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] (_train_tune pid=2993) (_train_tune pid=2993) | Name | Type | Params (_train_tune pid=2993) -------------------------------------------------- (_train_tune pid=2993) 0 | loss | MAE | 0
(_train_tune pid=2993) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2993) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2993) 3 | hist_encoder | LSTM | 484 K (_train_tune pid=2993) 4 | context_adapter | Linear | 733 K (_train_tune pid=2993) 5 | mlp_decoder | MLP | 2.4 K (_train_tune pid=2993) -------------------------------------------------- (_train_tune pid=2993) 1.2 M Trainable params (_train_tune pid=2993) 0 Non-trainable params (_train_tune pid=2993) 1.2 M Total params (_train_tune pid=2993) 4.880 Total estimated model params size (MB) Sanity Checking: | | 0/? [00:00<?, ?it/s] Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s] 2024-05-15 18:49:45,101 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00011 Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future result = ray.get(future) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2993, ip=172.28.0.12, actor_id=4ea9d8dd080bbcba516e06d601000000, repr=_train_tune) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train raise skipped from exception_cause(skipped) File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run self._ret = self._target(self._args, self._kwargs) File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in training_func=lambda: self._trainable_func(self.config), File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func output = fn() File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/util.py", line 130, in inner return trainable(config, fn_kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 209, in _traintune = self._fit_model( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_auto.py", line 357, in _fit_model model = model.fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 530, in fit return self._fit( File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_model.py", line 219, in _fit trainer.fit(model, datamodule=datamodule) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit call._call_and_handle_interrupt( File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt return trainer_fn(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run results = self._run_stage() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage self._run_sanity_check() File 
"/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1060, in _run_sanity_check val_loop.run() File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator return loop_run(self, args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step output = call._call_strategy_hook(trainer, hook_name, step_args) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook output = fn(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step return self.lightning_module.validation_step(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_base_recurrent.py", line 392, in validation_step output = self(windows_batch) # tuple([B, seq_len, H, output]) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/models/lstm.py", line 210, in forward output = self.mlp_decoder(context) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/neuralforecast/common/_modules.py", line 60, in forward return self.layers(x) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 101, in forward return F.relu(input, inplace=self.inplace) File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1473, in relu result = torch.relu(input) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.72 GiB. GPU 0 has a total capacity of 14.75 GiB of which 1.07 GiB is free. Process 32828 has 13.68 GiB memory in use. Of the allocated memory 13.54 GiB is allocated by PyTorch, and 10.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trial _train_tune_4003e_00011 errored after 0 iterations at 2024-05-15 18:49:45. Total running time: 2min 29s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00011_11_2024-05-15_18-47-16/error.txt

Trials _train_tune_4003e_00012 through _train_tune_4003e_00017 fail in exactly the same way: the same worker start-up messages, the same 1.2 M-parameter model summary, and the same CUDA out-of-memory traceback during the sanity-check validation step, with only the pid, actor id, process id and timestamps changing. Each of them also errors after 0 iterations:

Trial _train_tune_4003e_00012 errored after 0 iterations at 2024-05-15 18:49:56. Total running time: 2min 40s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00012_12_2024-05-15_18-47-16/error.txt
Trial _train_tune_4003e_00013 errored after 0 iterations at 2024-05-15 18:50:08. Total running time: 2min 52s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00013_13_2024-05-15_18-47-16/error.txt
Trial _train_tune_4003e_00014 errored after 0 iterations at 2024-05-15 18:50:22. Total running time: 3min 6s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00014_14_2024-05-15_18-47-16/error.txt
Trial _train_tune_4003e_00015 errored after 0 iterations at 2024-05-15 18:50:35. Total running time: 3min 19s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00015_15_2024-05-15_18-47-16/error.txt
Trial _train_tune_4003e_00016 errored after 0 iterations at 2024-05-15 18:50:49. Total running time: 3min 33s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00016_16_2024-05-15_18-47-16/error.txt
Trial _train_tune_4003e_00017 errored after 0 iterations at 2024-05-15 18:51:01. Total running time: 3min 45s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00017_17_2024-05-15_18-47-16/error.txt
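Because every trial runs out of memory inside the sanity check and errors after 0 iterations, no trial ever reports a value for the MAE tuning metric, so the tuner finishes with nothing to select from; that is what later surfaces as the "no best trial found" error on the daily data. Daily granularity also means far more observations per training window than the weekly or monthly versions of the same dataset, which is consistent with those coarser frequencies fitting fine. A minimal sketch of one way to keep the trials inside the 14.75 GiB card, assuming an AutoLSTM-style Auto model whose default search space is overridden; the horizon, the values, and the exact config keys below are illustrative and should be checked against the neuralforecast Auto model documentation:

```python
from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoLSTM
from neuralforecast.losses.pytorch import MAE

# Illustrative, deliberately small search space: shorter input windows, a smaller
# LSTM encoder and fewer series per batch all shrink the activations that blow up
# in mlp_decoder during validation.
lstm_config = {
    "input_size": tune.choice([28, 56]),            # days of history per window
    "encoder_hidden_size": tune.choice([64, 128]),  # smaller encoder
    "batch_size": tune.choice([8, 16]),             # fewer series per batch
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    "max_steps": 500,
    "random_seed": 1,
}

# h=28 is an illustrative daily horizon; df is assumed to be the long-format
# frame with columns unique_id, ds, y used for the daily experiment.
nf = NeuralForecast(
    models=[AutoLSTM(h=28, loss=MAE(), config=lstm_config, num_samples=10)],
    freq="D",
)
nf.fit(df=df)
```

Constraining input_size and batch_size is the lever that directly reduces the per-window validation tensors; the allocator setting above only helps once the individual allocations already fit on the GPU.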

(_train_tune pid=3625) [worker startup logs and model summary identical to the previous trial omitted]
2024-05-15 18:51:13,382 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00018
ray.exceptions.RayTaskError(OutOfMemoryError): same CUDA out of memory traceback as above (tried to allocate 12.72 GiB on a 14.75 GiB GPU).

Trial _train_tune_4003e_00018 errored after 0 iterations at 2024-05-15 18:51:13. Total running time: 3min 57s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00018_18_2024-05-15_18-47-16/error.txt

(_train_tune pid=3707) [worker startup logs and model summary identical to the previous trials omitted]
2024-05-15 18:51:25,685 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00019
ray.exceptions.RayTaskError(OutOfMemoryError): same CUDA out of memory traceback as above (tried to allocate 12.72 GiB on a 14.75 GiB GPU).
2024-05-15 18:51:25,714 INFO tune.py:1007 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/_train_tune_2024-05-15_18-47-11' in 0.0219s.
2024-05-15 18:51:25,720 ERROR tune.py:1035 -- Trials did not complete: [_train_tune_4003e_00000, _train_tune_4003e_00001, _train_tune_4003e_00002, _train_tune_4003e_00003, _train_tune_4003e_00004, _train_tune_4003e_00005, _train_tune_4003e_00006, _train_tune_4003e_00007, _train_tune_4003e_00008, _train_tune_4003e_00009, _train_tune_4003e_00010, _train_tune_4003e_00011, _train_tune_4003e_00012, _train_tune_4003e_00013, _train_tune_4003e_00014, _train_tune_4003e_00015, _train_tune_4003e_00016, _train_tune_4003e_00017, _train_tune_4003e_00018, _train_tune_4003e_00019]
2024-05-15 18:51:25,750 WARNING experiment_analysis.py:558 -- Could not find best trial. Did you pass the correct metric parameter?

Trial _train_tune_4003e_00019 errored after 0 iterations at 2024-05-15 18:51:25. Total running time: 4min 9s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00019_19_2024-05-15_18-47-16/error.txt


RuntimeError Traceback (most recent call last)

in ()
     14
     15 # Fit model
---> 16 nf.fit(df=df_daily, val_size=12, sort_df=True, verbose=True)
     17 Y_hat_df = nf.predict()
     18 Y_hat_df = Y_hat_df.reset_index()

2 frames
/usr/local/lib/python3.10/dist-packages/ray/tune/result_grid.py in get_best_result(self, metric, mode, scope, filter_nan_and_inf)
    159                 else "."
    160             )
--> 161             raise RuntimeError(error_msg)
    162
    163         return self._trial_to_result(best_trial)

RuntimeError: No best trial found for the given metric: loss. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.
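When every trial fails, get_best_result has no metric to select on, so Ray reports the missing `loss` metric rather than the underlying failure. The real cause is in the per-trial error files listed above; as a quick sketch (the path is copied verbatim from the log), one of them can simply be printed:

# Inspect one of the per-trial error files that Ray reported above
error_file = (
    "/tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/"
    "_train_tune_2024-05-15_18-47-11/driver_artifacts/"
    "_train_tune_4003e_00019_19_2024-05-15_18-47-16/error.txt"
)
with open(error_file) as f:
    print(f.read())  # shows the torch.cuda.OutOfMemoryError traceback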
elephaint commented 1 month ago

It seems to give a GPU OOM error, so your GPU doesn't have enough RAM to run this task. This would also explain why it runs fine on weekly/monthly data, as those have less data than daily.

The solution is to buy a GPU with more RAM, or to run a less compute-intensive experiment (e.g. use less data or a lower frequency).
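As a rough sketch (not something discussed in this thread), the memory footprint of the search can also be reduced by constraining the Auto model's hyperparameter search space, so that no trial samples a configuration too large for the ~15 GiB card. The config keys below assume an LSTM-based Auto model, and h=30 / val_size=30 are placeholder values; both may need adjusting to your neuralforecast version and setup:

from ray import tune
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoLSTM
from neuralforecast.losses.pytorch import MAE

# Smaller hidden sizes and batch sizes -> smaller activations during training/validation.
# Key names assume the LSTM-based Auto model; exact defaults differ across versions.
config = {
    "encoder_hidden_size": tune.choice([50, 100]),
    "encoder_n_layers": tune.choice([1, 2]),
    "context_size": tune.choice([5, 10]),
    "decoder_hidden_size": tune.choice([64, 128]),
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    "max_steps": tune.choice([500]),
    "batch_size": tune.choice([8, 16]),
    "random_seed": tune.randint(1, 20),
}

models = [AutoLSTM(h=30, loss=MAE(), config=config, num_samples=10)]
nf = NeuralForecast(models=models, freq="D")
nf.fit(df=df_daily, val_size=30, sort_df=True)  # df_daily as in the reproduction above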

sm-ak-r33 commented 1 month ago

> It seems to give a GPU OOM error, so your GPU doesn't have enough RAM to run this task. This would also explain why it runs fine on weekly/monthly data, as those have less data than daily.
>
> The solution is to buy a GPU with more RAM, or to run a less compute-intensive experiment (e.g. use less data or a lower frequency).

But I am running it in MS Azure on a compute instance with a GPU and 112 GB of RAM.

elephaint commented 1 month ago

> It seems to give a GPU OOM error, so your GPU doesn't have enough RAM to run this task. This would also explain why it runs fine on weekly/monthly data, as those have less data than daily. The solution is to buy a GPU with more RAM, or to run a less compute-intensive experiment (e.g. use less data or a lower frequency).
>
> But I am running it in MS Azure on a compute instance with a GPU and 112 GB of RAM.

The error relates to GPU RAM, not regular (CPU) RAM, so the GPU in your Azure machine does not have enough memory. Choose an instance with an A100, for example; that will give you more GPU RAM.
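A quick way to check how much memory the GPU itself has (as opposed to the 112 GB of system RAM) is to query it from PyTorch; a minimal sketch using standard torch.cuda calls:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Total GPU memory:      {props.total_memory / 1024**3:.2f} GiB")
    print(f"Allocated by PyTorch:  {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
    print(f"Reserved by PyTorch:   {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
else:
    print("No CUDA device visible")

The 14.75 GiB total in the traceback points to a 16 GB-class card; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message only mitigates fragmentation and is unlikely to help when a single 12.72 GiB allocation simply does not fit.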

sm-ak-r33 commented 1 month ago

Funnily enough, it worked with the same resources after I ran a loop over each unique_id:

all_predictions = []  # collect per-series forecasts

unique_ids = df_daily['unique_id'].unique()
for unique_id in unique_ids:
    df_sub = df_daily[df_daily['unique_id'] == unique_id]

    nf.fit(df=df_sub, val_size=365, sort_df=True, verbose=True)

    Y_hat_df = nf.predict()
    Y_hat_df['unique_id'] = unique_id  # Add the unique_id to the predictions

    all_predictions.append(Y_hat_df.reset_index())
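For completeness, a possible way to combine the per-series forecasts collected above, plus an optional step to release cached GPU memory between fits (the gc/empty_cache calls are an addition of mine, not something from this thread):

import gc
import pandas as pd
import torch

# Combine the per-series forecasts collected in the loop above
forecasts = pd.concat(all_predictions, ignore_index=True)

# Optionally call this at the end of each loop iteration to release cached GPU memory
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()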