Unable to perform make_evaluation_predictions

Description

When I call make_evaluation_predictions in my backtesting function, execution does not stop after the evaluation progress bar completes:

Running evaluation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.65it/s]

Instead, new training processes are started. They first exhaust the VRAM and the scheduler, so that no further CUDA processes can be started. They then occupy about 99% of the available CPU cycles (the kernel does not kill the processes, so I assume usage is not quite 100%) while new training cycles keep starting.

To Reproduce

Plug this function into your code and pass it a test dataset and a trained predictor:

from gluonts.evaluation import Evaluator, make_evaluation_predictions


def backtest_model(test_dataset, model):
    # Make forecast
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=test_dataset,  # test dataset
        predictor=model,  # model
        num_samples=100,  # number of sample paths we want for evaluation
    )

    forecasts = list(forecast_it)
    tss = list(ts_it)

    # Calculate accuracy metrics
    evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])
    agg_metrics, item_metrics = evaluator(iter(tss), iter(forecasts), num_series=len(test_dataset))

    return agg_metrics, item_metrics
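
One way to narrow this down is sketched below, untested against this setup. It rests on an assumption: that the extra processes come from the Evaluator's worker pool, which by default uses one worker per CPU, rather than from make_evaluation_predictions itself. The function name backtest_model_single_process is made up for illustration. The only functional change is num_workers=0, which makes the Evaluator compute metrics in the calling process.

```python
def backtest_model_single_process(test_dataset, model):
    """Variant of backtest_model that keeps metric computation in this process."""
    # Imported inside the function so the sketch can be pasted into an
    # existing script without touching its import section.
    from gluonts.evaluation import Evaluator, make_evaluation_predictions

    forecast_it, ts_it = make_evaluation_predictions(
        dataset=test_dataset,  # test dataset
        predictor=model,  # trained predictor
        num_samples=100,  # number of sample paths used for evaluation
    )
    forecasts = list(forecast_it)
    tss = list(ts_it)

    # Assumption: the extra processes come from the Evaluator's worker pool.
    # num_workers=0 disables that pool, so no child processes are created here.
    evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9], num_workers=0)
    agg_metrics, item_metrics = evaluator(
        iter(tss), iter(forecasts), num_series=len(test_dataset)
    )
    return agg_metrics, item_metrics
```

If the runaway training disappears with this variant, the calling script is probably being re-imported by worker processes. In that case, the next thing to check would be whether the training code is guarded by `if __name__ == "__main__":`.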

Error message or code output

The cascade of events is too long to paste in its entirety so here is a tiny snippet of what ends up on the console:


`Trainer.fit` stopped: `max_epochs=75` reached.
Epoch 74: : 50it [00:07,  6.77it/s, v_num=83, val_loss=361.0, train_loss=173.0]
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/utilities/parsing.py:197: UserWarning: Attribute 'model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['model'])`.
  rank_zero_warn(
Running evaluation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.65it/s]
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/utilities/parsing.py:197: UserWarning: Attribute 'model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['model'])`.
  rank_zero_warn(
...
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/utilities/parsing.py:197: UserWarning: Attribute 'model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['model'])`.
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/utilities/parsing.py:197: UserWarning: Attribute 'model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['model'])`.
  rank_zero_warn(
...
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
...
 File "/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
...
 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                           | Params | In sizes                                                                           | Out sizes
----------------------------------------------------------------------------------------------------------------------------------------------------------
0 | model | TemporalFusionTransformerModel | 222 K  | [[1, 30], [1, 30], [1, 1], [1, 1], [1, 37, 4], [1, 37, 0], [1, 30, 0], [1, 30, 0]] | [1, 9, 7]
----------------------------------------------------------------------------------------------------------------------------------------------------------
222 K     Trainable params
0         Non-trainable params
222 K     Total params
0.890     Total estimated model params size (MB)
...
Sanity Checking: 0it [00:00, ?it/s]/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
...
  rank_zero_warn(
Epoch 0: : 5it [00:00,  5.36it/s, v_num=84]/home/******/anaconda3/envs/gluonts/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
...

Environment

Operating system: 5.15.68.1-microsoft-standard-WSL2+ #2 SMP Sun Oct 2 09:50:15 CEST 2022 x86_64 x86_64 x86_64 GNU/Linux
Python version: 3.10.9
GluonTS version: 0.12.7
MXNet version: 1.9.1
PyTorch version: 2.0.1
Lightning version: 2.0.2
