Open Teculos opened 3 months ago
I think I've isolated the bug further.
It seems to be an issue between torch for GPU vs torch for CPU, since the models run with no issue on torch 2.4.0+cpu. I also tried all torch versions from 2.3.0 to 2.4.0, and all GPU-enabled versions fail.
That's good to know, because I was not able to reproduce the errors at all! I'll see if we can have a fix for that on our end, or if it's out of our control.
I found another important wrinkle: it seems the bug has to do not only with torch for GPUs but also with the number of GPUs available.
The system I'm using is managed by Slurm, and a session with the following parameters works:
srun --pty -A pmg --time 0-04:00 --gpus 1 --cpus-per-task 5 --mem-per-cpu 5G /bin/bash
but a session with the parameters below fails:
srun --pty -A pmg --time 0-04:00 --gpus 2 --cpus-per-task 5 --mem-per-cpu 5G /bin/bash
My assumption is that there is a mismatch somewhere between the number of GPUs the model definition says to use and the number the models are actually using, which causes a dimension mismatch. I'm more of a JAX guy, though, so I have no further insight into what torch/pytorch_lightning may be doing.
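One quick way to sanity-check that hypothesis (a sketch, assuming the Slurm allocation exposes the GPUs to PyTorch) is to print how many devices PyTorch actually sees inside each session:

```python
import torch

# Inside the Slurm session, check how many GPUs PyTorch actually sees.
# PyTorch Lightning will use all of them by default unless `devices` is set.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())  # expect 1 for --gpus 1, 2 for --gpus 2
```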
So I guess the issue is not as catastrophic as it might have been, but it is going to seriously limit the scalability of these models. With the recent NeurIPS 2024 workshop "Time Series in the Age of Large Models", it would be nice to have a more scalable package to use for a submission (as I intend to do).
PyTorch Lightning uses all available GPUs by default (you should see this in the logs as training starts) via data parallelism, and I think this doesn't play well with multivariate models because they require each batch to contain all of the series. A possible solution would be to set the batch size to n_series * n_gpus so that each GPU gets n_series, but I think we have checks in place that won't allow this.
In the meantime the only solution is to limit those models to one GPU, which you can do by setting devices=1 or devices=[DEVICE_ID] in the model constructor.
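For example, a minimal sketch of that workaround (the model choice and hyperparameter values are placeholders, not from the original report; it assumes trainer kwargs like devices are forwarded to the Lightning Trainer, as the constructor accepts them):

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixer

# Pin the multivariate model to a single GPU so Lightning's data parallelism
# does not split the series of a batch across devices.
model = TSMixer(
    h=12,
    input_size=24,
    n_series=2,
    max_steps=100,
    accelerator="gpu",
    devices=1,  # or devices=[DEVICE_ID] to pick a specific GPU
)

nf = NeuralForecast(models=[model], freq="M")
```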
FYI, I was just trying to run some larger LLMs (specifically Gemma2-2B) with Time-LLM, and this issue seems to appear here as well when I try to shard the model across GPUs: RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x2304 and 768x1024)
What happened + What you expected to happen
I'm seeing multiple issues (all related to matrix dimensions, it seems) for all multivariate models (except HINT, because I could not determine the S parameter from the documentation, and SOFT, which seems to work). This is reproducible in both the standard models and the Auto models.
The errors presented are not the full stack traces; they have been reduced for cleanliness:
TSMixer
TSMixerx
TimeMixer
StemGNN
MLPMultivariate
Versions / Dependencies
neuralforecast 1.7.4
datasetsforecast 0.0.8
pytorch_lightning 2.3.0
torch 2.4.0+cu121
Reproduction script
Reproduced the error with only Nixtla-related packages, including NHITS as a working example.
I also reproduced these errors with the base models.
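Since the original script is not reproduced here, the following is a hypothetical minimal reproduction along the lines described, using the AirPassengersPanel toy dataset from neuralforecast.utils (two series); model hyperparameters are placeholders:

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, TSMixer
from neuralforecast.utils import AirPassengersPanel  # panel with two series: Airline1, Airline2

models = [
    # Univariate model: trains without issue even when Lightning uses 2 GPUs.
    NHITS(h=12, input_size=24, max_steps=50),
    # Multivariate model: hits the mat1/mat2 shape error when 2 GPUs are visible.
    TSMixer(h=12, input_size=24, n_series=2, max_steps=50),
]

nf = NeuralForecast(models=models, freq="M")
nf.fit(df=AirPassengersPanel)
forecasts = nf.predict()
```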
Issue Severity
High: It blocks me from completing my task.