Open Teculos opened 3 months ago
I think I've isolated the bug further.
It seems to be an issue between torch for GPU vs torch for CPU, since the models run with no issue on torch 2.4.0+cpu. I also tried all torch versions from 2.3.0 to 2.4.0, and all GPU-enabled versions fail.
That's good to know, because I was not able to reproduce the errors at all! I'll see if we can have a fix for that on our end, or if it's out of our control.
I found another important wrinkle: it seems the bug has to do not only with torch for GPUs but also with the number of GPUs available.
The system I'm using is managed by Slurm, and a session with the following parameters works:
srun --pty -A pmg --time 0-04:00 --gpus 1 --cpus-per-task 5 --mem-per-cpu 5G /bin/bash
but a session with the parameters below fails:
srun --pty -A pmg --time 0-04:00 --gpus 2 --cpus-per-task 5 --mem-per-cpu 5G /bin/bash
My assumption is that there is a mismatch somewhere between the number of GPUs the model definition says to use and the number the models are actually using, which causes a dimension mismatch. I'm more of a JAX guy, though, so I have no further insight into what torch/pytorch_lightning may be doing.
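One quick way to sanity-check that hypothesis (a sketch, assuming the Slurm allocation exposes the GPUs to PyTorch) is to print how many devices PyTorch actually sees inside each session:

```python
import torch

# Inside the Slurm session, check how many GPUs PyTorch actually sees.
# PyTorch Lightning will use all of them by default unless `devices` is set.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())  # expect 1 for --gpus 1, 2 for --gpus 2
```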
So I guess the issue is not as catastrophic as it might have been, but it is going to seriously limit the scalability of these models. With the recent NeurIPS 2024 workshop "Time Series in the Age of Large Models", it would be nice to have a more scalable package to use for a submission (as I intend to do).
PyTorch Lightning uses all available GPUs by default (you should see this in the logs as training starts) via data parallelism, and I think this doesn't play well with multivariate models because they require each batch to contain all of the series. A possible solution would be to set the batch size to n_series * n_gpus so that each GPU gets n_series, but I think we have checks in place that won't allow this.
In the meantime the only solution is to limit those models to one GPU, which you can do by setting devices=1 or devices=[DEVICE_ID] in the model constructor.
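For example, a minimal sketch of that workaround (the model choice and hyperparameter values are placeholders, not from the original report; it assumes trainer kwargs like devices are forwarded to the Lightning Trainer, as the constructor accepts them):

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixer

# Pin the multivariate model to a single GPU so Lightning's data parallelism
# does not split the series of a batch across devices.
model = TSMixer(
    h=12,
    input_size=24,
    n_series=2,
    max_steps=100,
    accelerator="gpu",
    devices=1,  # or devices=[DEVICE_ID] to pick a specific GPU
)

nf = NeuralForecast(models=[model], freq="M")
```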
FYI, I was just trying to run some larger LLMs (specifically Gemma2-2B) with Time-LLM, and this issue seems to appear here as well when I try to shard the model across GPUs: RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x2304 and 768x1024)
What happened + What you expected to happen
I'm seeing multiple issues (all related to matrix dimensions, it seems) for all multivariate models (except HINT, because I could not determine the S parameter from the documentation, and SOFT, which seems to work). This is reproducible in both the standard models and the Auto models.
The errors presented are not the full stack traces; they have been reduced for cleanliness:
TSMixer
TSMixerx
TimeMixer
StemGNN
MLPMultivariate
Versions / Dependencies
neuralforecast 1.7.4
datasetsforecast 0.0.8
pytorch_lightning 2.3.0
torch 2.4.0+cu121
Reproduction script
Reproduced the error with only Nixtla-related packages, including NHITS as a working example.
I also reproduced these errors with the base models.
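Since the original script is not reproduced here, the following is a hypothetical minimal reproduction along the lines described, using the AirPassengersPanel toy dataset from neuralforecast.utils (two series); model hyperparameters are placeholders:

```python
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, TSMixer
from neuralforecast.utils import AirPassengersPanel  # panel with two series: Airline1, Airline2

models = [
    # Univariate model: trains without issue even when Lightning uses 2 GPUs.
    NHITS(h=12, input_size=24, max_steps=50),
    # Multivariate model: hits the mat1/mat2 shape error when 2 GPUs are visible.
    TSMixer(h=12, input_size=24, n_series=2, max_steps=50),
]

nf = NeuralForecast(models=models, freq="M")
nf.fit(df=AirPassengersPanel)
forecasts = nf.predict()
```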
Issue Severity
High: It blocks me from completing my task.