Nixtla / neuralforecast

Scalable and user friendly neural :brain: forecasting algorithms.
https://nixtlaverse.nixtla.io/neuralforecast
Apache License 2.0

[common] Issue with multi-gpu and `ddp_spawn` strategy when running predict #1037

Open matthieuhumeau opened 3 weeks ago

matthieuhumeau commented 3 weeks ago

What happened + What you expected to happen

The predict method fails with the following error when the model has been trained on multiple GPUs with the ddp_spawn strategy: TypeError: vstack(): argument 'tensors' (position 1) must be tuple of Tensors, not NoneType

This seems to be an issue with the PyTorch Lightning Trainer returning None when predict is called in a multi-GPU setup. There is already an existing fix for this (https://github.com/Nixtla/neuralforecast/pull/391/files), but the issue persists on my side. I was able to resolve it by modifying common/_base_windows.py to drop the strategy argument from my trainer_kwargs before prediction.

I'm using:

trainer_kwargs = {
    'accelerator': 'gpu',
    'devices': 8,
    'strategy': 'ddp_spawn',  # Distributed Data Parallel (spawn) strategy
}

Versions / Dependencies

Running this on SageMaker (AL2, 5.10.215-203.850.amzn2.x86_64)
Python 3.10
torch==2.1.0
pytorch-lightning==2.2.5
neuralforecast==1.7.2

Reproduction script

from utilsforecast.data import generate_series
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS 
from neuralforecast.losses.pytorch import DistributionLoss
import torch
torch.set_float32_matmul_precision('high')

def main():
    series = generate_series(10, min_length=200, max_length=500)
    h = 7
    valid = series.groupby('unique_id', observed=True).tail(h)
    train = series.drop(valid.index)

    trainer_kwargs = {
            'accelerator': 'gpu',
            'devices': 8,
            'strategy': 'ddp_spawn'}

    models = NBEATS(h=h,
                    input_size=7,
                    loss=DistributionLoss(distribution='Poisson', level=[90]),
                    max_steps=100,
                    scaler_type='standard',
                    **trainer_kwargs)

    model = NeuralForecast(models=[models], freq='D')
    model.fit(train)

    p = model.predict(train)

if __name__ == "__main__":
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jmoralez commented 2 weeks ago

Hey @matthieuhumeau, thanks for the detailed report. I also believe we need to reset the strategy in the current check, since it's being kept as ddp_spawn.

In the meantime you can remove that manually after training, e.g.

model.fit(train)
del model.models[0].trainer_kwargs['strategy']
p = model.predict(train)
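If a NeuralForecast object holds several models, the same cleanup can be applied to each of them. A minimal sketch of that loop, using a stand-in class so it runs anywhere (the only assumption carried over from the snippet above is that each fitted model exposes a `trainer_kwargs` dict):

```python
class FakeModel:
    """Stand-in for a fitted model; only mimics the trainer_kwargs attribute."""
    def __init__(self, trainer_kwargs):
        self.trainer_kwargs = trainer_kwargs

# Two models trained with the distributed strategy, one without it.
models = [
    FakeModel({'accelerator': 'gpu', 'devices': 8, 'strategy': 'ddp_spawn'}),
    FakeModel({'accelerator': 'gpu', 'devices': 8, 'strategy': 'ddp_spawn'}),
    FakeModel({'accelerator': 'gpu', 'devices': 1}),
]

for m in models:
    # pop with a default avoids a KeyError when no strategy was ever set
    m.trainer_kwargs.pop('strategy', None)

print(all('strategy' not in m.trainer_kwargs for m in models))  # prints True
```

Using `pop('strategy', None)` instead of `del` makes the loop safe to run on models that were never configured with a distributed strategy.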