Lightning-AI / pytorch-lightning


Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster #19817

Open OswaldHe opened 2 months ago

OswaldHe commented 2 months ago

Bug description

I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.
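
To narrow down where the rendezvous stalls, it can help to print what each srun task sees before the Trainer is even constructed. The sketch below is only a debugging aid, assuming Lightning 2.x and its SLURMEnvironment plugin; MASTER_ADDR and MASTER_PORT may legitimately be unset at this point unless they are exported in the batch script.

# debug_env.py: print per-task scheduler/rendezvous state (launch with srun)
import os
from lightning.pytorch.plugins.environments import SLURMEnvironment

# True only if Lightning will treat this run as a SLURM-managed job
print("SLURMEnvironment.detect():", SLURMEnvironment.detect(), flush=True)

for key in ("SLURM_JOB_ID", "SLURM_NNODES", "SLURM_NTASKS", "SLURM_NTASKS_PER_NODE",
            "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID",
            "MASTER_ADDR", "MASTER_PORT"):
    print(f"{key}={os.environ.get(key)}", flush=True)

If SLURM_NTASKS is not 8, or SLURM_PROCID is missing on some tasks, the problem is likely in how the job is launched rather than in Lightning itself.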

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Training Script:


import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

SLURM batch script:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
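
If the tasks all line up but the job still hangs, more verbose rendezvous logging and an explicitly constructed SLURM plugin can show whether all 8 ranks agree on the master address and port. Below is a sketch of the relevant changes to the top of the training script, assuming Lightning 2.x; TORCH_DISTRIBUTED_DEBUG, TORCH_CPP_LOG_LEVEL, and NCCL_DEBUG are standard PyTorch/RCCL debug variables, and auto_requeue=False is only there to rule out SLURM requeue signal handling, not a confirmed fix.

import os

# enable verbose c10d and RCCL logging before any distributed initialization happens
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # RCCL honors the NCCL_* variables

import lightning as L
from lightning.pytorch.plugins.environments import SLURMEnvironment

trainer = L.Trainer(
    limit_train_batches=100,
    max_epochs=1,
    num_nodes=2,
    devices=4,
    strategy="ddp",
    # pass the cluster environment explicitly instead of relying on auto-detection
    plugins=[SLURMEnvironment(auto_requeue=False)],
)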

Error messages and logs

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8

Environment

Current environment

* CUDA:
    - GPU:
        - AMD Instinct MI100
        - AMD Instinct MI100
        - AMD Instinct MI100
        - AMD Instinct MI100
    - available: True
    - version: None
* Lightning:
    - lightning: 2.2.1
    - lightning-utilities: 0.11.2
    - pytorch-lightning: 2.2.1
    - pytorch-triton-rocm: 2.2.0
    - torch: 2.2.0+rocm5.6
    - torchaudio: 2.2.0+rocm5.6
    - torchmetrics: 1.3.2
    - torchvision: 0.17.0+rocm5.6
* Packages:
    - absl-py: 2.1.0
    - aiohttp: 3.9.3
    - aiosignal: 1.3.1
    - annotated-types: 0.6.0
    - async-timeout: 4.0.3
    - attrs: 23.2.0
    - certifi: 2022.12.7
    - charset-normalizer: 2.1.1
    - deepspeed: 0.14.0
    - filelock: 3.9.0
    - frozenlist: 1.4.1
    - fsspec: 2023.4.0
    - future: 1.0.0
    - grpcio: 1.62.1
    - hjson: 3.1.0
    - idna: 3.4
    - imageio: 2.34.0
    - jinja2: 3.1.2
    - lightning: 2.2.1
    - lightning-utilities: 0.11.2
    - markdown: 3.6
    - markupsafe: 2.1.3
    - mpmath: 1.3.0
    - multidict: 6.0.5
    - networkx: 3.2.1
    - ninja: 1.11.1.1
    - numpy: 1.26.3
    - packaging: 24.0
    - pandas: 2.2.1
    - pillow: 10.2.0
    - pip: 23.3.1
    - protobuf: 5.26.1
    - psutil: 5.9.8
    - py-cpuinfo: 9.0.0
    - pydantic: 2.7.0
    - pydantic-core: 2.18.1
    - pynvml: 11.5.0
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.2.1
    - pytorch-triton-rocm: 2.2.0
    - pytz: 2024.1
    - pyyaml: 6.0.1
    - requests: 2.28.1
    - setuptools: 68.2.2
    - six: 1.16.0
    - sympy: 1.12
    - tensorboard: 2.16.2
    - tensorboard-data-server: 0.7.2
    - test-tube: 0.7.5
    - torch: 2.2.0+rocm5.6
    - torchaudio: 2.2.0+rocm5.6
    - torchmetrics: 1.3.2
    - torchvision: 0.17.0+rocm5.6
    - tqdm: 4.66.2
    - typing-extensions: 4.8.0
    - tzdata: 2024.1
    - urllib3: 1.26.13
    - werkzeug: 3.0.1
    - wheel: 0.41.2
    - yarl: 1.9.4
* System:
    - OS: Linux
    - architecture: 64bit, ELF
    - processor: x86_64
    - python: 3.10.14
    - release: 5.14.0-162.18.1.el9_1.x86_64
    - version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023

More info

No response

jaydeepradeJD commented 1 month ago

Try using "srun python3 train.py", i.e. replace python with python3.

OswaldHe commented 1 month ago

I tried python3, but the issue still remains.

FelixBrakel commented 1 month ago

I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a batch job with sbatch.

Furkan9015 commented 1 month ago

This is a serious bottleneck, especially if you cannot run srun directly and can only submit jobs with sbatch in your environment.