Lightning-AI / pytorch-lightning


NCCL error of fabric when using dp or ddp strategy #18933

Closed · Galaxy-Husky closed this 11 months ago

Galaxy-Husky commented 1 year ago

Bug description

Hi!

I am studying the Lightning Fabric examples. When I tried to run the script https://github.com/Lightning-AI/lightning/tree/master/examples/fabric/language_model on multiple GPUs with the dp or ddp strategy, it raised NCCL errors.

I'm not sure whether the issue lies with Fabric or with my NCCL setup. Could you help me?
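As far as I can tell, the only relevant difference from the upstream example is the Fabric(...) constructor. A minimal sketch of the two variants I tried (the device indices are just the GPUs I happened to pick on my machine):

import lightning as L

# Variant 1: "dp" strategy -> fails with "RuntimeError: NCCL Error 3" (error message below)
# fabric = L.Fabric(accelerator="cuda", strategy="dp", devices=[0, 2])

# Variant 2: "ddp" strategy -> fails with "ncclInternalError" during fabric.setup() (error message below)
fabric = L.Fabric(accelerator="cuda", strategy="ddp", devices=[0, 2])
fabric.launch()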

What version are you seeing the problem on?

v2.1

How to reproduce the bug

import lightning as L
import torch
import torch.nn.functional as F
from lightning.pytorch.demos import Transformer, WikiText2
from torch.utils.data import DataLoader, random_split

def main():
    L.seed_everything(42)

    fabric = L.Fabric(accelerator='cuda', strategy='ddp', devices=[0, 2])
    fabric.launch()

    # Data
    dataset = WikiText2()
    train_dataloader, val_dataloader, _ = get_dataloaders(dataset)

    # Model
    model = Transformer(vocab_size=dataset.vocab_size)

    # Optimizer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model, optimizer = fabric.setup(model, optimizer)
    train_dataloader, val_dataloader = fabric.setup_dataloaders(train_dataloader, val_dataloader)
    train(fabric, model, optimizer, train_dataloader, val_dataloader)

def train(fabric, model, optimizer, train_dataloader, val_dataloader, max_epochs=20):
    for epoch in range(max_epochs):
        train_epoch(fabric, model, optimizer, train_dataloader, epoch)
        val_loss = validate(fabric, model, val_dataloader)
        fabric.print(f"val loss {val_loss.item():.4f}")

def train_epoch(fabric, model, optimizer, train_dataloader, epoch):
    for batch_idx, batch in enumerate(train_dataloader):
        input, target = batch
        output = model(input, target)
        loss = F.nll_loss(output, target.view(-1))
        fabric.backward(loss)
        fabric.clip_gradients(model, optimizer, clip_val=0.25)
        optimizer.step()
        optimizer.zero_grad()

        if batch_idx % 200 == 0:
            fabric.print(f"epoch: {epoch} - iteration: {batch_idx} - loss {loss.item():.4f}")

@torch.no_grad()
def validate(fabric, model, val_dataloader):
    fabric.print("Validating ...")
    model.eval()
    losses = torch.zeros(len(val_dataloader))
    for k, batch in enumerate(val_dataloader):
        input, target = batch
        output = model(input, target)
        loss = F.nll_loss(output, target.view(-1))
        losses[k] = loss.item()
    out = losses.mean()
    model.train()
    return out

def get_dataloaders(dataset):
    n = len(dataset)
    generator = torch.Generator().manual_seed(42)
    train_dataset, val_dataset, test_dataset = random_split(dataset, [n - 4000, 2000, 2000], generator=generator)
    train_dataloader = DataLoader(train_dataset, batch_size=20, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=20, shuffle=False)
    test_dataloader = DataLoader(test_dataset, batch_size=20, shuffle=False)
    return train_dataloader, val_dataloader, test_dataloader

if __name__ == "__main__":
    main()

Error messages and logs

# Error message for dp
Traceback (most recent call last):
  File "/home/ping/TVAE/train.py", line 76, in <module>
    main()
  File "/home/ping/TVAE/train.py", line 26, in main
    train(fabric, model, optimizer, train_dataloader, val_dataloader)
  File "/home/ping/TVAE/train.py", line 31, in train
    train_epoch(fabric, model, optimizer, train_dataloader, epoch)
  File "/home/ping/TVAE/train.py", line 39, in train_epoch
    output = model(input, target)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 121, in forward
    output = self._forward_module(*args, **kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 184, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 189, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers

# Error message for ddp
Traceback (most recent call last):
  File "/home/ping/TVAE/train.py", line 76, in <module>
    main()
  File "/home/ping/TVAE/train.py", line 24, in main
    model, optimizer = fabric.setup(model, optimizer)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 238, in setup
    module, optimizers = self._strategy.setup_module_and_optimizers(  # type: ignore[assignment]
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 157, in setup_module_and_optimizers
    module = self.setup_module(module)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/ping/mambaforge/envs/lightning/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1695392020201/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, internal error - please report this issue to the NCCL developers, NCCL version 2.18.5
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found

Environment

Current environment

* CUDA:
  - GPU: NVIDIA A100-SXM4-40GB (x4)
  - available: True
  - version: 11.8
* Lightning:
  - lightning: 2.1.0
  - lightning-cloud: 0.5.46
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.7
  - pytorch-optimizer: 2.12.0
  - torch: 2.1.0
  - torch-tb-profiler: 0.4.1
  - torchaudio: 2.1.0
  - torchinfo: 1.8.0
  - torchmetrics: 1.0.3
  - torchvision: 0.16.0
* Packages: absl-py: 1.4.0 - aiohttp: 3.8.5 - aiosignal: 1.3.1 - alembic: 1.11.3 - annotated-types: 0.5.0 - anyio: 3.7.1 - argcomplete: 3.1.1 - arrow: 1.2.3 - asttokens: 2.2.1 - async-timeout: 4.0.3 - attrs: 23.1.0 - backcall: 0.2.0 - backoff: 2.2.1 - backports.functools-lru-cache: 1.6.5 - beautifulsoup4: 4.12.2 - blessed: 1.19.1 - blinker: 1.6.2 - blis: 0.7.10 - boto3: 1.28.76 - botocore: 1.31.76 - brotli: 1.0.9 - build: 0.10.0 - cachecontrol: 0.13.1 - cachetools: 5.3.1 - catalogue: 2.0.9 - certifi: 2023.7.22 - cffi: 1.15.1 - charset-normalizer: 3.2.0 - cleo: 2.0.1 - click: 8.1.7 - cmaes: 0.10.0 - colorama: 0.4.6 - colorlog: 6.7.0 - confection: 0.1.1 - contourpy: 1.1.0 - crashtest: 0.4.1 - croniter: 1.4.1 - cryptography: 41.0.3 - cupy: 12.2.0 - cycler: 0.11.0 - cymem: 2.0.7 - dataclasses: 0.8 - datasets: 2.14.4 - dateutils: 0.6.12 - decorator: 5.1.1 - deepdiff: 6.3.1 - dill: 0.3.7 - distlib: 0.3.7 - docstring-parser: 0.15 - dulwich: 0.21.5 - en-core-web-sm: 3.6.0 - exceptiongroup: 1.1.3 - executing: 1.2.0 - fastapi: 0.101.1 - fastrlock: 0.8 - filelock: 3.12.2 - fonttools: 4.42.1 - frozenlist: 1.4.0 - fsspec: 2023.6.0 - gmpy2: 2.1.2 - google-auth: 2.17.3 - google-auth-oauthlib: 1.0.0 - greenlet: 2.0.2 - grpcio: 1.56.2 - h11: 0.14.0 - huggingface-hub: 0.16.4 - idna: 3.4 - importlib-metadata: 6.8.0 - importlib-resources: 6.0.1 - inquirer: 3.1.3 - installer: 0.7.0 - ipdb: 0.13.13 - ipython: 8.14.0 - itsdangerous: 2.1.2 - jaraco.classes: 3.3.0 - jedi: 0.19.0 - jeepney: 0.8.0 - jinja2: 3.1.2 - jmespath: 1.0.1 - joblib: 1.3.2 - jsonargparse: 4.24.0 - jsonnet: 0.20.0 - jsonschema: 4.17.3 - keyring: 24.2.0 - kiwisolver: 1.4.5 - langcodes: 3.3.0 - lightning: 2.1.0 - lightning-cloud: 0.5.46 - lightning-utilities: 0.9.0 - mako: 1.2.4 - markdown: 3.4.4 - markdown-it-py: 3.0.0 - markupsafe: 2.1.3 - matplotlib: 3.7.2 - matplotlib-inline: 0.1.6 - mdurl: 0.1.0 - more-itertools: 10.1.0 - mpmath: 1.3.0 - msgpack: 1.0.5 - multidict: 6.0.4 - multiprocess: 0.70.15 - munkres: 1.1.4 - murmurhash: 1.0.9 - networkx: 3.1 - numpy: 1.25.2 - nvidia-ml-py: 12.535.77 - nvitop: 1.2.0 - oauthlib: 3.2.2 - optuna: 3.3.0 - ordered-set: 4.1.0 - orjson: 3.9.5 - packaging: 23.1 - pandas: 2.0.3 - parso: 0.8.3 - pathy: 0.10.2 - pexpect: 4.8.0 - pickleshare: 0.7.5 - pillow: 9.4.0 - pip: 23.2.1 - pkginfo: 1.9.6 - pkgutil-resolve-name: 1.3.10 - platformdirs: 3.10.0 - ply: 3.11 - poetry: 1.6.1 - poetry-core: 1.7.0 - poetry-plugin-export: 1.5.0 - preshed: 3.0.8 - prompt-toolkit: 3.0.39 - protobuf: 4.23.3 - psutil: 5.9.5 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - pyarrow: 12.0.1 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.7 - pycparser: 2.21 - pydantic: 2.1.1 - pydantic-core: 2.4.0 - pygments: 2.16.1 - pyjwt: 2.8.0 - pyopenssl: 23.2.0 - pyparsing: 3.0.9 - pyproject-hooks: 1.0.0 - pyqt5: 5.15.9 - pyqt5-sip: 12.12.2 - pyrsistent: 0.19.3 - pysocks: 1.7.1 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.7 - pytorch-optimizer: 2.12.0 - pytz: 2023.3 - pyu2f: 0.1.5 - pyyaml: 6.0.1 - rapidfuzz: 2.15.1 - readchar: 4.0.5.dev0 - regex: 2023.8.8 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - requests-toolbelt: 1.0.0 - rich: 13.5.1 - rootdescent: 0.1.0 - rsa: 4.9 - s3transfer: 0.7.0 - sacremoses: 0.0.43 - safetensors: 0.3.3 - scikit-learn: 1.3.1 - scipy: 1.11.3 - secretstorage: 3.3.3 - setuptools: 68.1.2 - shellingham: 1.5.3 - sip: 6.7.11 - six: 1.16.0 - smart-open: 5.2.1 - snakeviz: 2.2.0 - sniffio: 1.3.0 - soupsieve: 2.3.2.post1 - spacy: 3.6.1 - spacy-legacy: 3.0.12 - spacy-loggers: 1.0.4 - sqlalchemy: 2.0.20 - srsly: 2.4.7 - stack-data: 0.6.2 - starlette: 0.27.0 - starsessions: 1.3.0 - sympy: 1.12 - tensorboard: 2.14.0 - tensorboard-data-server: 0.7.0 - termcolor: 2.3.0 - thinc: 8.1.12 - threadpoolctl: 3.2.0 - tokenizers: 0.14.1 - toml: 0.10.2 - tomli: 2.0.1 - tomlkit: 0.12.1 - torch: 2.1.0 - torch-tb-profiler: 0.4.1 - torchaudio: 2.1.0 - torchinfo: 1.8.0 - torchmetrics: 1.0.3 - torchvision: 0.16.0 - tornado: 6.3.3 - tqdm: 4.66.1 - traitlets: 5.9.0 - transformers: 4.35.0 - triton: 2.1.0 - trove-classifiers: 2023.8.7 - typer: 0.9.0 - typeshed-client: 2.3.0 - typing-extensions: 4.7.1 - tzdata: 2023.3 - unicodedata2: 15.0.0 - urllib3: 1.26.18 - uvicorn: 0.23.2 - validators: 0.21.2 - virtualenv: 20.24.3 - wasabi: 1.1.2 - wcwidth: 0.2.6 - websocket-client: 1.6.2 - websockets: 11.0.3 - werkzeug: 2.3.7 - wheel: 0.41.2 - xxhash: 0.0.0 - yarl: 1.9.2 - zipp: 3.16.2
* System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.10.12
  - release: 6.2.0-32-generic
  - version: 32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2

More info

No response

carmocca commented 11 months ago

This must be due to something wrong in your hardware/environment/cluster. Unfortunately there's nothing we can do to help other than point you to https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug
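A minimal sketch of how you might turn that logging on (assuming the standard NCCL_DEBUG / NCCL_DEBUG_SUBSYS environment variables, which must be set before the NCCL communicator is created, i.e. before fabric.launch()):

import os

# Assumption: standard NCCL debug env vars; NCCL will then print where
# initialization/topology detection fails (the "busid of node nic" step).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"

import lightning as L

fabric = L.Fabric(accelerator="cuda", strategy="ddp", devices=[0, 2])
fabric.launch()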

Galaxy-Husky commented 11 months ago

I think the problem has something to do with PyTorch 2.1.0: when I downgraded PyTorch to 2.0.1, the error disappeared. For anyone who hits the same problem, please see the issue I filed on the PyTorch tracker: https://github.com/pytorch/pytorch/issues/113245.
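If you want to confirm which torch / NCCL combination you are running before or after the downgrade, a quick check using torch's version helpers is:

import torch

# The failure above was observed with torch 2.1.0 (bundled NCCL 2.18.5)
# and reportedly disappears with torch 2.0.1.
print(torch.__version__)            # e.g. "2.1.0"
print(torch.cuda.nccl.version())    # e.g. (2, 18, 5) for the bundled NCCL
print(torch.version.cuda)           # CUDA version this torch build targets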