Closed. Galaxy-Husky closed this issue 11 months ago.
This must be due to something wrong in your hardware/environment/cluster. Unfortunately there's nothing we can do to help other than point you to https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug
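For anyone following the pointer above: the linked page documents the `NCCL_DEBUG` environment variable, which has to be set before the training processes start. A minimal sketch of one way to do that from Python is below. The helper names `nccl_debug_env` and `launch_with_nccl_debug` are hypothetical and exist only for this illustration; they are not part of Lightning or NCCL.

```python
import os
import subprocess

# Log levels documented for NCCL_DEBUG in the NCCL user guide.
_NCCL_DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(base=None, level="INFO"):
    """Return a copy of an environment mapping with NCCL_DEBUG set.

    `base` defaults to the current process environment. The input mapping
    is never modified. (Hypothetical helper, for illustration only.)
    """
    if level not in _NCCL_DEBUG_LEVELS:
        raise ValueError(f"unsupported NCCL_DEBUG level: {level!r}")
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = level
    return env


def launch_with_nccl_debug(command, level="INFO"):
    """Run `command` (a list of arguments) with NCCL debug logging enabled.

    Returns the subprocess.CompletedProcess so the caller can inspect
    the exit code. (Hypothetical helper, for illustration only.)
    """
    return subprocess.run(command, env=nccl_debug_env(level=level), check=False)
```

Calling `launch_with_nccl_debug([sys.executable, "train.py"])` would then run a script with NCCL logging turned on; `train.py` is a placeholder for whichever script raises the error. Exporting `NCCL_DEBUG=INFO` in the shell before launching has the same effect.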
I think the problem is related to PyTorch 2.1.0, because the error disappeared when I downgraded PyTorch to 2.0.1. For those who have the same problem, please see the issue I submitted to PyTorch: https://github.com/pytorch/pytorch/issues/113245.
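To check quickly whether an environment has the same PyTorch release as the one in this report, something like the following sketch could be used. It only compares version numbers; it does not test NCCL itself, and 2.1.0 is simply the version reported here, not a confirmed range of affected releases. The function names are hypothetical.

```python
import re
from importlib import metadata

# The release reported in this thread; an assumption taken from the report,
# not a confirmed list of affected versions.
REPORTED_RELEASE = (2, 1, 0)


def parse_release(version):
    """Turn a version string such as '2.1.0+cu118' into a tuple like (2, 1, 0).

    The local build suffix after '+' is ignored, and missing components
    are treated as 0.
    """
    public = version.split("+", 1)[0]
    numbers = []
    for piece in public.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def matches_reported_release(version):
    """Return True if `version` is the same release as the one in the report."""
    return parse_release(version) == REPORTED_RELEASE


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
```

For example, `matches_reported_release(installed_torch_version() or "0")` tells you whether the local install is on the reported release, in which case trying 2.0.1 as described above may be worth a test.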
Bug description
Hi!
I am studying the Lightning Fabric examples. When I tried to run the script at https://github.com/Lightning-AI/lightning/tree/master/examples/fabric/language_model with multiple GPUs using the dp or ddp strategy, it raised NCCL errors.
I'm not sure whether the issue has to do with Fabric or with my NCCL. Could you help me?
What version are you seeing the problem on?
v2.1
How to reproduce the bug
Error messages and logs
Environment
Current environment
* CUDA:
    - GPU:
        - NVIDIA A100-SXM4-40GB
        - NVIDIA A100-SXM4-40GB
        - NVIDIA A100-SXM4-40GB
        - NVIDIA A100-SXM4-40GB
    - available: True
    - version: 11.8
* Lightning:
    - lightning: 2.1.0
    - lightning-cloud: 0.5.46
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.0.7
    - pytorch-optimizer: 2.12.0
    - torch: 2.1.0
    - torch-tb-profiler: 0.4.1
    - torchaudio: 2.1.0
    - torchinfo: 1.8.0
    - torchmetrics: 1.0.3
    - torchvision: 0.16.0
* Packages:
    - absl-py: 1.4.0
    - aiohttp: 3.8.5
    - aiosignal: 1.3.1
    - alembic: 1.11.3
    - annotated-types: 0.5.0
    - anyio: 3.7.1
    - argcomplete: 3.1.1
    - arrow: 1.2.3
    - asttokens: 2.2.1
    - async-timeout: 4.0.3
    - attrs: 23.1.0
    - backcall: 0.2.0
    - backoff: 2.2.1
    - backports.functools-lru-cache: 1.6.5
    - beautifulsoup4: 4.12.2
    - blessed: 1.19.1
    - blinker: 1.6.2
    - blis: 0.7.10
    - boto3: 1.28.76
    - botocore: 1.31.76
    - brotli: 1.0.9
    - build: 0.10.0
    - cachecontrol: 0.13.1
    - cachetools: 5.3.1
    - catalogue: 2.0.9
    - certifi: 2023.7.22
    - cffi: 1.15.1
    - charset-normalizer: 3.2.0
    - cleo: 2.0.1
    - click: 8.1.7
    - cmaes: 0.10.0
    - colorama: 0.4.6
    - colorlog: 6.7.0
    - confection: 0.1.1
    - contourpy: 1.1.0
    - crashtest: 0.4.1
    - croniter: 1.4.1
    - cryptography: 41.0.3
    - cupy: 12.2.0
    - cycler: 0.11.0
    - cymem: 2.0.7
    - dataclasses: 0.8
    - datasets: 2.14.4
    - dateutils: 0.6.12
    - decorator: 5.1.1
    - deepdiff: 6.3.1
    - dill: 0.3.7
    - distlib: 0.3.7
    - docstring-parser: 0.15
    - dulwich: 0.21.5
    - en-core-web-sm: 3.6.0
    - exceptiongroup: 1.1.3
    - executing: 1.2.0
    - fastapi: 0.101.1
    - fastrlock: 0.8
    - filelock: 3.12.2
    - fonttools: 4.42.1
    - frozenlist: 1.4.0
    - fsspec: 2023.6.0
    - gmpy2: 2.1.2
    - google-auth: 2.17.3
    - google-auth-oauthlib: 1.0.0
    - greenlet: 2.0.2
    - grpcio: 1.56.2
    - h11: 0.14.0
    - huggingface-hub: 0.16.4
    - idna: 3.4
    - importlib-metadata: 6.8.0
    - importlib-resources: 6.0.1
    - inquirer: 3.1.3
    - installer: 0.7.0
    - ipdb: 0.13.13
    - ipython: 8.14.0
    - itsdangerous: 2.1.2
    - jaraco.classes: 3.3.0
    - jedi: 0.19.0
    - jeepney: 0.8.0
    - jinja2: 3.1.2
    - jmespath: 1.0.1
    - joblib: 1.3.2
    - jsonargparse: 4.24.0
    - jsonnet: 0.20.0
    - jsonschema: 4.17.3
    - keyring: 24.2.0
    - kiwisolver: 1.4.5
    - langcodes: 3.3.0
    - lightning: 2.1.0
    - lightning-cloud: 0.5.46
    - lightning-utilities: 0.9.0
    - mako: 1.2.4
    - markdown: 3.4.4
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.7.2
    - matplotlib-inline: 0.1.6
    - mdurl: 0.1.0
    - more-itertools: 10.1.0
    - mpmath: 1.3.0
    - msgpack: 1.0.5
    - multidict: 6.0.4
    - multiprocess: 0.70.15
    - munkres: 1.1.4
    - murmurhash: 1.0.9
    - networkx: 3.1
    - numpy: 1.25.2
    - nvidia-ml-py: 12.535.77
    - nvitop: 1.2.0
    - oauthlib: 3.2.2
    - optuna: 3.3.0
    - ordered-set: 4.1.0
    - orjson: 3.9.5
    - packaging: 23.1
    - pandas: 2.0.3
    - parso: 0.8.3
    - pathy: 0.10.2
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pillow: 9.4.0
    - pip: 23.2.1
    - pkginfo: 1.9.6
    - pkgutil-resolve-name: 1.3.10
    - platformdirs: 3.10.0
    - ply: 3.11
    - poetry: 1.6.1
    - poetry-core: 1.7.0
    - poetry-plugin-export: 1.5.0
    - preshed: 3.0.8
    - prompt-toolkit: 3.0.39
    - protobuf: 4.23.3
    - psutil: 5.9.5
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - pyarrow: 12.0.1
    - pyasn1: 0.4.8
    - pyasn1-modules: 0.2.7
    - pycparser: 2.21
    - pydantic: 2.1.1
    - pydantic-core: 2.4.0
    - pygments: 2.16.1
    - pyjwt: 2.8.0
    - pyopenssl: 23.2.0
    - pyparsing: 3.0.9
    - pyproject-hooks: 1.0.0
    - pyqt5: 5.15.9
    - pyqt5-sip: 12.12.2
    - pyrsistent: 0.19.3
    - pysocks: 1.7.1
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.0.7
    - pytorch-optimizer: 2.12.0
    - pytz: 2023.3
    - pyu2f: 0.1.5
    - pyyaml: 6.0.1
    - rapidfuzz: 2.15.1
    - readchar: 4.0.5.dev0
    - regex: 2023.8.8
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - requests-toolbelt: 1.0.0
    - rich: 13.5.1
    - rootdescent: 0.1.0
    - rsa: 4.9
    - s3transfer: 0.7.0
    - sacremoses: 0.0.43
    - safetensors: 0.3.3
    - scikit-learn: 1.3.1
    - scipy: 1.11.3
    - secretstorage: 3.3.3
    - setuptools: 68.1.2
    - shellingham: 1.5.3
    - sip: 6.7.11
    - six: 1.16.0
    - smart-open: 5.2.1
    - snakeviz: 2.2.0
    - sniffio: 1.3.0
    - soupsieve: 2.3.2.post1
    - spacy: 3.6.1
    - spacy-legacy: 3.0.12
    - spacy-loggers: 1.0.4
    - sqlalchemy: 2.0.20
    - srsly: 2.4.7
    - stack-data: 0.6.2
    - starlette: 0.27.0
    - starsessions: 1.3.0
    - sympy: 1.12
    - tensorboard: 2.14.0
    - tensorboard-data-server: 0.7.0
    - termcolor: 2.3.0
    - thinc: 8.1.12
    - threadpoolctl: 3.2.0
    - tokenizers: 0.14.1
    - toml: 0.10.2
    - tomli: 2.0.1
    - tomlkit: 0.12.1
    - torch: 2.1.0
    - torch-tb-profiler: 0.4.1
    - torchaudio: 2.1.0
    - torchinfo: 1.8.0
    - torchmetrics: 1.0.3
    - torchvision: 0.16.0
    - tornado: 6.3.3
    - tqdm: 4.66.1
    - traitlets: 5.9.0
    - transformers: 4.35.0
    - triton: 2.1.0
    - trove-classifiers: 2023.8.7
    - typer: 0.9.0
    - typeshed-client: 2.3.0
    - typing-extensions: 4.7.1
    - tzdata: 2023.3
    - unicodedata2: 15.0.0
    - urllib3: 1.26.18
    - uvicorn: 0.23.2
    - validators: 0.21.2
    - virtualenv: 20.24.3
    - wasabi: 1.1.2
    - wcwidth: 0.2.6
    - websocket-client: 1.6.2
    - websockets: 11.0.3
    - werkzeug: 2.3.7
    - wheel: 0.41.2
    - xxhash: 0.0.0
    - yarl: 1.9.2
    - zipp: 3.16.2
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.12
    - release: 6.2.0-32-generic
    - version: 32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2
More info
No response