Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DDP training timeout #19487

Open pengzhangzhi opened 6 months ago

pengzhangzhi commented 6 months ago

Bug description

I am using the default configs, code, and data to train a model within the BioNeMo framework. The timeout occurs in the middle of training.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

The following trainer config may be relevant to the training:

trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1
  accelerator: gpu # gpu or cpu
  precision: 16 # 16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10 # number of iterations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000
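
For orientation, the trainer section above maps roughly onto a plain PyTorch Lightning `Trainer` as in the sketch below. This is illustrative only; the actual run goes through NeMo/BioNeMo, which wraps the Trainer, and the keyword names follow the pytorch-lightning 1.9 API listed in the environment section (where `replace_sampler_ddp` still exists).

```python
# Rough Lightning-only equivalent of the trainer config above (illustrative
# sketch; the real run is constructed by NeMo/BioNeMo, not by this code).
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=8,
    num_nodes=1,
    accelerator="gpu",
    precision=16,
    logger=False,                 # logging handled by NeMo exp_manager
    enable_checkpointing=False,   # checkpointing handled by NeMo exp_manager
    replace_sampler_ddp=False,    # NeMo Megatron provides its own samplers
    max_epochs=None,              # train by max_steps instead
    log_every_n_steps=10,
    val_check_interval=int(15e4),
    limit_val_batches=50,
    limit_test_batches=500,
    accumulate_grad_batches=1,
    gradient_clip_val=1.0,
    benchmark=False,
    max_steps=500_000,
)
```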

Error messages and logs

Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7]
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.


Environment

a03-zpeng@m3dgx01:~$ pip list Package Version Location


absl-py 1.4.0 accessible-pygments 0.0.4 aiohttp 3.9.0 aiosignal 1.3.1 alabaster 0.7.13 aniso8601 9.0.1 annotated-types 0.6.0 antlr4-python3-runtime 4.9.3 apex 0.1 appdirs 1.4.4 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.2.1 astunparse 1.6.3 async-timeout 4.0.2 attrdict 2.0.1 attrs 23.1.0 audioread 3.0.0 awscli 1.29.67 Babel 2.12.1 backcall 0.2.0 beautifulsoup4 4.12.2 bionemo 0.2.0.dev0 /workspace/bionemo biopandas 0.4.1 biopython 1.79 black 23.1.0 bleach 6.0.0 blinker 1.6.2 blis 0.7.9 boto3 1.28.10 botocore 1.31.67 braceexpand 0.1.7 Brotli 1.1.0 cachetools 5.3.1 catalogue 2.0.8 cdifflib 1.2.6 certifi 2023.7.22 cffi 1.15.1 cfgv 3.4.0 charset-normalizer 3.1.0 click 8.1.7 cloudpickle 2.2.1 cmake 3.24.1.1 colorama 0.4.4 coloredlogs 15.0.1 comm 0.1.3 commonmark 0.9.1 confection 0.0.4 contourpy 1.0.7 coverage 7.4.0 crc32c 2.3.post0 cubinlinker 0.3.0+2.g87b01ae cuda-python 12.1.0rc5+1.g38940ef cudf 23.4.0 cugraph 23.4.0 cugraph-dgl 23.4.0 cugraph-service-client 23.4.0 cugraph-service-server 23.4.0 cuml 23.4.0 cupy-cuda12x 12.0.0b3 cycler 0.11.0 cymem 2.0.7 Cython 0.29.35 dacite 1.8.1 dask 2023.3.2 dask-cuda 23.4.0 dask-cudf 23.4.0 debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 dgl 1.1.3 dgllife 0.2.8 diffdock 0.0.5 dill 0.3.7 Distance 0.1.3 distlib 0.3.8 distributed 2023.3.2.1 DLLogger 1.0.0 docker-pycreds 0.4.0 docopt 0.6.2 docutils 0.16 e3nn 0.5.1 editdistance 0.6.2 einops 0.6.1 exceptiongroup 1.1.1 execnet 1.9.0 executing 1.2.0 expecttest 0.1.3 fair-esm 2.0.0 faiss-cpu 1.7.4 fastjsonschema 2.17.1 fastrlock 0.8.1 fasttext 0.9.2 filelock 3.12.2 fire 0.5.0 flash-attn 1.0.7 Flask 2.2.5 Flask-RESTful 0.3.10 flatbuffers 23.5.26 fonttools 4.47.2 frozenlist 1.3.3 fsspec 2023.5.0 ftfy 6.1.1 future 0.18.3 g2p-en 2.1.0 gast 0.4.0 gdown 4.7.1 gevent 23.9.1 geventhttpclient 2.0.2 gitdb 4.0.10 GitPython 3.1.41 google-auth 2.20.0 google-auth-oauthlib 0.4.6 graphsurgeon 0.4.6 graphviz 0.20.1 greenlet 3.0.3 grpcio 1.56.0 h5py 3.9.0 huggingface-hub 0.20.2 humanfriendly 10.0 hydra-core 1.2.0 hyperopt 0.2.7 hypothesis 5.35.1 identify 2.5.33 idna 3.4 ijson 3.2.3 imagesize 1.4.1 importlib-metadata 6.6.0 inflect 7.0.0 iniconfig 2.0.0 intel-openmp 2021.4.0 ipadic 1.0.0 ipdb 0.13.11 ipykernel 6.23.3 ipython 8.14.0 ipython-genutils 0.2.0 ipywidgets 8.0.7 isort 5.12.0 itsdangerous 2.1.2 jedi 0.18.2 jieba 0.42.1 Jinja2 3.1.2 jiwer 2.5.2 jmespath 1.0.1 joblib 1.2.0 json5 0.9.14 jsonlines 4.0.0 jsonschema 4.17.3 jupyter_client 8.3.0 jupyter_core 5.3.1 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab-pygments 0.2.2 jupyterlab-server 1.2.0 jupyterlab-widgets 3.0.8 jupytext 1.14.6 k2 1.24.3.dev20230725+cuda12.1.torch2.1.0a0 kaldi-python-io 1.2.2 kaldiio 2.18.0 kiwisolver 1.4.4 kornia 0.6.12 langcodes 3.3.0 latexcodec 2.0.1 Levenshtein 0.21.1 librosa 0.9.2 lightning-utilities 0.9.0 llvmlite 0.39.1 locket 1.0.0 loguru 0.7.0 lxml 4.9.3 Markdown 3.4.3 markdown-it-py 2.2.0 markdown2 2.4.9 MarkupSafe 2.1.3 marshmallow 3.20.1 matplotlib 3.4.3 matplotlib-inline 0.1.6 mdit-py-plugins 0.4.0 mdurl 0.1.2 mecab-python3 1.0.5 megatron-core 0.2.0 mistune 3.0.1 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 5.0.2 more-itertools 10.1.0 mpmath 0.19 msgpack 1.0.5 multidict 6.0.4 murmurhash 1.0.9 mypy-extensions 1.0.0 nbclient 0.8.0 nbconvert 7.6.0 nbformat 5.9.0 nemo-text-processing 0.1.8rc0 nemo-toolkit 1.20.0 nest-asyncio 1.5.6 networkx 2.6.3 ninja 1.11.1 nltk 3.8.1 nodeenv 1.8.0 notebook 6.4.10 numba 0.56.4+1.g5f1bc7084 numpy 1.22.2 nvidia-dali-cuda120 1.26.0 nvidia-pyindex 1.0.9 nvidia-pytriton 0.4.0 nvtx 
0.2.5 oauthlib 3.2.2 omegaconf 2.2.3 onnx 1.14.1 onnx-graphsurgeon 0.3.27 onnxruntime-gpu 1.16.3 onnxscript 0.1.0.dev20240113 OpenCC 1.1.6 opencv 4.6.0 opt-einsum 3.3.0 opt-einsum-fx 0.1.4 packaging 23.1 pandas 1.5.2 pandocfilters 1.5.0 pangu 4.0.6.1 parameterized 0.9.0 parso 0.8.3 partd 1.4.0 pathspec 0.11.1 pathtools 0.1.2 pathy 0.10.2 pexpect 4.8.0 pickleshare 0.7.5 Pillow 10.0.1 pip 21.2.4 pipdeptree 2.13.0 plac 1.3.5 platformdirs 4.1.0 pluggy 1.2.0 ply 3.11 polars 0.16.7 polygraphy 0.47.1 pooch 1.7.0 portalocker 2.7.0 POT 0.7.0 pre-commit 3.4.0 preshed 3.0.8 prettytable 3.8.0 progress 1.6 prometheus-client 0.17.0 prompt-toolkit 3.0.38 protobuf 3.20.3 psutil 5.9.4 ptxcompiler 0.8.1+1.gbe9fca5 ptyprocess 0.7.0 pure-eval 0.2.2 py 1.11.0 py-cpuinfo 9.0.0 py4j 0.10.9.7 pyannote.core 5.0.0 pyannote.database 5.0.1 pyannote.metrics 3.2.1 pyarrow 14.0.1 pyasn1 0.5.0 pyasn1-modules 0.3.0 pybind11 2.10.4 pybtex 0.24.0 pybtex-docutils 1.0.2 pycocotools 2.0+nv0.7.3 pycparser 2.21 pydantic 2.5.3 pydantic_core 2.14.6 pydata-sphinx-theme 0.13.1 pydub 0.25.1 pyfaidx 0.7.2 pyfastx 1.1.0 Pygments 2.15.1 pylibcugraph 23.4.0 pylibcugraphops 23.4.0 pylibraft 23.4.0 Pympler 1.0.1 pynini 2.1.5 pynvml 11.4.1 pyparsing 3.0.9 pypinyin 0.49.0 pypinyin-dict 0.6.0 pyrsistent 0.19.3 PySocks 1.7.1 pytest 7.4.0 pytest-cov 4.1.0 pytest-dependency 0.5.1 pytest-forked 1.6.0 pytest-rerunfailures 11.1.2 pytest-runner 6.0.0 pytest-shard 0.1.2 pytest-timeout 2.2.0 pytest-xdist 3.3.1 python-dateutil 2.8.2 python-hostlist 1.23.0 python-rapidjson 1.14 python-slugify 8.0.1 pytorch-lightning 1.9.4 pytorch-quantization 2.1.2 pytz 2023.3 PyYAML 6.0 pyzmq 23.2.1 raft-dask 23.4.0 rapidfuzz 2.13.7 rdkit 2023.9.1 rdkit-pypi 2022.9.5 regex 2023.6.3 requests 2.31.0 requests-mock 1.11.0 requests-oauthlib 1.3.1 resampy 0.4.2 rich 12.6.0 rmm 23.4.0 rouge-score 0.1.2 rsa 4.7.2 ruamel.yaml 0.17.32 ruamel.yaml.clib 0.2.7 ruff 0.0.292 s3transfer 0.7.0 sacrebleu 2.3.1 sacremoses 0.0.53 safetensors 0.3.1 scikit-learn 1.2.0 scipy 1.10.1 seaborn 0.12.2 Send2Trash 1.8.2 sentence-transformers 2.2.2 sentencepiece 0.1.99 sentry-sdk 1.28.1 setproctitle 1.3.2 setuptools 65.5.1 sh 1.14.3 shellingham 1.5.0.post1 six 1.16.0 smart-open 6.3.0 smmap 5.0.0 snowballstemmer 2.2.0 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.4.1 sox 1.4.1 spacy 3.5.3 spacy-legacy 3.0.12 spacy-loggers 1.0.4 Sphinx 5.3.0 sphinx-book-theme 1.0.0 sphinx-copybutton 0.5.2 sphinx-glpi-theme 0.3 sphinxcontrib-applehelp 1.0.4 sphinxcontrib-bibtex 2.5.0 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 2.0.1 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.5 sphinxext-opengraph 0.8.2 spyrmsd 0.5.2 srsly 2.4.6 stack-data 0.6.2 sympy 1.12 tabulate 0.9.0 tbb 2021.9.0 tblib 1.7.0 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorrt 8.6.1 termcolor 2.3.0 terminado 0.17.1 testbook 0.4.2 text-unidecode 1.3 textdistance 4.5.0 texterrors 0.4.4 tfrecord 1.14.1 thinc 8.1.10 threadpoolctl 3.1.0 thriftpy2 0.4.16 tinycss2 1.2.1 tokenizers 0.15.0 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 2.1.0a0+4136153 torch-cluster 1.6.1 torch-geometric 2.3.0 torch-scatter 2.0.9 torch-sparse 0.6.17 torch-tensorrt 1.5.0.dev0 torchaudio 2.1.0 torchdata 0.7.0a0 torchmetrics 1.0.1 torchvision 0.16.0a0 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 transformer-engine 0.9.0 transformers 4.36.0 treelite 3.2.0 treelite-runtime 3.2.0 triton 2.0.0.dev20221202 triton-model-navigator 0.7.4 tritonclient 2.41.1 typed-ast 1.5.5 typer 0.7.0 types-dataclasses 
0.6.6 typing_extensions 4.6.3 typing-inspect 0.6.0 ucx-py 0.31.0 uff 0.6.9 urllib3 1.26.16 virtualenv 20.25.0 wandb 0.15.6 wasabi 1.1.2 wcwidth 0.2.6 webdataset 0.2.33 webencodings 0.5.1 Werkzeug 2.3.6 wget 3.2 wheel 0.40.0 widgetsnbextension 4.0.8 wrapt 1.14.1 xdoctest 1.0.2 xgboost 1.7.5 yarl 1.9.2 youtokentome 1.0.6 zict 3.0.0 zipp 3.15.0 zope.event 5.0 zope.interface 6.1



More info

I am using the NVIDIA BioNeMo framework.
awaelchli commented 6 months ago

@pengzhangzhi Can you describe the steps to reproduce this? There are several notebooks in the examples folder https://github.com/NVIDIA/BioNeMo/tree/main/examples/service/notebooks but I doubt you are running these. Where is the code and config that you are running?

pengzhangzhi commented 6 months ago

Hi @awaelchli, the GitHub repo does not contain the whole training code; I got it from their Docker containers. If you want to reproduce their setup, here is the doc: https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html. I am afraid it is too much work for you to reproduce, since you have to download and prepare all the data. My problem is tricky in that the NCCL timeout happens during training, sometimes earlier, sometimes later, seemingly out of nowhere. I would like to know how to track down the error, because I have no clue given the error log. I have tried many solutions without luck, such as:

  1. increasing the timeout to a year
  2. setting a bunch of NCCL variables (both workarounds are sketched below):

export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
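
For reference, this is roughly how those two workarounds look in plain PyTorch Lightning. It is only a sketch: BioNeMo/NeMo constructs its own DDP strategy, so the exact hook point there may differ, and the `timeout` argument of `DDPStrategy` is assumed to be available in the installed Lightning version.

```python
# Sketch of the two workarounds in plain Lightning terms (not the actual
# NeMo/BioNeMo code path used in this run).
import os
from datetime import timedelta

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# 1) NCCL variables must be in the environment before the process group is
#    created, i.e. before trainer.fit() launches the DDP workers.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_P2P_DISABLE"] = "1"

# 2) Raise the collective timeout ("a year") instead of the 30-minute default;
#    DDPStrategy forwards this value to init_process_group().
strategy = DDPStrategy(process_group_backend="nccl", timeout=timedelta(days=365))
trainer = Trainer(accelerator="gpu", devices=8, strategy=strategy)
```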

pengzhangzhi commented 6 months ago

If you want to reproduce it, I am happy to help and provide a detailed guide. For simplicity, though, it would be great if you could guide me on how to debug this error. Thanks!!

pengzhangzhi commented 5 months ago

Running into the same problem. I think it is hardware-independent. The code here uses the pytorch-lightning and NeMo frameworks. It happens after 8 hours of training.

[screenshot attachment of the error]

awaelchli commented 5 months ago

Hey @pengzhangzhi, sorry, there is a lot going on recently and I'm trying to balance priorities, so apologies for missed or delayed replies.

I implemented a system check utility to help with problems like this; feel free to test it out if you have the time: https://github.com/Lightning-AI/pytorch-lightning/pull/19609. The idea is that the check is implemented in raw PyTorch, so if issues arise we would know whether the problem is in Lightning or not.

pengzhangzhi commented 5 months ago

Thanks! Exactly what I need for debugging! Do you have documentation for how to use this tool? The link in that PR isn't valid :( 📚 Documentation preview 📚: pytorch-lightning--19609.org.readthedocs.build/en/19609

awaelchli commented 5 months ago

Yes, the docs will only be generated once the PR is ready.

The easiest way to try it right now is to copy this file https://github.com/Lightning-AI/pytorch-lightning/blob/feature/system-check/src/lightning/fabric/utilities/system_check.py locally and run it with python system_check.py. It doesn't require any dependencies other than torch and psutil.
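
Roughly, the check spawns one process per GPU, initializes a NCCL process group, and runs an all-reduce followed by a barrier. A stripped-down sketch of that kind of raw-PyTorch test (not the actual system_check.py, just the same idea) looks like this:

```python
# Hypothetical minimal raw-PyTorch smoke test: one process per GPU, NCCL
# all-reduce + barrier. Useful to isolate PyTorch/NCCL issues from Lightning.
import os
from datetime import timedelta

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size,
                            timeout=timedelta(seconds=120))
    torch.cuda.set_device(rank)
    payload = torch.ones(1024, device=f"cuda:{rank}")
    dist.all_reduce(payload)   # every rank should end up with world_size
    dist.barrier()             # the barrier is where the failure shows up below
    print(f"rank {rank}: all-reduce ok, value={payload[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```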

pengzhangzhi commented 5 months ago

Thanks!! Here is the log... I can't make sense of it. FYI, the code I am having problems with has been run on two systems, and both show the same timeout problem. Additionally, I have another pytorch-lightning / torch DDP project that works fine on the current system without any timeout error.

Below is the output of `nvidia-smi`. It shows information about the GPUs that are installed on this machine, the driver version, and the maximum supported CUDA version it can run.

Thu Mar 14 17:08:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:07:00.0 Off |                    0 |
| N/A   24C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   22C    P0              55W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:47:00.0 Off |                    0 |
| N/A   22C    P0              58W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:4E:00.0 Off |                    0 |
| N/A   22C    P0              58W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off | 00000000:90:00.0 Off |                    0 |
| N/A   27C    P0              59W / 400W |      7MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off | 00000000:B7:00.0 Off |                    0 |
| N/A   51C    P0             259W / 400W |  80363MiB / 81920MiB |     90%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off | 00000000:BD:00.0 Off |                    0 |
| N/A   50C    P0             213W / 400W |  76741MiB / 81920MiB |     71%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

The matrix below shows how the GPUs in this machine are connected. NVLink (NV) is the fastest connection, and is only available on high-end systems like V100, A100, etc.

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191   3       N/A
GPU1    NV12     X  NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191   3       N/A
GPU2    NV12    NV12     X  NV12    NV12    NV12    NV12    NV12    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159   1       N/A
GPU3    NV12    NV12    NV12     X  NV12    NV12    NV12    NV12    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159   1       N/A
GPU4    NV12    NV12    NV12    NV12     X  NV12    NV12    NV12    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7       N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X  NV12    NV12    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7       N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X  NV12    SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223   5       N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X  SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223   5       N/A
NIC0    PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS             
NIC1    PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS             
NIC2    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS SYS SYS SYS SYS             
NIC3    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS SYS SYS SYS SYS             
NIC4    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX SYS SYS SYS SYS SYS SYS             
NIC5    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  SYS SYS SYS SYS SYS SYS             
NIC6    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS             
NIC7    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS             
NIC8    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS             
NIC9    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS             
NIC10   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX             
NIC11   SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11

NCCL version 2.18.1+cuda12.1
Traceback (most recent call last):
  File "/workspace/bionemo/debug.py", line 179, in <module>
    main()
  File "/workspace/bionemo/debug.py", line 48, in main
    success = _check_cuda_distributed(timeout)
  File "/workspace/bionemo/debug.py", line 84, in _check_cuda_distributed
    success = context.join(timeout=5)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/bionemo/debug.py", line 116, in _run_all_reduce_test
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 145, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3553, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1164, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7f8be9d30b00
awaelchli commented 5 months ago

This output shows that distributed PyTorch won't work on your system. It can't synchronize at the barrier, which is a very basic requirement.

There should be a system_check folder; it might contain additional NCCL logs with warnings. On rare occasions a driver update or downgrade can help, or reinstalling PyTorch in a fresh environment.

pengzhangzhi commented 5 months ago

Thanks!!

Since the error is in process 6, I am showing the log of nccl-rank-6 below:

pbg-dgx-1:1243335:1243335 [6] NCCL INFO cudaDriverVersion 12020
pbg-dgx-1:1243335:1243335 [6] NCCL INFO Bootstrap : Using enp226s0:10.148.54.242<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P plugin IBext
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/IB : No device found.
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NET/Socket : Using [0]enp226s0:10.148.54.242<0> [1]vethe42dbfb:fe80::a0ec:26ff:fe84:f001%vethe42dbfb<0> [2]vethd8013ee:fe80::f8e4:5dff:febd:1409%vethd8013ee<0> [3]vethfa79788:fe80::6813:94ff:fe9a:17e4%vethfa79788<0>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Using network Socket
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
pbg-dgx-1:1243335:1244249 [6] NCCL INFO NVLS multicast support is not available on dev 6
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
pbg-dgx-1:1243335:1244249 [6] NCCL INFO P2P Chunksize set to 524288
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
pbg-dgx-1:1243335:1244249 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 6291456 bytes
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:204 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO transport/p2p.cc:584 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1303 -> 1
pbg-dgx-1:1243335:1244302 [6] NCCL INFO proxy.cc:1377 -> 1

pbg-dgx-1:1243335:1244302 [6] proxy.cc:1518 NCCL WARN [Proxy Service 6] Failed to execute operation Setup from rank 6, retcode 1

pbg-dgx-1:1243335:1244249 [6] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer pbg-dgx-1.egr.duke.edu<50889>
pbg-dgx-1:1243335:1244249 [6] NCCL INFO misc/socket.cc:746 -> 6

pbg-dgx-1:1243335:1244249 [6] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f8be9d30b00
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport/p2p.cc:386 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:33 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO transport.cc:106 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1032 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO init.cc:1309 -> 3
pbg-dgx-1:1243335:1244249 [6] NCCL INFO group.cc:64 -> 3 [Async thread]
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:422 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO group.cc:106 -> 3
pbg-dgx-1:1243335:1243335 [6] NCCL INFO comm 0x55a0a637e960 rank 6 nranks 8 cudaDev 6 busId b7000 - Abort COMPLETE

FYI, I am using a Docker container. The issue can be reproduced with the following steps.

docker login nvcr.io
Username: $oauthtoken
Password NGc3bWIxM21mbTI0dTBraHE5N2U0NG1saWg6ZTY4MzlhZmUtYTJlZC00NDVmLThjYmEtNjA2ZTMzMzRkZTYy

Pull the BioNeMo container:

docker pull nvcr.io/nvidia/clara/bionemo-framework:1.2

Run the container:

CONTAINER="nvcr.io/nvidia/clara/bionemo-framework:1.2"
DEST_PATH="."
CONTAINER_NAME=bionemo
docker run --name $CONTAINER_NAME -itd --rm $CONTAINER bash

To reproduce my error, copy the file feature/system-check/src/lightning/fabric/utilities/system_check.py into the container and run it.

awaelchli commented 5 months ago

I won't have the bandwidth to help much here. Maybe try disabling plugins: NCCL_NET_PLUGIN=none. And if you are running inside Docker, please make sure it's picking the correct network interface. Run the system check outside the container in a clean environment to see whether it's related to the container or not.
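
For example, something along these lines before any distributed setup. This is only a sketch: NCCL_SOCKET_IFNAME selects the interface NCCL's socket transport uses, and the interface name enp226s0 is taken from the bootstrap line in the NCCL log above, so adjust it for your machine.

```python
# Illustrative sketch: disable the external NCCL net plugin and pin the
# network interface. Must run before torch.distributed / Lightning start up.
import os

os.environ["NCCL_NET_PLUGIN"] = "none"          # skip the IBext/SHARP plugin
os.environ["NCCL_SOCKET_IFNAME"] = "enp226s0"   # avoid the veth* Docker interfaces
os.environ["NCCL_DEBUG"] = "INFO"               # keep verbose logs while debugging

# ... then launch training or the system check in this process as usual.
```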

pengzhangzhi commented 5 months ago

I think the problem I have on nccl-rank-6 is just an OOM, based on this line of the log?

pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
awaelchli commented 5 months ago

If you ran my system check, that's not possible. It allocates very little memory on the GPU: https://github.com/Lightning-AI/pytorch-lightning/blob/297e9809d2da7ec2abb0f2e7c5e6c371ae0eaac8/src/lightning/fabric/utilities/system_check.py#L118

If what you show me there is the output of another program, then yes it looks like one rank runs out of memory. If one rank dies, the others will wait and hang forever.

pengzhangzhi commented 5 months ago

Yeah, I think it is because some of the GPUs are already heavily utilized, which triggers the OOM problem shown in the log. I ran your program only in the container and on the host, and both logs show OOM on the two utilized GPUs.
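
One way to work around this kind of pre-existing utilization, assuming GPUs 6 and 7 are the busy ones as in the nvidia-smi output above, is to hide them before launching; a sketch:

```python
# Sketch: exclude already-busy GPUs (6 and 7 in the nvidia-smi output above)
# so NCCL buffers are only allocated on idle devices. This must be set before
# CUDA is initialized in the process, hence before importing anything that
# touches the GPU.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5"

import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="gpu", devices=6)  # only the idle GPUs
```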