@TeddLi Your logs suggest that 6 of the 8 processes have resumed the data loop, but the other two haven't. Your script got stuck somewhere while reading the data, and then the barrier timed out. The problem is not with the barrier or with Fabric. You should investigate the data reading in this loop: https://github.com/jzhang38/TinyLlama/blob/11a02ce085c1670bd009e6d4385701ff06a7f6cf/pretrain/tinyllama.py#L198-L208 and check why the for-loop isn't progressing on that rank.
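For reference, the resume path in that loop looks roughly like this (a paraphrase, not a verbatim copy of the linked code, and the variable names are approximate): every rank re-iterates over batches it has already seen and skips them one by one until it catches up with the saved iteration, and only then reaches a barrier. If the data reading on one rank stalls or is much slower, the other ranks sit at that barrier until the collective times out.

```python
# Rough paraphrase of the resume logic in the linked loop (names approximate)
for train_data in train_dataloader:
    if resume:
        if curr_iter < initial_iter:
            # skip batches that were already consumed before the checkpoint
            curr_iter += 1
            continue
        # caught up with the saved iteration: stop skipping and sync all ranks
        resume = False
        fabric.barrier()  # a rank still stuck in the skipping phase blocks everyone here
    # ... regular training step ...
```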
The logic to resume training in the original TinyLlama is quite expensive; in our version we replaced it by loading the state of the dataloader directly: https://github.com/Lightning-AI/lit-gpt/blob/00defdee53f9b19511057a51499e23af2b1558a3/pretrain/tinyllama.py#L112 (but this only works with the streaming dataset, since our data processing is different from theirs).
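A rough sketch of what that looks like (names are approximate; it assumes a streaming dataloader whose position can be saved and restored through its state dict):

```python
# Sketch: include the dataloader in the checkpointed state so resuming
# restores its position directly instead of replaying already-seen batches.
state = {
    "model": model,
    "optimizer": optimizer,
    "train_dataloader": train_dataloader,  # streaming dataloader with state_dict support
    "iter_num": 0,
    "step_count": 0,
}
if resume:
    fabric.load(resume, state)  # restores model, optimizer, and dataloader position in one call
```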
@TeddLi Any luck there investigating this?
Nope, I didn't find out why. If I set the step to 20000, then it works. But if I set it a bit longer, e.g. 200000 steps, then it freezes. I suspect it might be a GPU sync issue...
It might just be that resuming the data takes more than 30 minutes, and some processes are slower than the others and hit the timeout at the 30-minute mark (the NCCL default). In that case, one option is to increase the timeout to something higher:
```python
from datetime import timedelta

from lightning.fabric.strategies import FSDPStrategy

# configure the timeout in the FSDPStrategy
strategy = FSDPStrategy(
    timeout=timedelta(minutes=120),  # default is 30
    ...
)
```
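For completeness, a minimal sketch of wiring that strategy into Fabric (the accelerator and device count here are just placeholders):

```python
import lightning as L

fabric = L.Fabric(accelerator="cuda", devices=8, strategy=strategy)
fabric.launch()
```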
The better way would be to reimplement the resuming logic as we did in our version of TinyLlama, as pointed out in the previous comment.
Ah, looking closer at the error, it's actually timing out in all-reduce. The title of this issue misled me into thinking it's the barrier. The dataloader resuming part is actually fine.
If it's failing at all-reduce, that's probably at `.backward()`. Can you confirm that? What changes have you made to the script?
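If it helps to confirm which collective hangs and on which rank, one option (my suggestion, not something already in the script) is to enable the standard NCCL / torch.distributed debug logging before launching:

```python
import os

# Verbose distributed logging to help pinpoint which rank/collective hangs.
# These must be set before the process group is initialized (i.e. before fabric.launch()).
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logs (init, topology, errors)
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks for mismatched collectives across ranks
```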
@awaelchli Hey, thanks for taking a close look at that. Honestly, I just switched machines... The new server provider uses Docker. I suspect it's a hardware issue.
Also, I did try extending the timeout from 30 minutes to 8 hours, but still no luck getting it to run properly. I'm not sure extending the time would solve it anyway.
Is it working on the new server after switching machines, or do you still see the issue?
The issue is gone. I'm still holding onto the machine that had the issue, though. If I just train from scratch, it doesn't hit any issue.
@awaelchli If you want to look into it, I can provide the info you need. Just close the ticket for now.
Ok thanks. If it happens again in the future, let me know and we can do some debugging :)
Bug description
I hit this when resuming from my checkpoint:
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
More info
No response