axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/

Train FAILED. Crashed while training with SIGTERM #1670

Open · RodriMora opened this issue 3 months ago

RodriMora commented 3 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

The fine-tuning process completes without errors or crashes

Current behaviour

The process stops with SIGTERM errors

Steps to reproduce

I run the advanced docker command provided in the docs:

docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

I get into the container just fine. Then:

CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

Output: preprocess.txt

Then:

accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml --deepspeed deepspeed_configs/zero1.json

[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
[2024-05-29 08:24:13,026] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 812 closing signal SIGTERM
[2024-05-29 08:24:13,291] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 813) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

axolotl.cli.train FAILED

Failures:
[1]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 814)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 814
[2]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 815)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 815

Root Cause (first observed failure):
[0]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 813)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 813

Full output here: train.txt

My system:

Ubuntu 22.04, AMD EPYC 7402, 512 GB RAM, 4x RTX 3090


Config yaml

The default examples/openllama-3b/lora.yml provided in the repo

Possible solution

No response

Which Operating Systems are you using?

Linux

Python Version

Python 3.10.14 - The one inside the docker image

axolotl branch-commit

main/49b967b


winglian commented 3 months ago

@RodriMora I believe this is fixed by #1676. Was the timeout happening at the end of an epoch or at the end of training?
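If it happens again, rerunning with the standard NCCL / torch.distributed debug variables should show which collective is stalling. This is only a sketch using stock PyTorch/NCCL knobs, nothing axolotl-specific:

export NCCL_DEBUG=INFO                  # per-rank NCCL init and collective logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra torch.distributed diagnostics
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml --deepspeed deepspeed_configs/zero1.json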

shopigarner commented 3 months ago

Seeing the same behavior. The timeout happens at the end of training; it seems to just hang at the last step, and sometimes the error OP posted appears. This error doesn't happen every time, though, and I don't know what's different about the fine-tunes that work versus the ones that don't.

I've tried several docker images, and they all seem to do the same thing: freeze at the last step of training. I'd be happy to try anything to see if we can fix this.

shopigarner commented 3 months ago

False alarm! The newer image winglian/axolotl:main-20240610-py3.10-cu118-2.1.2 indeed fixes the issue 🥳

psimm commented 3 months ago

I'm still getting what I think is the same issue using the Docker image winglian/axolotl:main-20240616-py3.11-cu121-2.2.2

https://hub.docker.com/layers/winglian/axolotl/main-20240616-py3.11-cu121-2.2.2/images/sha256-81e9b559535e35e580cc0dbb43b92c2ea89a434ba3880a735360714b8182f7fd?context=explore

The error occurs at the end of training.

I'm using the Modal llm-finetuning repo with this updated Docker image on a single H100.

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2263, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=300000) ran for 300189 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f26ddd81d87 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f269309c6e6 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f269309fc3d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f26930a0839 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f26de2b1bf4 in /root/miniconda3/envs/py3.11/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f26df694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f26df726a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2024-06-16 13:11:36,589] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 28 closing signal SIGTERM
[2024-06-16 13:11:36,808] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 29) of binary: /root/miniconda3/envs/py3.11/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
===================================================
axolotl.cli.train FAILED
---------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-16_13:11:36
  host      : localhost
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 29)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 29
===================================================
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 487, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 239, in run_input_sync
    res = finalized_function.callable(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/src/train.py", line 36, in train
    run_cmd(cmd, run_folder)
  File "/root/src/train.py", line 183, in run_cmd
    exit(exit_code)
  File "<frozen _sitebuiltins>", line 26, in __call__
SystemExit: 1
RodriMora commented 3 months ago

To be honest, I don't know what I'm doing wrong. I just tried a bunch of versions of the docker image; with winglian/axolotl:main-20240610-py3.11-cu121-2.3.0:

(I had to change ${PWD} and ${HOME} in the README command to $PWD and $HOME to get it to work)

docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src=$PWD,target=/workspace/axolotl -v $HOME/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-20240610-py3.11-cu121-2.3.0

Then, once the image downloads and I'm in the workspace inside the container as root, I run:

accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

I get these errors, as if axolotl was not installed:

[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.lan]:29500 (errno: 97 - Address family not supported by protocol).
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
E0616 17:46:39.157000 127746857011008 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 83) of binary: /root/miniconda3/envs/py3.11/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 84)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 85)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 86)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 83)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
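One thing worth checking for the ModuleNotFoundError: the docker command bind-mounts $PWD over /workspace/axolotl, so if the host directory isn't an axolotl checkout, the image's installed copy gets hidden. A quick sketch of a check inside the container, assuming the image installs axolotl in editable mode from /workspace/axolotl:

cd /workspace/axolotl
python -c "import axolotl; print(axolotl.__file__)"   # fails if the mount hid the install
pip install -e .   # only restores it if the mounted directory really is an axolotl checkout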
psimm commented 2 months ago

In my case, the issue disappeared when I removed the hub_model_id setting.
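For reference, a minimal sketch of that change (the config path is only a placeholder): commenting out hub_model_id should skip the push to the Hugging Face Hub at the end of training.

grep -n "hub_model_id" my-config.yml                      # my-config.yml is a placeholder path
sed -i 's/^hub_model_id:/# hub_model_id:/' my-config.yml  # comment the setting out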

winglian commented 2 months ago

@psimm Is your docker container not configured to have access to the external internet?

psimm commented 1 month ago

@winglian The Docker container has access to the external internet. I experimented more and noticed three things:

  1. The upload to HF fails with large files (~10GB)
  2. The upload succeeds with smaller files (~1.8GB)
  3. The upload also works for the large files when I increase AXOLOTL_NCCL_TIMEOUT

I think the issue is that the large upload took longer than the previous timeout setting, which was just 60 (see https://github.com/modal-labs/llm-finetuning/blob/main/src/common.py)
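A minimal sketch of that workaround, assuming the AXOLOTL_NCCL_TIMEOUT variable from the linked common.py is honored in your setup (3600 is an arbitrary illustrative value, and config.yml is a placeholder path):

export AXOLOTL_NCCL_TIMEOUT=3600
accelerate launch -m axolotl.cli.train config.yml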