axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/

NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #1473

Open fwangut opened 3 months ago

fwangut commented 3 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Training finishes successfully.

Current behaviour

The training runs smoothly for the first ~90% of the updates, then breaks all of a sudden and throws the following error.


{'loss': 0.1146, 'grad_norm': 0.7724757844977304, 'learning_rate': 2.49474072871132e-07, 'epoch': 3.7}
{'loss': 0.1193, 'grad_norm': 0.7588049837461633, 'learning_rate': 2.4083238061252565e-07, 'epoch': 3.7}
{'loss': 0.1162, 'grad_norm': 0.7664973134640671, 'learning_rate': 2.3234118679127615e-07, 'epoch': 3.71}
{'loss': 0.1171, 'grad_norm': 0.7643471074230774, 'learning_rate': 2.2400062235209407e-07, 'epoch': 3.71}
{'loss': 0.1218, 'grad_norm': 0.7608065976489313, 'learning_rate': 2.1581081591680042e-07, 'epoch': 3.72}
{'loss': 0.1164, 'grad_norm': 0.739587746546741, 'learning_rate': 2.077718937823414e-07, 'epoch': 3.72}
{'loss': 0.1156, 'grad_norm': 0.7384305965058745, 'learning_rate': 1.9988397991884455e-07, 'epoch': 3.73}
 94%|██████████████████████████████▉ | 837/888 [9:42:50<34:08, 40.17s/it]
[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025847130/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8d2e380d87 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d2e33175f in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d2e7868a8 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f8ce263119c in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f8ce26352b8 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f8ce26389ea in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8ce2639629 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f8d2dec7bf4 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8d36e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a40 (0x7f8d36f26a40 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-04-01 08:51:29,802] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488212 closing signal SIGTERM
[2024-04-01 08:51:29,802] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488213 closing signal SIGTERM
[2024-04-01 08:51:29,803] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488214 closing signal SIGTERM
[2024-04-01 08:51:29,804] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488215 closing signal SIGTERM
[2024-04-01 08:51:29,805] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488216 closing signal SIGTERM
[2024-04-01 08:51:29,805] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488217 closing signal SIGTERM
[2024-04-01 08:51:29,807] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488218 closing signal SIGTERM
[2024-04-01 08:51:31,677] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 7 (pid: 488219) of binary: /export/home/global_conda/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/export/home/global_conda/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
axolotl.cli.train FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-01_08:51:29
  host      : xxxx
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 488219)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 488219
=======================================================

Steps to reproduce

Default setup, then run the default training command with the yaml given below:

accelerate launch -m axolotl.cli.train train_config.yml --deepspeed deepspeed_configs/zero3_bf16.json
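As a debugging aid (not part of the original report), the same command can be rerun with the environment variables the error output itself suggests, so that the stack trace points at the kernel that actually faulted and NCCL logs which collective each rank was in when the watchdog fired. A minimal sketch, assuming an environment where these standard CUDA/NCCL variables are honored:

CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
  accelerate launch -m axolotl.cli.train train_config.yml --deepspeed deepspeed_configs/zero3_bf16.json

CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous (slower, but the reported crash location is then accurate), and NCCL_DEBUG=INFO prints NCCL's own diagnostics, which helps correlate the failing rank with a specific collective.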

Config yaml

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
tokenizer_config: NousResearch/Hermes-2-Pro-Mistral-7B
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: OpenHermes-2.5-Mistral-7B
    field_human: human
    field_model: gpt
    roles:
      input: [system, human]
      output: [gpt]
    train_on_split: train

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

gradient_accumulation_steps: 16
micro_batch_size: 4
num_epochs: 4
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002
warmup_steps: 88

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

output_dir: debug

val_set_size: 0
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  pad_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_end|>"
  - "<|im_start|>"

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

maziyarpanahi commented 3 months ago

I just experienced this with this model for both the local and Docker installations. Did you find any workaround? @fwangut