fwangut opened this issue 8 months ago
Please check that this issue hasn't been reported before.

Expected Behavior

Finish training successfully.

Current behaviour

The training runs smoothly for the first ~90% of the updates, then breaks all of a sudden and throws the following error.
{'loss': 0.1146, 'grad_norm': 0.7724757844977304, 'learning_rate': 2.49474072871132e-07, 'epoch': 3.7}
{'loss': 0.1193, 'grad_norm': 0.7588049837461633, 'learning_rate': 2.4083238061252565e-07, 'epoch': 3.7}
{'loss': 0.1162, 'grad_norm': 0.7664973134640671, 'learning_rate': 2.3234118679127615e-07, 'epoch': 3.71}
{'loss': 0.1171, 'grad_norm': 0.7643471074230774, 'learning_rate': 2.2400062235209407e-07, 'epoch': 3.71}
{'loss': 0.1218, 'grad_norm': 0.7608065976489313, 'learning_rate': 2.1581081591680042e-07, 'epoch': 3.72}
{'loss': 0.1164, 'grad_norm': 0.739587746546741, 'learning_rate': 2.077718937823414e-07, 'epoch': 3.72}
{'loss': 0.1156, 'grad_norm': 0.7384305965058745, 'learning_rate': 1.9988397991884455e-07, 'epoch': 3.73}
 94%|███████████████████████████████▉ | 837/888 [9:42:50<34:08, 40.17s/it]
[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025847130/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8d2e380d87 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d2e33175f in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d2e7868a8 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f8ce263119c in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f8ce26352b8 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f8ce26389ea in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8ce2639629 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f8d2dec7bf4 in /export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8d36e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a40 (0x7f8d36f26a40 in /lib/x86_64-linux-gnu/libc.so.6)
[2024-04-01 08:51:29,802] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488212 closing signal SIGTERM
[2024-04-01 08:51:29,802] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488213 closing signal SIGTERM
[2024-04-01 08:51:29,803] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488214 closing signal SIGTERM
[2024-04-01 08:51:29,804] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488215 closing signal SIGTERM
[2024-04-01 08:51:29,805] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488216 closing signal SIGTERM
[2024-04-01 08:51:29,805] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488217 closing signal SIGTERM
[2024-04-01 08:51:29,807] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 488218 closing signal SIGTERM
[2024-04-01 08:51:31,677] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 7 (pid: 488219) of binary: /export/home/global_conda/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/export/home/global_conda/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/export/home/global_conda/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
axolotl.cli.train FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-04-01_08:51:29
  host       : xxxx
  rank       : 7 (local_rank: 7)
  exitcode   : -6 (pid: 488219)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 488219
=======================================================
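The trace itself suggests re-running with CUDA_LAUNCH_BLOCKING=1 so the illegal access is attributed to the kernel that actually faulted. A minimal sketch of such a diagnostic re-run, reusing the launch command from the steps below with standard PyTorch/NCCL environment variables (an illustration, not something that was run for this report):

# Synchronous CUDA error reporting plus verbose NCCL logs; this slows training
# down noticeably, so use it only to reproduce the crash.
# CUDA_LAUNCH_BLOCKING=1 makes the stack trace point at the faulting kernel;
# NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL print communicator and collective activity.
CUDA_LAUNCH_BLOCKING=1 \
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=COLL \
accelerate launch -m axolotl.cli.train train_config.yml \
    --deepspeed deepspeed_configs/zero3_bf16.json 2>&1 | tee debug_run.log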
Steps to reproduce

Default setup, then run the default training command with the yaml given below:
accelerate launch -m axolotl.cli.train train_config.yml --deepspeed deepspeed_configs/zero3_bf16.json
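For completeness, a rough end-to-end sketch of that reproduction; the repository URL and install extras here are assumptions taken from the axolotl README of that period, not details given in this report:

# Assumed standard install (adjust the URL and extras to your environment).
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'

# Save the "Config yaml" section below as train_config.yml, then launch as above.
# The log shows 8 local ranks on one node, with rank 7 being the one that crashes.
accelerate launch -m axolotl.cli.train train_config.yml \
    --deepspeed deepspeed_configs/zero3_bf16.json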
Config yaml

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
tokenizer_config: NousResearch/Hermes-2-Pro-Mistral-7B
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: OpenHermes-2.5-Mistral-7B
    field_human: human
    field_model: gpt
    roles:
      input: [system, human]
      output: [gpt]
    train_on_split: train

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

gradient_accumulation_steps: 16
micro_batch_size: 4
num_epochs: 4
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002
warmup_steps: 88

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

output_dir: debug
val_set_size: 0
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  pad_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_end|>"
  - "<|im_start|>"
Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements
I just experienced this with this model, for both the local and the Docker installation. Did you find any workaround? @fwangut