axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

mistral deep speed zero bug axolotl.cli.train FAILED 8XA100 #891

Closed: manishiitg closed this issue 8 months ago

manishiitg commented 11 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Training should launch and run on 8x A100 GPUs without the launcher crashing.

Current behaviour

[2023-11-24 06:23:08,281] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 93 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 95 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 96 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 97 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 98 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 99 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 94) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
==================================================
axolotl.cli.train FAILED
--------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-24_06:23:18
  host      : b97032f8758d
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 94)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 94
==================================================

Steps to reproduce

I have tried DeepSpeed ZeRO-1, ZeRO-2, and ZeRO-3; all of them hit this error.

!docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch -m axolotl.cli.train /sky_workdir/hi-qlora.yaml --deepspeed /sky_workdir/zero1.json
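
The command points --deepspeed at a zero1.json in the mounted workdir that isn't included in the report. For anyone reproducing, a minimal ZeRO-1 config along the lines of axolotl's bundled DeepSpeed configs can be dropped there first (a sketch with assumed contents, not the reporter's actual file):

# Hypothetical minimal ZeRO-1 config; "auto" values are resolved by the
# Hugging Face Trainer / axolotl integration at launch time.
cat > ~/sky_workdir/zero1.json <<'EOF'
{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
EOF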

Config yaml

base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:

hub_model_id: manishiitg/Mistral-databricks-databricks-dolly-15k-hi
hf_use_auth_token: true

dataset_prepared_path:
val_set_size: 0.05
output_dir: /sky-notebook/manishiitg/Mistral-databricks-databricks-dolly-15k-hi

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 8
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage check point resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 100 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed: /sky_workdir/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: ""
  eos_token: ""
  unk_token: ""

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

manishiitg commented 10 months ago

This error doesn't occur when I use 4x A100 GPUs; it only appears when I use 8x A100 GPUs.
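
One way to verify that observation on the same 8-GPU box is to expose only four devices to the container and launch four processes (a sketch; the device indices and the --num_processes override are illustrative, not from the original report):

# Restrict the container to 4 of the 8 A100s (illustrative device selection).
docker run --gpus '"device=0,1,2,3"' \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch --num_processes 4 -m axolotl.cli.train \
    /sky_workdir/hi-qlora.yaml --deepspeed /sky_workdir/zero1.json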

NanoCode012 commented 9 months ago

Could you try upgrading DeepSpeed as described in the FAQ? https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/faq.md
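
Inside the prebuilt image that usually means upgrading the pip package before launching (a sketch; check the FAQ for any pinned version it recommends):

# Upgrade DeepSpeed in place and confirm which version gets picked up.
pip install --upgrade deepspeed
pip show deepspeed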

monk1337 commented 8 months ago

Try the Docker image with these args:

sudo docker run --gpus '"all"' --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1
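
Exit code -7 corresponds to SIGBUS, which in containerized multi-GPU training is commonly caused by the container's default 64 MB /dev/shm filling up under NCCL and dataloader workers; --ipc=host lifts that limit and the ulimit flags remove the locked-memory and stack caps. Applied to the original reproduce command, that looks roughly like this (a sketch combining the two commands in this thread):

# Same reproduce command as above, with host IPC and unlimited memlock/stack
# so the worker processes are not starved of shared memory (the likely SIGBUS cause).
docker run --gpus all --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /sky-notebook:/sky-notebook \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch -m axolotl.cli.train /sky_workdir/hi-qlora.yaml --deepspeed /sky_workdir/zero1.json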

manishiitg commented 8 months ago

The above worked :)