axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

DeepSpeed ZeRO-2 config: Mistral 7B Instruct v0.2 fine-tuning fails to start on a single 3090 #969

Closed: risedangel closed this issue 9 months ago

risedangel commented 11 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I would expect fine-tuning to start without a problem.

Current behaviour

It fails and spits out an error. Here is the output:

                            dP            dP   dP 
                             88            88   88 
  .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
  88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
  88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
  `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 

[2023-12-16 22:47:07,603] [WARNING] [axolotl.scripts.check_user_token:358] [PID:441] [RANK:0] Error verifying HuggingFace token. Remember to log in using huggingface-cli login and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2023-12-16 22:47:07,841] [DEBUG] [axolotl.load_tokenizer:167] [PID:441] [RANK:0] EOS: 2 / </s>
[2023-12-16 22:47:07,841] [DEBUG] [axolotl.load_tokenizer:168] [PID:441] [RANK:0] BOS: 1 / <s>
[2023-12-16 22:47:07,841] [DEBUG] [axolotl.load_tokenizer:169] [PID:441] [RANK:0] PAD: 2 / </s>
[2023-12-16 22:47:07,841] [DEBUG] [axolotl.load_tokenizer:170] [PID:441] [RANK:0] UNK: 0 / <unk>
[2023-12-16 22:47:07,841] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:441] [RANK:0] Loading prepared dataset from disk at last_run_prepared/f98d5b0b00654992f42fe72d04d0e1f1...
[2023-12-16 22:47:07,843] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:441] [RANK:0] Prepared dataset loaded from disk...
Filter (num_proc=12): 100%|█████| 49318/49318 [00:01<00:00, 37641.02 examples/s]
Filter (num_proc=12): 100%|███████| 2596/2596 [00:00<00:00, 14676.47 examples/s]
Map (num_proc=12): 100%|████████| 49318/49318 [00:01<00:00, 31290.48 examples/s]
Map (num_proc=12): 100%|██████████| 2596/2596 [00:00<00:00, 10745.16 examples/s]
[2023-12-16 22:47:11,573] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_num_tokens: 583433
[2023-12-16 22:47:11,586] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_supervised_tokens: 355619
[2023-12-16 22:47:14,426] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:441] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 583433
[2023-12-16 22:47:14,426] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] data_loader_len: 34
[2023-12-16 22:47:14,426] [INFO] [axolotl.log:60] [PID:441] [RANK:0] sample_packing_eff_est across ranks: [0.9891645643446181]
[2023-12-16 22:47:14,426] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] sample_packing_eff_est: None
[2023-12-16 22:47:14,426] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_num_steps: 34
[2023-12-16 22:47:14,454] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_num_tokens: 10877620
[2023-12-16 22:47:14,671] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_supervised_tokens: 6550004
[2023-12-16 22:47:14,776] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:441] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 10877620
[2023-12-16 22:47:14,777] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] data_loader_len: 656
[2023-12-16 22:47:14,777] [INFO] [axolotl.log:60] [PID:441] [RANK:0] sample_packing_eff_est across ranks: [0.9894444654666542]
[2023-12-16 22:47:14,777] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] sample_packing_eff_est: 0.99
[2023-12-16 22:47:14,777] [DEBUG] [axolotl.log:60] [PID:441] [RANK:0] total_num_steps: 656
[2023-12-16 22:47:14,781] [DEBUG] [axolotl.train.log:60] [PID:441] [RANK:0] loading tokenizer... mistralai/Mistral-7B-Instruct-v0.2
[2023-12-16 22:47:15,026] [DEBUG] [axolotl.load_tokenizer:167] [PID:441] [RANK:0] EOS: 2 / </s>
[2023-12-16 22:47:15,026] [DEBUG] [axolotl.load_tokenizer:168] [PID:441] [RANK:0] BOS: 1 / <s>
[2023-12-16 22:47:15,026] [DEBUG] [axolotl.load_tokenizer:169] [PID:441] [RANK:0] PAD: 2 / </s>
[2023-12-16 22:47:15,026] [DEBUG] [axolotl.load_tokenizer:170] [PID:441] [RANK:0] UNK: 0 / <unk>
[2023-12-16 22:47:15,026] [DEBUG] [axolotl.train.log:60] [PID:441] [RANK:0] loading model and peft_config...
[2023-12-16 22:47:15,187] [INFO] [axolotl.load_model:250] [PID:441] [RANK:0] patching with flash attention
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:07<00:00, 2.55s/it]
[2023-12-16 22:47:24,147] [INFO] [axolotl.load_model:505] [PID:441] [RANK:0] GPU memory usage after model load: 4.343GB (+0.114GB cache, +0.891GB misc)
[2023-12-16 22:47:24,151] [INFO] [axolotl.load_model:528] [PID:441] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-12-16 22:47:24,153] [INFO] [axolotl.load_model:540] [PID:441] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2023-12-16 22:47:24,155] [INFO] [axolotl.load_lora:643] [PID:441] [RANK:0] found linear modules: ['q_proj', 'gate_proj', 'v_proj', 'up_proj', 'down_proj', 'o_proj', 'k_proj']
[2023-12-16 22:47:24,169] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.<module>:16] [PID:441] CUDA extension not installed.
[2023-12-16 22:47:24,170] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.<module>:15] [PID:441] CUDA extension not installed.
trainable params: 83,886,080 || all params: 7,325,618,176 || trainable%: 1.1451058188485088
[2023-12-16 22:47:24,635] [INFO] [axolotl.load_model:570] [PID:441] [RANK:0] GPU memory usage after adapters: 4.668GB (+0.914GB cache, +0.891GB misc)
[2023-12-16 22:47:24,658] [INFO] [axolotl.train.log:60] [PID:441] [RANK:0] Pre-saving adapter config to ./qlora-out
[2023-12-16 22:47:24,660] [INFO] [axolotl.train.log:60] [PID:441] [RANK:0] Starting trainer...
[2023-12-16 22:47:24,864] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:441] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 10877620
[2023-12-16 22:47:24,879] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:441] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 10877620
[2023-12-16 22:47:24,886] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  0%|          | 0/165 [00:00<?, ?it/s]
[2023-12-16 22:47:26,481] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:441] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 10877620
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1857, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2733, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/axolotl/src/axolotl/core/trainer_builder.py", line 291, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2756, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/peft/peft_model.py", line 977, in forward
    return self.base_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 106, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1053, in forward
    outputs = self.model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 489, in mistral_model_forward
    self._prepare_decoder_attention_mask(  # pylint: disable=protected-access
  File "/workspace/axolotl/src/axolotl/monkeypatch/mistral_attn_hijack_flash.py", line 103, in _prepare_decoder_attention_mask
    sliding_window_mask = _make_sliding_window_causal_mask(
RuntimeError: _make_sliding_window_causal_mask() Expected a value of type 'int' for argument 'sliding_window' but instead found type 'NoneType'.
Position: 5
Value: None
Declaration: _make_sliding_window_causal_mask(int bsz, int tgt_len, int dtype, Device device, int past_key_values_length=0, int sliding_window=4096) -> Tensor
Cast error details: Unable to cast Python instance to C++ type (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
  0%|          | 0/165 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python3', '-m', 'axolotl.cli.train', 'examples/mistral/qloraEdited.yml', '--deepspeed', 'deepspeed/zero2.json']' returned non-zero exit status

Steps to reproduce

Start training with a slightly edited QLoRA config; I only changed the model and the dataset. The base model was Mistral anyway.

Config yaml

No response
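
Since the config was not attached, here is a hypothetical sketch of the kind of edit described in the steps above, assuming the stock examples/mistral/qlora.yml with only the model and dataset swapped. The dataset path is a placeholder, not the reporter's actual dataset; the output dir and enabled features are taken from the log.

```yaml
# Hypothetical reconstruction -- the actual qloraEdited.yml was not attached.
# Stock examples/mistral/qlora.yml with only the model and dataset changed.
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: qlora

datasets:
  - path: my-org/my-dataset    # placeholder dataset
    type: alpaca

output_dir: ./qlora-out        # matches "Pre-saving adapter config to ./qlora-out"
flash_attention: true          # log shows "patching with flash attention"
sample_packing: true           # log shows the multipack sampler
```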

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

docker

Acknowledgements

Nondzu commented 11 months ago

I also have this issue with Mistral 7B Instruct v0.2; there is no problem with the v0.1 model.

NanoCode012 commented 10 months ago

May I ask if you could try running with plain accelerate (without DeepSpeed), since you only have a single GPU?
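
That would mean dropping the --deepspeed argument from the command shown in the log, e.g.:

```bash
accelerate launch -m axolotl.cli.train examples/mistral/qloraEdited.yml
```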

nirogu commented 10 months ago

I have the exact same issue with Instruct v0.2. In my case, it happened using the code from the repo (not Docker) and the exact QLoRA example config with accelerate (no DeepSpeed or any other edits to the yml). So it looks like the one constant that breaks everything is the model version, since the error keeps happening even when everything else is different.
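
One way to check that suspicion directly is to compare the sliding_window field of the two published configs:

```python
# Compare the sliding_window setting of the two model versions.
from transformers import AutoConfig

for model_id in ("mistralai/Mistral-7B-Instruct-v0.1",
                 "mistralai/Mistral-7B-Instruct-v0.2"):
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, "->", cfg.sliding_window)

# Based on the published config.json files: v0.1 prints 4096, while v0.2
# prints None -- the same NoneType the traceback complains about.
```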

NanoCode012 commented 10 months ago

Hi @Nirogu @risedangel @Nondzu. I encountered this issue previously, fixed it, and closed the issue. I wasn't aware there was a duplicate. Please see https://github.com/OpenAccess-AI-Collective/axolotl/issues/1047
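
For context, the override discussed there amounts to pinning sliding_window back to an integer through the YAML's model-config overrides. A rough reconstruction (not copied verbatim from #1047):

```yaml
# Rough reconstruction of the workaround from #1047 -- the exact snippet
# there is not reproduced here. Force sliding_window back to an int so the
# scripted mask helper no longer receives None.
model_config:
  sliding_window: 4096
```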

DreamGenX commented 9 months ago

> Hi @Nirogu @risedangel @Nondzu. I encountered this issue previously, fixed it, and closed the issue. I wasn't aware there was a duplicate. Please see #1047

While this makes the training work, it effectively re-adds the sliding window, which is undesirable.
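
The underlying mismatch: v0.1's config.json sets "sliding_window": 4096, while v0.2 sets it to null, and the TorchScript helper (see the Declaration line in the traceback) only accepts an int. A minimal sketch of guarding for the None case instead of re-adding the window; this is an illustration, not the actual PR, and it assumes the names from the traceback (config, bsz, tgt_len, dtype, device, past_key_values_length) are in scope inside _prepare_decoder_attention_mask:

```python
# Illustrative guard (not the actual patch): only build the sliding-window
# causal mask when the config actually specifies an integer window.
sliding_window = getattr(config, "sliding_window", None)  # None for v0.2

if sliding_window is not None:
    # v0.1 path: sliding-window attention as before
    sliding_window_mask = _make_sliding_window_causal_mask(
        bsz,
        tgt_len,
        dtype,
        device,
        past_key_values_length=past_key_values_length,
        sliding_window=sliding_window,
    )
else:
    # v0.2 path: no window configured, fall back to plain causal masking
    sliding_window_mask = None
```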

NanoCode012 commented 9 months ago

Thanks to Dream's PR, you don't need my prior override anymore. You can use it as-is.