axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Unable to finetune Mistral-7B with DeepSpeed Zero3 #1138

Open mgoulao opened 8 months ago

mgoulao commented 8 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I'm currently testing whether I can fit this sequence_len, so at the very least I would expect an OOM error. I have also tried setting sample_packing to false, but that just produces a different error.

My setup is not ideal, since I'm using PyTorch built with CUDA 11.7 and bitsandbytes with CUDA 12.2 (the CUDA version reported by my GPU driver).

Here is my pip list (filtered):

deepspeed                 0.12.6
flash-attn                2.3.3
optimum                   1.13.2
safetensors               0.4.1
tokenizers                0.15.0
torch                     2.0.1+cu117
xformers                  0.0.22
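
Since the mismatch between the PyTorch CUDA build (11.7) and the driver/bitsandbytes CUDA version (12.2) could itself be a factor, a quick sanity check along these lines (a sketch, not part of the original report) shows which CUDA toolkit each component sees:

```python
# Sketch of a CUDA-version sanity check for the PyTorch / bitsandbytes mismatch
# described above; prints the toolkit each library was built against.
import torch

print("torch version:        ", torch.__version__)    # e.g. 2.0.1+cu117
print("torch built with CUDA:", torch.version.cuda)   # e.g. 11.7
print("device:               ", torch.cuda.get_device_name(0))

try:
    import bitsandbytes as bnb
    print("bitsandbytes version: ", bnb.__version__)
except ImportError:
    print("bitsandbytes is not installed in this environment")
```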

Current Behavior

I get the following errors.

With sample_packing=true:

...
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [8926,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175956 closing signal SIGTERM                                                                                  
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175957 closing signal SIGTERM                                                                                  
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 175958 closing signal SIGTERM                                                                                  
wandb: WARNING No program path found, not creating job artifact. See https://docs.wandb.ai/guides/launch/create-job                                                                  
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 175959) of binary: /.../.venv/bin/python                     
Traceback (most recent call last):                                                                                                                                                   
  File "/..../.venv/bin/accelerate", line 8, in <module>                                                                                                    
    sys.exit(main())                                                                                                                                                                 
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main                                                   
    args.func(args)                                                                                                                                                                  
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command                                               
    deepspeed_launcher(args)                                                                                                                                                         
  File "/.../.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher                                            
    distrib_run.run(args)                                                                                                                                                            
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run                                                                
    elastic_launch(                                                                                                                                                                  
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__                                                  
    return launch_agent(self._config, self._entrypoint, list(args))                                                                                                                  
  File "/.../.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent                                              
    raise ChildFailedError(                                                                                                                                                          
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
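
Device-side asserts like the ScatterGatherKernel one above are often triggered by an index that falls outside an embedding or gather dimension. One way to rule out a tokenizer/vocab mismatch (a hypothetical check, not something the report confirms) is to compare the tokenizer's size against the model's vocab_size before training:

```python
# Hypothetical sanity check for the "index out of bounds" assert above: make sure
# no token id produced by the tokenizer can index past the model's embedding table.
from transformers import AutoConfig, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

print("len(tokenizer):  ", len(tokenizer))
print("model vocab_size:", config.vocab_size)
assert len(tokenizer) <= config.vocab_size, (
    "tokenizer produces ids beyond the embedding table; check special_tokens / resize"
)
```

Rerunning with `CUDA_LAUNCH_BLOCKING=1` also makes the Python stack trace point at the operation that actually fired the assert, rather than at a later synchronization point.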

With sample_packing=false:

Traceback (most recent call last):                                                        
  File "/opt/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/.../axolotl/src/axolotl/cli/train.py", line 42, in <module>
    fire.Fire(do_cli)            
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/.../.venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/.../axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/.../axolotl/src/axolotl/train.py", line 142, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(                                                                                                                                                      
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/.../.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2746, in training_step
    self.accelerator.backward(loss)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)                                                       
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/.../.venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/.../.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: The size of tensor a (0) must match the size of tensor b (14336) at non-singleton dimension 1
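
The size-0 tensor in this error looks characteristic of ZeRO-3: parameters are partitioned across ranks and report an empty local shape unless they are explicitly gathered, and 14336 matches Mistral-7B's MLP intermediate size. A minimal sketch of how to see this (assuming DeepSpeed's `deepspeed.zero.GatheredParameters` context manager and a hypothetical parameter name chosen for illustration) would be:

```python
# Sketch: under ZeRO-3 a partitioned parameter shows an empty local shape;
# GatheredParameters temporarily reassembles it so the full shape is visible.
import deepspeed

def show_partitioned_shape(model, name="model.layers.0.mlp.gate_proj.weight"):
    # "name" is a hypothetical Mistral parameter used only for illustration
    param = dict(model.named_parameters())[name]
    print("local (partitioned) shape:", tuple(param.shape))  # typically (0,) under ZeRO-3
    with deepspeed.zero.GatheredParameters([param], modifier_rank=None):
        print("gathered shape:", tuple(param.shape))         # e.g. (14336, 4096)
```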

Steps to reproduce

I have created an accelerate config file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3_bf16.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
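
The accelerate config above points at deepspeed/zero3_bf16.json. For readers without the repo checked out, a typical ZeRO-3 bf16 DeepSpeed config looks roughly like this (a sketch built from standard DeepSpeed options, not necessarily identical to the file shipped with axolotl):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```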

And I'm running the following command:

accelerate launch --config_file accelerate_config/multi_gpu_config.yaml -m axolotl.cli.train mistral-7b-instruct-v0.2-mullti-gpu.yaml

### Config yaml

```yaml
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_type: MistralForCausalLM
model_config:
  sliding_window: 4096
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: dataset.parquet
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: out

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true
eval_sample_packing: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 0.5
debug:
deepspeed: deepspeed/zero3_bf16.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/d69ba2b0b76fad112acecd5a1fbb339e6244ff7b

mgoulao commented 8 months ago

Update: increasing micro_batch_size to 8 seems to do the trick; however, I'm now wondering whether it should also be possible to use smaller batch sizes.

kuotient commented 8 months ago

I have the same issue when using sample_packing: false and micro_batch_size: 1.

gardner commented 8 months ago

This looks like a dupe of https://github.com/OpenAccess-AI-Collective/axolotl/issues/1092

winglian commented 8 months ago

What GPU are you using?

mgoulao commented 8 months ago

I'm using 4x A100 80GB