axolotl-ai-cloud / axolotl


The error maze of deepspeed + qlora + falcon #207

Open utensil opened 1 year ago

utensil commented 1 year ago

I've been trying to make the combination of DeepSpeed + QLoRA + Falcon work, but for unknown reasons I'm stuck in an error maze.

## Setup

Axolotl config:

push_dataset_to_hub: utensil
hf_use_auth_token: true

datasets:

dataset_prepared_path: last_run_prepared
val_set_size: 0.01

# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:

# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: falcon-qlora
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: /content/axolotl-trained/falcon-qlora-40b-gsm8k/

# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 3

# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true

# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 5
save_steps: 10
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: ">>ABSTRACT<<"
  eos_token: "<|endoftext|>"
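For reference, the adapter settings above correspond roughly to the following PEFT `LoraConfig` (a sketch for orientation only; axolotl builds the real config internally, and the Falcon module names in `target_modules` are just an illustration of what `lora_target_linear: true` expands to):

```python
# Sketch of the QLoRA adapter hyperparameters above in PEFT terms (illustrative;
# axolotl constructs the actual LoraConfig internally).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # lora_r
    lora_alpha=16,         # lora_alpha
    lora_dropout=0.05,     # 0.05 for 33B/65B models per the QLoRA paper
    bias="none",
    task_type="CAUSAL_LM",
    # lora_target_linear: true targets all linear layers of the base model;
    # these Falcon module names are an illustrative expansion, not axolotl's exact list.
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
)
```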

- Environment reported by `ds_report`

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+52907a66, 52907a66, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8



## Errors

| Error | Cause | Solution |
|------|--------|-----------|
| `RuntimeError: CUDA version mismatch! DeepSpeed ops were compiled and installed with a different version than what is being used at runtime. Please re-install DeepSpeed or switch torch versions. Install CUDA version=11.8, Runtime CUDA version=11.7`  + `AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'`(#138) |  `torch 2.0.1` reinstalled for CUDA 11.7 due to unknown reason :question: | `pip3 install -U torch --index-url https://download.pytorch.org/whl/cu118` |
| `RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>` | Training started, then this error during `forward` due to unknown reason ❓  | ❌  |
| `ValueError: Found optimizer configured in the DeepSpeed config, but no scheduler. Please configure a scheduler in the DeepSpeed config.` | This setup needs explicit `optimizer` configs, even though other setups work fine without them | Add the `optimizer` configs as in the final config above |
| Many mismatch errors | The HF and DeepSpeed configs disagree | The axolotl config must not set `optimizer` and `lr_scheduler`; many DeepSpeed config values need to be set to `auto`, and missing DeepSpeed config keys must be added (see the final config above) |
| Compile errors when reinstalling DeepSpeed with `TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed --global-option="build_ext" --global-option="-j8" # --global-option="bdist_wheel"` | Maybe it's not a complete environment for compilation :question: | ❌ |
| `ValueError: Can't find a valid checkpoint at /content/axolotl-trained/falcon-qlora-40b-gsm8k/checkpoint-50` | If I disable DeepSpeed and use just `accelerate` for multiple GPUs, training is normal but resuming from a checkpoint fails (tried each of the latest 3 checkpoints) ❓ | ❌ |
| `ValueError: ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config` | If I run DeepSpeed with 1 A100, training is normal but eval fails with this error, apparently because `is_zero3()` returns false for eval ❓ | ❌ |
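For the first row, a quick way to confirm what the runtime actually sees is to print torch's build-time CUDA version, which should match the 11.8 that `ds_report` reports (a minimal check, nothing axolotl-specific):

```python
# Quick check for the "CUDA version mismatch" row: the CUDA version torch was
# built against should match the one DeepSpeed's ops were compiled with (11.8).
import torch

print("torch:", torch.__version__)             # expect 2.0.1+cu118, not +cu117
print("built with CUDA:", torch.version.cuda)  # expect '11.8'
print("CUDA available:", torch.cuda.is_available())
```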
theobjectivedad commented 1 year ago

Hi @utensil, I'm working with DeepSpeed and am having similar issues. Although my process is still broken, I'll share my current config in case it helps. For testing, I have been able to start the training process on 1 node w/ 3x A6000s under ZeRO 2. Here is my Makefile target:

WORKSPACE_HOST_PATH:=...
MODELS_HOST_PATH:=...
DATA_HOST_PATH:=...
WORK_HOST_PATH:=...

train:
    docker run --gpus='all' -it --rm \
        --volume=$(WORKSPACE_HOST_PATH):/workspace \
        --volume=$(MODELS_HOST_PATH):/models \
        --volume=$(DATA_HOST_PATH):/data \
        --volume=$(WORK_HOST_PATH):/work \
        --volume=$(WORKSPACE_HOST_PATH)/extern/axolotl:/opt/axolotl \
        --env-file=$(CURDIR)/.env \
        --entrypoint=accelerate \
             quay.io/theobjectivedad/axolotl-main:latest \
            launch \
                --config_file /work/accelerate/basic.yaml \
                /opt/axolotl/scripts/finetune.py \
                    /work/atheos/config.yaml

My accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /SET_IN_AXOLOTL_CONFIG.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: "no"
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Deepspeed config:

{
    "optimizer": {
        "type": "auto"
    },
    "scheduler": {
        "type": "auto"
    },
    "activation_checkpointing": {
        "partition_activations": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "auto"
        },
        "offload_param": {
            "device": "auto"
        },
        "allgather_bucket_size": "auto",
        "allgather_bucket_dtype": "auto",
        "dp_bucket_size": "auto",
        "overlap_comm": "auto",
        "contiguous_gradients": "auto",
        "sub_group_size": "auto",
        "reduce_bucket_size": "auto"
    },
    "gradient_clipping": "auto",
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
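As far as I understand, the `"auto"` values above are filled in by the HF Trainer from its `TrainingArguments` at runtime; here is a rough sketch of that mapping (illustrative only, with values taken from the configs in this comment, not axolotl's actual wiring):

```python
# Rough sketch: HF Trainer substitutes "auto" entries in the DeepSpeed config
# with the corresponding TrainingArguments at runtime (illustrative only;
# axolotl assembles these arguments itself).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/work/atheos/output1",
    per_device_train_batch_size=4,    # -> train_micro_batch_size_per_gpu: "auto"
    gradient_accumulation_steps=1,    # -> (with world size) train_batch_size: "auto"
    bf16=True,                        # -> bf16.enabled: "auto"
    learning_rate=5.0e-5,             # -> optimizer/scheduler "auto" params
    deepspeed="/work/accelerate/ds_stage2_auto.json",  # the JSON above
)
```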

Axolotl config:

###############################################################################
# Model
###############################################################################
base_model: /models/llama-7b-hf
base_model_config: /models/llama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

output_dir: /work/atheos/output1

sequence_len: 2048
max_packed_sequence_len: 1024

tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
  pad_token: "<unk>"
special_tokens:

###############################################################################
# Precision & Model loading
###############################################################################

bf16: full
bfloat16: true

fp16: false
float16: false

tf32: true

load_in_8bit: false
load_in_4bit: false

lora_model_dir:

###############################################################################
# Dataset
###############################################################################
datasets:
  - path: /data/GPTeacher/Instruct
    type: gpteacher

dataset_prepared_path: /work/last_run_prepared
val_set_size: 0.02

###############################################################################
# Training
###############################################################################

deepspeed: /work/accelerate/ds_stage2_auto.json

adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
lora_fan_in_fan_out: false

# WanDB configuration
wandb_project: smoketest
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 8
optimizer:
torchdistx_path:

lr_scheduler: cosine
learning_rate: 5.0e-5
train_on_inputs: false
group_by_length: false

early_stopping_patience: 3

auto_resume_from_checkpoints: true
resume_from_checkpoint:

logging_steps: 500
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 20
eval_steps: 500
save_steps: 500
debug: false

weight_decay: 0.1
fsdp:
fsdp_config:
utensil commented 1 year ago

@theobjectivedad Thanks for sharing the configs, I'll give them a try ASAP. BTW, what do you mean by "my process is still broken" if it already works on multiple GPUs?

utensil commented 1 year ago

Oh, I need to run ZeRO 3; this seems to be the config for ZeRO 2.

theobjectivedad commented 1 year ago

Hello @utensil, you are correct - my testing so far has only been with ZeRO 2. So far I've been able to run through a short finetuning cycle with the configuration above; however, I'm not yet able to resume from a checkpoint. This looks suspicious. I'll come back to this again after I complete #291. Let me know if you make any progress!

theobjectivedad commented 1 year ago

Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest

I've added the Dockerfile source and minimal build instructions here.

utensil commented 1 year ago

> Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest

Sorry, I haven't tried the image yet. Where's the source of the image? Are the differences planned to be merged into the official Docker image?

utensil commented 1 year ago

According to https://github.com/microsoft/DeepSpeed/issues/3775#issuecomment-1639148313, the main error that's bugging me (`RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param`) is caused by "with zero-3 some parameters end up having float type, while some others int8".

utensil commented 1 year ago

So I have debugged a bit and confirmed that the error is caused by `tensor([], device='cuda:0', dtype=torch.bfloat16)` vs. `Parameter(Params4bit([], device='cuda:0', dtype=torch.uint8))`, i.e. DeepSpeed might not support 4-bit QLoRA.
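For anyone who wants to see the dtype mix without DeepSpeed in the loop, loading the base model in 4-bit and collecting the parameter dtypes shows it directly (a sketch using the small falcon-rw-1b from the trace below; standard bitsandbytes 4-bit loading, nothing axolotl-specific):

```python
# Show the mixed dtypes that break ZeRO-3's "only one unique dtype" assumption:
# bitsandbytes 4-bit weights are stored as uint8 Params4bit, while the remaining
# parameters stay in a floating-point dtype.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",            # same small model as in the trace below
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
print({p.dtype for p in model.parameters()})
# e.g. {torch.uint8, torch.float16/bfloat16} -> more than one unique element
```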

I've also tried the workaround; it got past the original error and trained a bit further, but still failed.

Modified code:

                # Workaround inside all_gather_coalesced in
                # deepspeed/runtime/zero/partition_parameters.py: instead of requiring
                # a single unique dtype across params, force bfloat16.
                for p in params:
                    print(p)
                dtype = torch.bfloat16 # get_only_unique_item(p.dtype for p in params) if not quant else torch.int8

                flat_tensor = torch.empty(partition_sz * world_size,
                                          dtype=dtype,
                                          device=get_accelerator().current_device_name(),
                                          requires_grad=False)

The failure:

Parameter containing:
Parameter(Params4bit([], device='cuda:1', dtype=torch.bfloat16))
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
{'loss': 10.4062, 'learning_rate': 0.0, 'epoch': 0.01}                          
 12%|█████▋                                       | 1/8 [00:05<00:39,  5.63s/it]Traceback (most recent call last):

  File "/workspace/axolotl/scripts/finetune.py", line 341, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1526, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1796, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2641, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2666, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 900, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 789, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 785, in custom_forward
    return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 379, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 491, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 371, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch, forward)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 424, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params, forward)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1061, in all_gather_coalesced
    handles = _dist_allgather_fn(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 84, in _dist_allgather_fn
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 312, in allgather_fn
    return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 297, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 200, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
Traceback (most recent call last):
RuntimeError: output tensor must have the same type as input tensor
utensil commented 1 year ago

Related: https://github.com/microsoft/DeepSpeed/issues/3620

casper-hansen commented 10 months ago

The title should be changed to include Llama models; I get the `RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>` error on startup with 4x 4090s and a ZeRO-3 config.

Aillian commented 9 months ago

Any luck solving the `RuntimeError: expected there to be only one unique element in ...` error? I am having the same error.

grimulkan commented 9 months ago

I am guessing that a lot more has to change in DeepSpeed for it to support QLoRA. It doesn't even support 8-bit. It sucks, because there is no native tensor or pipeline parallelism in HF transformers.

Aillian commented 9 months ago

Now I know: you can't use 4/8-bit quantization with DeepSpeed ZeRO 3. Also, for some reason, only A100 GPUs work. I have tried A6000s but they do not work; I get tensor type mismatch errors even though I am using fp16.

grimulkan commented 9 months ago

Same. I do 4/8-bit training on A6000s (with slow, naive MP) and fp16/bf16 on A100s + DeepSpeed when I can get access, but the speed difference is quite large. I see no technical reason why we can't have TP/PP working on A6000s or a smaller number of GPUs when paired with quantization, but I believe that solution does not exist today (at least with Megatron and/or DeepSpeed).

noobmaster29 commented 8 months ago

I'm having similar issues running LoRA with ZeRO-3. This tutorial suggests it's supported. I'm not sure how axolotl implements training, but theoretically ZeRO-3 should work with 8-bit?

https://huggingface.co/docs/peft/accelerate/deepspeed-zero3-offload