axolotl-ai-cloud / axolotl


The error maze of deepspeed + qlora + falcon #207

Open utensil opened 1 year ago

utensil commented 1 year ago

I've been trying to make the combination of DeepSpeed + QLoRA + Falcon work, but for unknown reasons I'm stuck in an error maze.

## Setup

Axolotl config:

push_dataset_to_hub: utensil
hf_use_auth_token: true

datasets:

dataset_prepared_path: last_run_prepared
val_set_size: 0.01

# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:

# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: falcon-qlora
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: /content/axolotl-trained/falcon-qlora-40b-gsm8k/

# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 3

# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true

# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 5
save_steps: 10
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: ">>ABSTRACT<<"
  eos_token: "<|endoftext|>"
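For reference, the adapter settings above correspond roughly to the following PEFT `LoraConfig` (a sketch for orientation only; axolotl builds the real config internally, and the Falcon module names in `target_modules` are just an illustration of what `lora_target_linear: true` expands to):

```python
# Sketch of the QLoRA adapter hyperparameters above in PEFT terms (illustrative;
# axolotl constructs the actual LoraConfig internally).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # lora_r
    lora_alpha=16,         # lora_alpha
    lora_dropout=0.05,     # 0.05 for 33B/65B models per the QLoRA paper
    bias="none",
    task_type="CAUSAL_LM",
    # lora_target_linear: true targets all linear layers of the base model;
    # these Falcon module names are an illustrative expansion, not axolotl's exact list.
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
)
```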

- Environment reported by `ds_report`

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+52907a66, 52907a66, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8



## Errors

| Error | Cause | Solution |
|------|--------|-----------|
| `RuntimeError: CUDA version mismatch! DeepSpeed ops were compiled and installed with a different version than what is being used at runtime. Please re-install DeepSpeed or switch torch versions. Install CUDA version=11.8, Runtime CUDA version=11.7`  + `AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'`(#138) |  `torch 2.0.1` reinstalled for CUDA 11.7 due to unknown reason :question: | `pip3 install -U torch --index-url https://download.pytorch.org/whl/cu118` |
| `RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>` | Training started, then this error during `forward` due to unknown reason ❓  | ❌  |
| `ValueError: Found optimizer configured in the DeepSpeed config, but no scheduler. Please configure a scheduler in the DeepSpeed config.` | This setup needs explicit `optimizer` configs, even though other setups work fine without them | Add the `optimizer` configs as in the final config above |
| Many mismatch errors | The HF and DeepSpeed configs disagree | The axolotl config must not set `optimizer` and `lr_scheduler`; many DeepSpeed config values need to be set to `auto`, and missing DeepSpeed config keys must be added (see the final config above) |
| Compile errors when reinstalling DeepSpeed with `TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed --global-option="build_ext" --global-option="-j8" # --global-option="bdist_wheel"` | Maybe it's not a complete environment for compilation :question: | ❌ |
| `ValueError: Can't find a valid checkpoint at /content/axolotl-trained/falcon-qlora-40b-gsm8k/checkpoint-50` | If I disable DeepSpeed and use just `accelerate` for multiple GPUs, training is normal but resuming from a checkpoint fails (tried each of the latest 3 checkpoints) ❓ | ❌ |
| `ValueError: ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config` | If I run DeepSpeed with 1 A100, training is normal but eval fails with this error, apparently because `is_zero3()` returns false for eval ❓ | ❌ |
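For the first row, a quick way to confirm what the runtime actually sees is to print torch's build-time CUDA version, which should match the 11.8 that `ds_report` reports (a minimal check, nothing axolotl-specific):

```python
# Quick check for the "CUDA version mismatch" row: the CUDA version torch was
# built against should match the one DeepSpeed's ops were compiled with (11.8).
import torch

print("torch:", torch.__version__)             # expect 2.0.1+cu118, not +cu117
print("built with CUDA:", torch.version.cuda)  # expect '11.8'
print("CUDA available:", torch.cuda.is_available())
```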
theobjectivedad commented 1 year ago

Hi @utensil, I'm working with DeepSpeed and am having similar issues. Although my process is still broken, I'll share my current config in case it helps. For testing, I have been able to start the training process on 1 node w/ 3x A6000s under ZeRO 2. Here is my Makefile target:

WORKSPACE_HOST_PATH:=...
MODELS_HOST_PATH:=...
DATA_HOST_PATH:=...
WORK_HOST_PATH:=...

train:
    docker run --gpus='all' -it --rm \
        --volume=$(WORKSPACE_HOST_PATH):/workspace \
        --volume=$(MODELS_HOST_PATH):/models \
        --volume=$(DATA_HOST_PATH):/data \
        --volume=$(WORK_HOST_PATH):/work \
        --volume=$(WORKSPACE_HOST_PATH)/extern/axolotl:/opt/axolotl \
        --env-file=$(CURDIR)/.env \
        --entrypoint=accelerate \
             quay.io/theobjectivedad/axolotl-main:latest \
            launch \
                --config_file /work/accelerate/basic.yaml \
                /opt/axolotl/scripts/finetune.py \
                    /work/atheos/config.yaml

My accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /SET_IN_AXOLOTL_CONFIG.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: "no"
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Deepspeed config:

{
    "optimizer": {
        "type": "auto"
    },
    "scheduler": {
        "type": "auto"
    },
    "activation_checkpointing": {
        "partition_activations": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "auto"
        },
        "offload_param": {
            "device": "auto"
        },
        "allgather_bucket_size": "auto",
        "allgather_bucket_dtype": "auto",
        "dp_bucket_size": "auto",
        "overlap_comm": "auto",
        "contiguous_gradients": "auto",
        "sub_group_size": "auto",
        "reduce_bucket_size": "auto"
    },
    "gradient_clipping": "auto",
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
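As far as I understand, the `"auto"` values above are filled in by the HF Trainer from its `TrainingArguments` at runtime; here is a rough sketch of that mapping (illustrative only, with values taken from the configs in this comment, not axolotl's actual wiring):

```python
# Rough sketch: HF Trainer substitutes "auto" entries in the DeepSpeed config
# with the corresponding TrainingArguments at runtime (illustrative only;
# axolotl assembles these arguments itself).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/work/atheos/output1",
    per_device_train_batch_size=4,    # -> train_micro_batch_size_per_gpu: "auto"
    gradient_accumulation_steps=1,    # -> (with world size) train_batch_size: "auto"
    bf16=True,                        # -> bf16.enabled: "auto"
    learning_rate=5.0e-5,             # -> optimizer/scheduler "auto" params
    deepspeed="/work/accelerate/ds_stage2_auto.json",  # the JSON above
)
```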

Axolotl config:

###############################################################################
# Model
###############################################################################
base_model: /models/llama-7b-hf
base_model_config: /models/llama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

output_dir: /work/atheos/output1

sequence_len: 2048
max_packed_sequence_len: 1024

tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
  pad_token: "<unk>"
special_tokens:

###############################################################################
# Precision & Model loading
###############################################################################

bf16: full
bfloat16: true

fp16: false
float16: false

tf32: true

load_in_8bit: false
load_in_4bit: false

lora_model_dir:

###############################################################################
# Dataset
###############################################################################
datasets:
  - path: /data/GPTeacher/Instruct
    type: gpteacher

dataset_prepared_path: /work/last_run_prepared
val_set_size: 0.02

###############################################################################
# Training
###############################################################################

deepspeed: /work/accelerate/ds_stage2_auto.json

adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
lora_fan_in_fan_out: false

# WanDB configuration
wandb_project: smoketest
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 8
optimizer:
torchdistx_path:

lr_scheduler: cosine
learning_rate: 5.0e-5
train_on_inputs: false
group_by_length: false

early_stopping_patience: 3

auto_resume_from_checkpoints: true
resume_from_checkpoint:

logging_steps: 500
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 20
eval_steps: 500
save_steps: 500
debug: false

weight_decay: 0.1
fsdp:
fsdp_config:
utensil commented 1 year ago

@theobjectivedad Thanks for sharing the configs, I'll give them a try ASAP. BTW, what do you mean by "my process is still broken" if it already works on multiple GPUs?

utensil commented 1 year ago

Oh, I need to run ZeRO 3; this seems to be the config for ZeRO 2.

theobjectivedad commented 1 year ago

Hello @utensil, you are correct - my testing so far has only been with ZeRO 2. So far I've been able to run through a short finetuning cycle with the configuration above; however, I'm not yet able to resume from a checkpoint. This looks suspicious. I'll come back to this again after I complete #291. Let me know if you make any progress!

theobjectivedad commented 1 year ago

Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest

I've added the Dockerfile source and minimal build instructions here.

utensil commented 1 year ago

> Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest

Sorry, I haven't tried the image yet. Where's the source of the image? Are the differences planned to be merged into the official Docker image?

utensil commented 1 year ago

According to https://github.com/microsoft/DeepSpeed/issues/3775#issuecomment-1639148313, the main error that's bugging me (`RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param`) is caused by "with zero-3 some parameters end up having float type, while some others int8".

utensil commented 1 year ago

So I have debugged a bit and confirmed that the error is caused by `tensor([], device='cuda:0', dtype=torch.bfloat16)` vs. `Parameter(Params4bit([], device='cuda:0', dtype=torch.uint8))`, i.e. DeepSpeed might not support 4-bit QLoRA.
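For anyone who wants to see the dtype mix without DeepSpeed in the loop, loading the base model in 4-bit and collecting the parameter dtypes shows it directly (a sketch using the small falcon-rw-1b from the trace below; standard bitsandbytes 4-bit loading, nothing axolotl-specific):

```python
# Show the mixed dtypes that break ZeRO-3's "only one unique dtype" assumption:
# bitsandbytes 4-bit weights are stored as uint8 Params4bit, while the remaining
# parameters stay in a floating-point dtype.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",            # same small model as in the trace below
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
print({p.dtype for p in model.parameters()})
# e.g. {torch.uint8, torch.float16/bfloat16} -> more than one unique element
```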

I've also tried the workaround; it got past the original error and trained a bit further, but still failed.

Modified code:

                # Workaround inside all_gather_coalesced in
                # deepspeed/runtime/zero/partition_parameters.py: instead of requiring
                # a single unique dtype across params, force bfloat16.
                for p in params:
                    print(p)
                dtype = torch.bfloat16 # get_only_unique_item(p.dtype for p in params) if not quant else torch.int8

                flat_tensor = torch.empty(partition_sz * world_size,
                                          dtype=dtype,
                                          device=get_accelerator().current_device_name(),
                                          requires_grad=False)

The failure:

Parameter containing:
Parameter(Params4bit([], device='cuda:1', dtype=torch.bfloat16))
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
{'loss': 10.4062, 'learning_rate': 0.0, 'epoch': 0.01}                          
 12%|█████▋                                       | 1/8 [00:05<00:39,  5.63s/it]Traceback (most recent call last):

  File "/workspace/axolotl/scripts/finetune.py", line 341, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1526, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1796, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2641, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2666, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 900, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 789, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 785, in custom_forward
    return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 379, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 491, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 371, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch, forward)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 424, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params, forward)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1061, in all_gather_coalesced
    handles = _dist_allgather_fn(
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 84, in _dist_allgather_fn
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 312, in allgather_fn
    return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 297, in all_gather_into_tensor
    return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 200, in all_gather_into_tensor
    return self.all_gather_function(output_tensor=output_tensor,
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
Traceback (most recent call last):
RuntimeError: output tensor must have the same type as input tensor
utensil commented 1 year ago

Related: https://github.com/microsoft/DeepSpeed/issues/3620

casper-hansen commented 10 months ago

The title should be changed to include Llama models; I get the `RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>` error on startup with 4x 4090s and a ZeRO-3 config.

Aillian commented 9 months ago

Any luck solving the `RuntimeError: expected there to be only one unique element in ...` error? I am having the same error.

grimulkan commented 9 months ago

I am guessing that a lot more has to change in DeepSpeed for it to support QLoRA. It doesn't even support 8-bit. It sucks, because there is no native tensor or pipeline parallelism in HF transformers.

Aillian commented 9 months ago

Now I know: you can't use 4/8-bit quantization with DeepSpeed ZeRO 3. Also, for some reason, only A100 GPUs work. I have tried A6000s but they do not work; I get tensor type mismatch errors even though I am using fp16.

grimulkan commented 9 months ago

Same. I do 4/8-bit training on A6000s (with slow, naive MP) and fp16/bf16 on A100s + DeepSpeed when I can get access, but the speed difference is quite large. I see no technical reason why we can't have TP/PP working on A6000s or a smaller number of GPUs when paired with quantization, but I believe that solution does not exist today (at least with Megatron and/or DeepSpeed).

noobmaster29 commented 8 months ago

I'm having similar issues running LoRA with ZeRO-3. This tutorial suggests it's supported. I'm not sure how axolotl implements training, but theoretically ZeRO-3 should work with 8-bit?

https://huggingface.co/docs/peft/accelerate/deepspeed-zero3-offload