Open utensil opened 1 year ago
Hi @utensil, I'm working with DeepSpeed and am having similar issues. Although my process is still broken, I'll share my current config in case it helps. For testing, I have been able to start the training process on 1 node w/ 3x A6000s under ZeRO 2. Here is my Makefile target:
WORKSPACE_HOST_PATH:=...
MODELS_HOST_PATH:=...
DATA_HOST_PATH:=...
WORK_HOST_PATH:=...
train:
	docker run --gpus='all' -it --rm \
		--volume=$(WORKSPACE_HOST_PATH):/workspace \
		--volume=$(MODELS_HOST_PATH):/models \
		--volume=$(DATA_HOST_PATH):/data \
		--volume=$(WORK_HOST_PATH):/work \
		--volume=$(WORKSPACE_HOST_PATH)/extern/axolotl:/opt/axolotl \
		--env-file=$(CURDIR)/.env \
		--entrypoint=accelerate \
		quay.io/theobjectivedad/axolotl-main:latest \
		launch \
		--config_file /work/accelerate/basic.yaml \
		/opt/axolotl/scripts/finetune.py \
		/work/atheos/config.yaml
My accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /SET_IN_AXOLOTL_CONFIG.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: "no"
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Deepspeed config:
{
"optimizer": {
"type": "auto"
},
"scheduler": {
"type": "auto"
},
"activation_checkpointing": {
"partition_activations": "auto"
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "auto"
},
"offload_param": {
"device": "auto"
},
"allgather_bucket_size": "auto",
"allgather_bucket_dtype": "auto",
"dp_bucket_size": "auto",
"overlap_comm": "auto",
"contiguous_gradients": "auto",
"sub_group_size": "auto",
"reduce_bucket_size": "auto"
},
"gradient_clipping": "auto",
"fp16": {
"enabled": "auto"
},
"bf16": {
"enabled": "auto"
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
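For context, the "auto" values in a DeepSpeed config like this are resolved by the Hugging Face Trainer's DeepSpeed integration from the corresponding TrainingArguments, not by DeepSpeed itself. A minimal sketch of that resolution (placeholder values taken from the configs in this thread, and not necessarily how Axolotl wires it internally):

# Minimal sketch: the HF Trainer fills the "auto" fields of a DeepSpeed config
# from TrainingArguments. Paths and values below are placeholders from this thread.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/work/atheos/output1",
    deepspeed="/work/accelerate/ds_stage2_auto.json",  # the "auto" config above
    per_device_train_batch_size=4,   # fills train_micro_batch_size_per_gpu
    gradient_accumulation_steps=1,   # with world size, fills train_batch_size
    bf16=True,                       # fills bf16.enabled
    learning_rate=5e-5,              # fills the optimizer/scheduler "auto" entries
    weight_decay=0.1,                # likewise
)
# A Trainer constructed with these arguments launches DeepSpeed with the resolved config.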
Axolotl config:
###############################################################################
# Model
###############################################################################
base_model: /models/llama-7b-hf
base_model_config: /models/llama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
output_dir: /work/atheos/output1
sequence_len: 2048
max_packed_sequence_len: 1024
tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
  pad_token: "<unk>"
special_tokens:
###############################################################################
# Precision & Model loading
###############################################################################
bf16: full
bfloat16: true
fp16: false
float16: false
tf32: true
load_in_8bit: false
load_in_4bit: false
lora_model_dir:
###############################################################################
# Dataset
###############################################################################
datasets:
  - path: /data/GPTeacher/Instruct
    type: gpteacher
dataset_prepared_path: /work/last_run_prepared
val_set_size: 0.02
###############################################################################
# Training
###############################################################################
deepspeed: /work/accelerate/ds_stage2_auto.json
adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
lora_fan_in_fan_out: false
# WanDB configuration
wandb_project: smoketest
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 8
optimizer:
torchdistx_path:
lr_scheduler: cosine
learning_rate: 5.0e-5
train_on_inputs: false
group_by_length: false
early_stopping_patience: 3
auto_resume_from_checkpoints: true
resume_from_checkpoint:
logging_steps: 500
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 20
eval_steps: 500
save_steps: 500
debug: false
weight_decay: 0.1
fsdp:
fsdp_config:
@theobjectivedad Thanks for sharing the configs, I'll give it a try ASAP. BTW, what do you mean by the "process is still broken" if it already works for multiple GPUs?
Oh, I need to run ZeRO 3, this seems to be the config for ZeRO 2
Hello @utensil , you are correct - my testing so far has only been w/ ZeRO 2. So far I've been able to run through a short finetuning cycle with the configuration above; however, I'm not yet able to resume from a checkpoint, which looks suspicious. I'll come back to this again after I complete #291. Let me know if you make any progress!
Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest
I've added the Dockerfile source and minimal build instruction here.
> Hello again @utensil , I'd be curious to see if you had any better (or different) results with this image: quay.io/theobjectivedad/axolotl-main:latest
Sorry, haven't tried the image yet. Where's the source of the image? Are the differences planned to be merged into the official Docker image?
According to https://github.com/microsoft/DeepSpeed/issues/3775#issuecomment-1639148313 , the main error that's bugging me (RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param ) is caused by "with zero-3 some parameters end up having float type, while some others int8".
So I have debugged a bit and confirmed that the error is caused by tensor([], device='cuda:0', dtype=torch.bfloat16) vs. Parameter(Params4bit([], device='cuda:0', dtype=torch.uint8)), i.e. DeepSpeed might not support 4-bit QLoRA.
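In case it helps with reproducing, here is a minimal sketch (single GPU, outside DeepSpeed) of how the mixed dtypes can be confirmed; the model id is the falcon-rw-1b model from the traceback further down, and the quantization settings are assumptions rather than Axolotl's exact code:

# Sketch: load a model in 4-bit and list the distinct parameter dtypes.
# bitsandbytes stores Params4bit data as torch.uint8, while non-quantized
# parameters stay in bf16/fp32, which is what breaks get_only_unique_item().
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",            # same model as in the traceback below
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
print({p.dtype for p in model.parameters()})
# Expected to contain both torch.uint8 (Params4bit) and torch.bfloat16.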
I've also tried the workaround; it got past the original error and trained a bit further, but still failed.
Modified code (presumably in deepspeed/runtime/zero/partition_parameters.py, inside all_gather_coalesced, judging from the traceback):
for p in params:
    print(p)
# force the flat all-gather buffer to bfloat16 instead of requiring one unique dtype
dtype = torch.bfloat16  # get_only_unique_item(p.dtype for p in params) if not quant else torch.int8
flat_tensor = torch.empty(partition_sz * world_size,
                          dtype=dtype,
                          device=get_accelerator().current_device_name(),
                          requires_grad=False)
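Presumably, hard-coding dtype = torch.bfloat16 is only a probe rather than a fix: the partitioned 4-bit (uint8) weights still get gathered into bfloat16 buffers, which looks consistent with the "output tensor must have the same type as input tensor" error in the failure below.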
The failure:
Parameter containing:
Parameter(Params4bit([], device='cuda:1', dtype=torch.bfloat16))
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:0', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
Parameter containing:
tensor([], device='cuda:1', dtype=torch.bfloat16)
{'loss': 10.4062, 'learning_rate': 0.0, 'epoch': 0.01}
12%|█████▋ | 1/8 [00:05<00:39, 5.63s/it]Traceback (most recent call last):
File "/workspace/axolotl/scripts/finetune.py", line 341, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1526, in train
return inner_training_loop(
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1796, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2641, in training_step
loss = self.compute_loss(model, inputs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2666, in compute_loss
outputs = model(**inputs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
return self.base_model(
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 900, in forward
transformer_outputs = self.transformer(
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 789, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/root/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-rw-1b/e4b9872bb803165eb22f0a867d4e6a64d34fce19/modeling_falcon.py", line 785, in custom_forward
return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 379, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 491, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 371, in fetch_sub_module
self.__all_gather_params(params_to_prefetch, forward)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 424, in __all_gather_params
handle = partitioned_params[0].all_gather_coalesced(partitioned_params, forward)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1061, in all_gather_coalesced
handles = _dist_allgather_fn(
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 84, in _dist_allgather_fn
return instrument_w_nvtx(dist.allgather_fn)(output_tensor, input_tensor, group=group, async_op=True)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 312, in allgather_fn
return all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op, debug=debug)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 297, in all_gather_into_tensor
return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 200, in all_gather_into_tensor
return self.all_gather_function(output_tensor=output_tensor,
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor)
Traceback (most recent call last):
RuntimeError: output tensor must have the same type as input tensor
The title should be changed to include Llama models; I get the RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>
error on startup with 4x 4090s with a ZeRO 3 config.
Any luck solving the RuntimeError: expected there to be only one unique element in ...
error? I am having the same error.
I am guessing that a lot more has to change in DeepSpeed for it to support QLoRA. It doesn't even support 8-bit. It sucks, because there is no native tensor or pipeline parallelism in HF transformers.
Now I know: you can't use 4/8-bit quantization with DeepSpeed ZeRO 3. Also, for some reason, only A100 GPUs work. I have tried A6000s but they do not work; I get tensor type mismatch errors even though I am using fp16.
Same. I do 4/8-bit training on A6000s (with slow naive MP) and fp/bf-16 on A100s + Deepspeed when I can get access. But the speed difference is quite large. I see no technical reason why we can't have TP/PP working on A6000s or smaller number of GPUs when paired with quantization, but that solution does not exist today I believe (at least with Megatron and/or Deepspeed).
I'm having similar issues running LoRA with ZeRO 3. This tutorial suggests it's supported. I'm not sure how Axolotl implements training, but theoretically ZeRO 3 should work with 8-bit?
https://huggingface.co/docs/peft/accelerate/deepspeed-zero3-offload
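For reference, the combination that tutorial covers is LoRA on a non-quantized (fp16/bf16) base model under ZeRO-3; a rough sketch of that setup (model path and hyperparameters are placeholders taken from the configs above, and this is not Axolotl's internal code):

# Sketch of the PEFT + DeepSpeed ZeRO-3 setup the linked tutorial describes:
# LoRA on a non-quantized base model, launched via `accelerate launch` with a
# ZeRO-3 config. The failures discussed in this thread appear once
# load_in_4bit/load_in_8bit is added on top of this.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "/models/llama-7b-hf",          # placeholder path, as in the configs above
    torch_dtype=torch.bfloat16,     # half precision, no 4/8-bit quantization
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training this with a Trainer whose TrainingArguments point at a ZeRO-3
# DeepSpeed config matches what the tutorial documents.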
I've been trying to make the combination
deepspeed + qlora + falcon
work, but for unknown reasons I'm stuck in an error maze.

Setup
winglian/axolotl-runpod:main-py3.9-cu118-2.0.0
bash -c "curl -H 'Cache-Control: no-cache' https://raw.githubusercontent.com/utensil/llm-playground/main/scripts/entry/prepare_ax.sh -sSf | bash"
ds_config.json (final version, modified from the default one in axolotl):
examples/falcon/config-40b-qlora.yml:
push_dataset_to_hub: utensil
hf_use_auth_token: true
datasets:
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:
# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: falcon-qlora
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: /content/axolotl-trained/falcon-qlora-40b-gsm8k/
# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 3
# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 5
save_steps: 10
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: ">>ABSTRACT<<"
  eos_token: "<|endoftext|>"
Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+52907a66, 52907a66, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8