hahmad2008 opened this issue 7 months ago
winglian:
You need to use FSDP + QLoRA for it to work across GPUs; otherwise it tries to load the entire model onto each GPU.
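One quick way to confirm whether the weights are actually being sharded is to watch per-GPU memory right after the model finishes loading, for example:

watch -n 1 nvidia-smi --query-gpu=index,memory.used --format=csv

If every GPU settles at roughly the size of the full quantized model, the weights are being replicated rather than sharded across ranks.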
@winglian, I used FSDP with QLoRA and the model was still loaded as a full copy on each GPU.
I tried it both with and without passing an accelerate config, and the behavior was the same:
acc-config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
config.yaml
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true
load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path: prepared-dataset
val_set_size: 0.0
output_dir: finetuned-model
## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
# - lm_head.*
# - model.embed_tokens.*
# - model.layers.2[0-9]+.block_sparse_moe.gate.*
# - model.layers.2[0-9]+.block_sparse_moe.experts.*
# - model.layers.3[0-9]+.block_sparse_moe.gate.*
# - model.layers.3[0-9]+.block_sparse_moe.experts.*
model_config:
  output_router_logits: true
adapter: qlora
lora_model_dir:
sequence_len: 256
sample_packing: true
pad_to_sequence_len: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
#lora_target_modules:
# - gate
# - q_proj
# - k_proj
# - v_proj
# - o_proj
# - w1
# - w2
# - w3
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
special_tokens:
It seems I need to enable fsdp_offload_params: true.
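For reference, the fsdp_config block in axolotl's later FSDP + QLoRA examples looks roughly like the following; this is a sketch, and the exact keys supported at commit c67fb715 may differ, so it is worth checking against the repo's own example configs:

fsdp_config:
  fsdp_offload_params: true
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: MixtralDecoderLayer

Separately, Mixtral's decoder layer class in transformers is MixtralDecoderLayer, not LlamaDecoderLayer, so the fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer line in the posted config.yaml likely gives the auto-wrap policy nothing to wrap.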
Original issue:
Expected Behavior
I am trying to fine-tune mistralai/Mixtral-8x7B-v0.1 using the examples in the repo, but I get CUDA out of memory.
I am using 4 A10 GPUs, each with 20 GB of memory, and fine-tuning with QLoRA and DeepSpeed ZeRO-2.
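As a rough back-of-the-envelope check: Mixtral-8x7B has about 46.7B parameters, so the 4-bit quantized weights alone come to roughly 46.7e9 × 0.5 bytes ≈ 23 GB, already more than a single 20 GB A10 can hold. DeepSpeed ZeRO-2 shards only optimizer states and gradients, not the model weights, so each GPU still tries to load a full quantized copy, which is consistent with the OOM happening at load time; avoiding it requires sharding the weights themselves (FSDP full_shard or ZeRO-3), possibly combined with CPU offload.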
Current behaviour
An out-of-memory error occurs while loading the model onto the GPUs.
Command:
accelerate launch scripts/finetune.py config.yaml --deepspeed deepspeed_configs/zero2.json
Steps to reproduce
Command:
accelerate launch scripts/finetune.py config.yaml --deepspeed deepspeed_configs/zero2.json
using this commit: c67fb7158312e47e3326f077f74485cf0a23b51a
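For the FSDP + QLoRA route suggested in the comments, the launch command would drop the --deepspeed flag and point accelerate at the FSDP config instead, roughly as follows (assuming the acc-config.yaml and config.yaml shown earlier in this thread):

accelerate launch --config_file acc-config.yaml scripts/finetune.py config.yaml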
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
c67fb7158312e47e3326f077f74485cf0a23b51a