
Model is not saved for full finetune with Deepspeed Zero3 #1223

Open hahmad2008 opened 9 months ago

hahmad2008 commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

With a full finetune, I expect a model of roughly 2 GB in the output directory; however, the output directory is only about 1 MB. The saved checkpoint should be around 2 GB since fp16 is enabled.
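For reference, a rough size check for TinyLlama's ~1.1B parameters (plain Python; the numbers are approximate):

# Back-of-the-envelope checkpoint sizes for a ~1.1B-parameter model.
n_params = 1.1e9
print(f"fp16/bf16: ~{n_params * 2 / 1e9:.1f} GB")  # ~2.2 GB -> the expected ~2 GB model
print(f"fp32:      ~{n_params * 4 / 1e9:.1f} GB")  # ~4.4 GB -> matches the ~4.1 GiB checkpoint shown below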

Current behaviour

For TinyLlama, with a full finetune, the final model weights are not saved in the output directory.

Config

accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: true
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
downcast_bf16: 'no'
dynamo_backend: 'NO'
command_file: null
commands: null
fsdp_config: {}
tpu_name: null
tpu_zone: null
use_cpu: false

config.yaml

base_model: TinyLlama/TinyLlama-1.1B-step-50K-105b
base_model_config: TinyLlama/TinyLlama-1.1B-step-50K-105b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    type: completion
    field: text
dataset_prepared_path: prepared-dataset
val_set_size: 0.08
output_dir: model-finetuned

adapter: 
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
eval_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed/zero3.json

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Final Model

ls -lh model-finetuned

total 936K
-rw-r--r-- 1 root root 3.4K Jan 28 13:22 README.md
drwxr-xr-x 3 root root 4.0K Jan 28 13:22 checkpoint-4130
-rw-r--r-- 1 root root  697 Jan 28 13:22 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:22 generation_config.json
-rw-r--r-- 1 root root 415K Jan 28 13:22 pytorch_model.bin
drwxr-xr-x 4 root root 4.0K Jan 28 10:54 runs
-rw-r--r-- 1 root root  437 Jan 28 10:54 special_tokens_map.json
-rw-r--r-- 1 root root 489K Jan 28 10:54 tokenizer.model
-rw-r--r-- 1 root root 1012 Jan 28 10:54 tokenizer_config.json

ls -lh model-finetuned/checkpoint-4130/

total 4.1G
-rw-r--r-- 1 root root  697 Jan 28 13:19 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:19 generation_config.json
drwxr-xr-x 2 root root 4.0K Jan 28 13:20 global_step4130
-rw-r--r-- 1 root root   15 Jan 28 13:22 latest
-rw-r--r-- 1 root root 4.1G Jan 28 13:20 model.safetensors
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_0.pth
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_1.pth
-rw-r--r-- 1 root root  627 Jan 28 13:22 scheduler.pt
-rw-r--r-- 1 root root 489K Jan 28 13:22 trainer_state.json
-rw-r--r-- 1 root root 6.4K Jan 28 13:20 training_args.bin
-rwxr--r-- 1 root root  24K Jan 28 13:22 zero_to_fp32.py

Also, fp16 does not appear to be applied here: the checkpoint is 4 GB, not 2 GB.
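As an aside, the zero_to_fp32.py helper that DeepSpeed writes into the checkpoint directory (visible in the listing above) can consolidate the sharded ZeRO-3 state under global_step4130 into a single full-precision state dict, and the same logic is exposed as a Python API. A minimal sketch, assuming DeepSpeed is installed and using the paths above (the output filename is illustrative):

# Sketch: rebuild a consolidated fp32 state dict from the ZeRO-3 shards,
# using the same helper that zero_to_fp32.py wraps.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "model-finetuned/checkpoint-4130"  # contains 'latest' and global_step4130/
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)

# Illustrative output path; cast the tensors to fp16 first if a ~2 GB file is the goal.
torch.save(state_dict, "model-finetuned/pytorch_model_fp32.bin")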

Steps to reproduce

Run a full finetune of TinyLlama with the command below and the configs shown above; the final model weights are not written to the output directory.

Command

accelerate launch --config_file accelerate-config.yaml scripts/finetune.py axolotl/config.yaml

(The accelerate-config.yaml, config.yaml, deepspeed/zero3.json, and output-directory listings are identical to those shown above.)

Possible solution

I checked this issue, but the latest branch doesn't solve the problem.

Which Operating Systems are you using?

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

NanoCode012 commented 9 months ago

Was there any stack trace or error? Did you run out of disk space? Did the run quit abruptly?

hahmad2008 commented 9 months ago

@NanoCode012 Not at all.

antonpolishko commented 4 months ago

> Also, fp16 does not appear to be applied here: the checkpoint is 4 GB, not 2 GB.

Not sure how axolotl should behave, but from other experience the optimizer state (and gradients) may be saved alongside the weights, ballooning the size. We always do an import-then-save (weights_only=True) trick to keep only the weights we actually need.

As for the model folder being small, we ran into a similar issue and had to point the inference playground at the checkpoint folder instead of the model folder, like so:

python -m axolotl.cli.inference config.yaml --base_model='model-finetuned/checkpoint-4130/' --gradio
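
For what it's worth, a minimal sketch of that re-import/re-save idea (the output directory name and dtype are illustrative, and this assumes the checkpoint directory loads cleanly with transformers):

# Sketch: reload only the model weights from the checkpoint directory and
# re-save them in fp16, leaving the optimizer/ZeRO state behind.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "model-finetuned/checkpoint-4130"
out_dir = "model-finetuned-fp16"  # hypothetical output directory

model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("model-finetuned")  # tokenizer files live in the output dir

model.save_pretrained(out_dir)   # ~2.2 GB for a 1.1B-parameter model in fp16
tokenizer.save_pretrained(out_dir)
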
NanoCode012 commented 2 days ago

@hahmad2008, sorry I didn't get to follow up. I have recently used DeepSpeed ZeRO-3 for training, and it worked. Perhaps the issue is now solved?