Please check that this issue hasn't been reported before.
[X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Training should run properly while updating parameters of transformer layers.
Current behaviour
Error
File "~/axolotl/env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 149, in __init__
[rank2]: self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
[rank2]: IndexError: list index out of range
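For context on where this IndexError comes from, here is a minimal sketch (not axolotl's or DeepSpeed's actual code) of how an unfrozen_parameters pattern that matches nothing leaves the optimizer with an empty parameter group, which is exactly what stage3.py indexes into. The model, pattern handling, and param-group construction below are illustrative assumptions.

```python
# Minimal sketch, assuming the trainer only hands parameters with
# requires_grad=True to the optimizer (model and matching logic are illustrative).
import re
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the real LLM

# Freeze everything, then unfreeze only names matching the config patterns.
unfrozen_patterns = [r"transformer.blocks.[0-7]."]  # from unfrozen_parameters
for name, param in model.named_parameters():
    param.requires_grad = any(re.match(p, name) for p in unfrozen_patterns)

# Trainer-style param group that keeps only trainable parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW([{"params": trainable}], lr=1e-4)

# DeepSpeed ZeRO stage3 then does roughly:
#     self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
# With no matching (trainable) parameters the group is empty, so indexing [0]
# raises the IndexError shown above.
print(len(optimizer.param_groups[0]["params"]))  # -> 0
```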
Steps to reproduce
When I unfreeze embed_tokens or lm_head, or do not freeze anything, the training runs as expected.
When I freeze the transformer layers, the error occurs.
Config yaml
base_model: ~/results/run1/checkpoint-2000
model_type: AutoModelForCausalLM
tokenizer_config: ~/tokenizer/final_tokenizer_hf
tokenizer_type: LlamaTokenizer
trust_remote_code: true
# Resize the model embeddings when new tokens are added to multiples of 32
# This is reported to improve training speed on some models
resize_token_embeddings_to_32x: true
load_in_8bit: false
load_in_4bit: false
strict: false
unfrozen_parameters:
- transformer.blocks.[0-7].
# - ^lm_head.weight$
# - ^model.embed_tokens.weight$
model_config:
output_router_logits: true
datasets:
- path: json
type: "completion"
data_files: ~/data/data.jsonl
ds_type: json
output_dir: ~/results/
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
logging_steps: 1
warmup_steps: 10
gradient_accumulation_steps: 4
micro_batch_size: 8
num_epochs: 3
max_steps: 2000
eval_steps: 100
optimizer: adamw_hf
lr_scheduler: cosine
learning_rate: 0.0001
# wandb_project:
# wandb_key :
# wandb_entity:
# wandb_name:
# wandb_log_model:
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
save_total_limit: 1
save_steps: 100
debug:
deepspeed: ~/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
Possible solution
No response
Which Operating Systems are you using?
[X] Linux
[ ] macOS
[ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
transformer.blocks.[0-7]. doesn't match up with any llama models. It looks like you're using the llama tokenizer though. There isn't enough information here for me to help without knowing the model architecture.
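To illustrate the mismatch being pointed out here, a short sketch follows: it compares the config's pattern against representative Hugging Face Llama-style parameter names (the names listed are assumptions based on that naming scheme, not taken from this issue).

```python
# Sketch: compare the config's pattern against typical Hugging Face
# Llama-2-style parameter names (names here are illustrative examples).
import re

pattern = r"transformer.blocks.[0-7]."  # from unfrozen_parameters
llama_style_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.7.mlp.gate_proj.weight",
    "lm_head.weight",
]

for name in llama_style_names:
    print(name, bool(re.match(pattern, name)))
# Every line prints False: no parameter stays unfrozen, so the optimizer
# receives an empty parameter list and DeepSpeed stage3 raises IndexError.
# A Llama-style pattern such as r"model.layers.[0-7]." would match the
# layer parameters instead (an assumption based on the naming above).
```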
Thanks for the reply, @winglian.
I am using Llama-2 as the base model. I tried with an updated configuration as well; however, I received the same error as mentioned earlier.