Please check that this issue hasn't been reported before.
[X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Training should run properly while updating parameters of transformer layers.
Current behaviour
Error
File "~/axolotl/env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 149, in __init__
[rank2]: self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
[rank2]: IndexError: list index out of range
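For context on where this IndexError comes from, here is a minimal sketch (not axolotl's or DeepSpeed's actual code) of how an unfrozen_parameters pattern that matches nothing leaves the optimizer with an empty parameter group, which is exactly what stage3.py indexes into. The model, pattern handling, and param-group construction below are illustrative assumptions.

```python
# Minimal sketch, assuming the trainer only hands parameters with
# requires_grad=True to the optimizer (model and matching logic are illustrative).
import re
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the real LLM

# Freeze everything, then unfreeze only names matching the config patterns.
unfrozen_patterns = [r"transformer.blocks.[0-7]."]  # from unfrozen_parameters
for name, param in model.named_parameters():
    param.requires_grad = any(re.match(p, name) for p in unfrozen_patterns)

# Trainer-style param group that keeps only trainable parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW([{"params": trainable}], lr=1e-4)

# DeepSpeed ZeRO stage3 then does roughly:
#     self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
# With no matching (trainable) parameters the group is empty, so indexing [0]
# raises the IndexError shown above.
print(len(optimizer.param_groups[0]["params"]))  # -> 0
```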
Steps to reproduce
When I unfreeze embed_tokens or lm_head, or do not freeze anything, the training runs as expected.
When I freeze the transformer layers, the error occurs.
Config yaml
base_model: ~/results/run1/checkpoint-2000
model_type: AutoModelForCausalLM
tokenizer_config: ~/tokenizer/final_tokenizer_hf
tokenizer_type: LlamaTokenizer
trust_remote_code: true
# Resize the model embeddings when new tokens are added to multiples of 32
# This is reported to improve training speed on some models
resize_token_embeddings_to_32x: true
load_in_8bit: false
load_in_4bit: false
strict: false
unfrozen_parameters:
- transformer.blocks.[0-7].
# - ^lm_head.weight$
# - ^model.embed_tokens.weight$
model_config:
output_router_logits: true
datasets:
- path: json
type: "completion"
data_files: ~/data/data.jsonl
ds_type: json
output_dir: ~/results/
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
logging_steps: 1
warmup_steps: 10
gradient_accumulation_steps: 4
micro_batch_size: 8
num_epochs: 3
max_steps: 2000
eval_steps: 100
optimizer: adamw_hf
lr_scheduler: cosine
learning_rate: 0.0001
# wandb_project:
# wandb_key :
# wandb_entity:
# wandb_name:
# wandb_log_model:
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
save_total_limit: 1
save_steps: 100
debug:
deepspeed: ~/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
Possible solution
No response
Which Operating Systems are you using?
[X] Linux
[ ] macOS
[ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
transformer.blocks.[0-7]. doesn't match up with any llama models. It looks like you're using the llama tokenizer though. There isn't enough information here for me to help without knowing the model architecture.
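To illustrate the mismatch being pointed out here, a short sketch follows: it compares the config's pattern against representative Hugging Face Llama-style parameter names (the names listed are assumptions based on that naming scheme, not taken from this issue).

```python
# Sketch: compare the config's pattern against typical Hugging Face
# Llama-2-style parameter names (names here are illustrative examples).
import re

pattern = r"transformer.blocks.[0-7]."  # from unfrozen_parameters
llama_style_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.7.mlp.gate_proj.weight",
    "lm_head.weight",
]

for name in llama_style_names:
    print(name, bool(re.match(pattern, name)))
# Every line prints False: no parameter stays unfrozen, so the optimizer
# receives an empty parameter list and DeepSpeed stage3 raises IndexError.
# A Llama-style pattern such as r"model.layers.[0-7]." would match the
# layer parameters instead (an assumption based on the naming above).
```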
Thanks for the reply, @winglian.
I am using Llama-2 as the base model. I tried with an updated configuration as well; however, I received the same error as mentioned earlier.