Nero10578 opened this issue 1 week ago
Hey, thanks for the report.
Could also be because of changes to using chat_templates?
Are you able to test one with chat_template before the GA patch (making sure you install the old requirements)?
@Nero10578 btw, I see you're using FSDP, how many GPUs? thanks
It could also be that the grad norm is higher because it doesn't expect the repeating of roles
> Hey, thanks for the report.
> Could also be because of changes to using chat_templates?
> Are you able to test one with chat_template before the GA patch (making sure you install the old requirements)?

Will have to test, but I suspect this is the source of the issue.

> @Nero10578 btw, I see you're using FSDP, how many GPUs? thanks

I am using 2x 3090 Ti with FSDP with CPU offloading.

> It could also be that the grad norm is higher because it doesn't expect the repeating of roles

I don't think so, because the dataset didn't change and it was fine before.
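For what it's worth, the grad_norm in the training logs is (as far as I understand) the total L2 norm of all parameter gradients before clipping, i.e. the value returned by torch.nn.utils.clip_grad_norm_. A rough sketch of what that number measures, not axolotl's actual logging code:

```python
import torch

def total_grad_norm(model: torch.nn.Module) -> float:
    """Total L2 norm over all parameter gradients (per rank; FSDP reduces across shards)."""
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()

# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) returns this same
# pre-clipping total norm, which is what ends up in the logged grad_norm column.
```

So a jump from below 1.0 to around 5 means the raw gradients are scaled differently, not that clipping stopped being applied.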
So I tested it out with the commit before the GA patch (718cfb2dd1ff2a03b89e3b95f0b1aa1e04046e6e) and with the latest commit (724b660d5632adc842e062c1588b325211ce48a1). These are the differences in grad_norm:
Before GA patch:
After GA Patch:
The thing is, now that I have let both run, the GA patch does make the loss better; this can be seen in how the first eval shows a lower loss after the GA patch.
Before GA patch:
After GA Patch:
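The lower loss after the GA patch makes sense to me: the fix makes the loss a true per-token average over the whole accumulated batch instead of an average of per-micro-batch means, so micro-batches with few tokens no longer get outsized weight, and the overall loss (and therefore gradient) scaling changes, which would also move the reported grad_norm. A toy illustration of the normalization difference (made-up numbers, not the actual transformers/axolotl code):

```python
import torch

# Hypothetical per-token losses for two micro-batches of very different lengths.
micro_batch_losses = [
    torch.full((8192,), 0.4),  # long, well-packed micro-batch
    torch.full((512,), 2.0),   # short micro-batch with higher loss
]

# Old behaviour: mean per micro-batch, then mean over the accumulation window.
old_loss = torch.stack([l.mean() for l in micro_batch_losses]).mean()  # (0.4 + 2.0) / 2 = 1.20

# GA fix: one mean over all tokens in the accumulated batch, so the short
# micro-batch only contributes in proportion to its token count.
new_loss = torch.cat(micro_batch_losses).mean()  # ~0.49

print(old_loss.item(), new_loss.item())
```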
I am using a different config with Liger kernels enabled:
base_model: /home/user/models/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM
train_on_inputs: false
group_by_length: false
load_in_8bit:
load_in_4bit: false
strict: false
sequence_len: 8192
bf16: auto
flash_attention: true
shuffle_merged_datasets: true
# Data
datasets:
  - path: /home/user/datasets/conversations-escaped.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared
# Iterations
num_epochs: 1
saves_per_epoch: 8
saves_total_limit: 8
# Evaluation
val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 8
eval_table_size:
# LoRA
output_dir: ./lora_out3
adapter: lora
lora_model_dir:
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
peft_use_rslora: false
loraplus_lr_ratio: 16
save_safetensors: true
# Sampling
sample_packing: true
pad_to_sequence_len: true
# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: unsloth
# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: mistral-nemo-v1
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: loraplus-nemo--8192
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training
# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001
# Misc
auto_resume_from_checkpoints: true
logging_steps: 1
weight_decay: 0.0
special_tokens:
  pad_token: <pad>
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
Expected Behavior
Before the recent gradient accumulation fixes and related changes in transformers, the grad_norm when training Mistral Nemo 12B stayed below 1.0 as normal. Could it also be because of the changes to using chat_templates?
This was using the same config with previous versions of axolotl and transformers:
Current behaviour
The gradient norm (grad_norm) is now around 5 when training:
Steps to reproduce
Train Mistral Nemo 12B Instruct with LoRA. I used the same config as I did back when this worked fine.
The only difference is that I am now using chat_templates, where I replace the chat template in the Mistral tokenizer_config.json with the chat template shown here so that it can accept repeated same-role turns.
I did this by changing the chat template in the Mistral Nemo tokenizer config to this:
If this is the wrong way to do it, could that be causing the high grad_norm? It seems unlikely, though, since the dataset appears to be tokenized properly when I run preprocess --debug.
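A quick way to sanity-check the edited template outside of axolotl (a minimal sketch; the model path is the one from my config, and the role/content message format below is an assumption that depends on how the template was rewritten):

```python
from transformers import AutoTokenizer

# Local path from the training config; point this at the tokenizer whose
# chat template was edited in tokenizer_config.json.
tok = AutoTokenizer.from_pretrained("/home/user/models/Mistral-Nemo-Instruct-2407")

# Two consecutive turns from the same role, which the stock Mistral template
# rejects ("Conversation roles must alternate ...") via raise_exception.
messages = [
    {"role": "user", "content": "first user turn"},
    {"role": "user", "content": "second user turn in a row"},
    {"role": "assistant", "content": "reply"},
]

# With a template that allows repeated roles this prints the rendered prompt;
# with the stock template it raises a jinja2 TemplateError instead.
print(tok.apply_chat_template(messages, tokenize=False))
```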
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main/db51a9e4