loss spike when training qwen1.5 with sample_packing:true

smhd001 commented 5 months ago

Please check that this issue hasn't been reported before.

[X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

No spike in the training loss.

Current behaviour

Steps to reproduce

Train a model with the following config.

Config yaml

base_model: Qwen/Qwen1.5-7B-Chat
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: true
load_in_4bit: false

strict: false

datasets:
  - path: ...
    type: sharegpt
    system_promp: You are a helpful assistant.
    field_human: user
    field_model: assistant
    data_files:

val_set_size: 0.001
output_dir: ./lora-out

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 512
lora_alpha: 256
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: ...
wandb_entity:
wandb_watch:
wandb_name: ...
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 6
num_epochs: 1
optimizer: adamw_bnb_8bit
adam_beta2: 0.95
lr_scheduler: cosine
learning_rate: 0.0003

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1

flash_attention: true

warmup_steps: 5
eval_steps: 150
eval_table_size:
eval_sample_packing: false
eval_max_new_tokens: 128
saves_per_epoch: 10
debug:
deepspeed: deepspeed_configs/zero1.json
weight_decay: 0.01

special_tokens:
chat_template: chatml
default_system_message: You are a helpful assistant.

Possible solution

Could it be because one packed input contains multiple samples and the loss is not averaged over them?
Does this have any effect on training? Let me know if any other information is needed.
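
To make the question concrete, here is a toy sketch in plain Python with made-up numbers. It is not axolotl's or the trainer's actual loss code, and I am not claiming which of the two reductions is used in practice. The function names and the `sample_ids` layout are hypothetical, chosen only for illustration. It shows how the loss of one packed sequence differs depending on whether it is averaged over all trained tokens or averaged per sample first.

```python
from statistics import mean

def token_mean_loss(per_token_losses, sample_ids):
    """Average over every trained token in the packed sequence."""
    kept = [l for l, s in zip(per_token_losses, sample_ids) if s is not None]
    return sum(kept) / len(kept)

def sample_mean_loss(per_token_losses, sample_ids):
    """Average within each sample first, then across samples."""
    by_sample = {}
    for l, s in zip(per_token_losses, sample_ids):
        if s is None:  # masked token (e.g. prompt or padding), not trained on
            continue
        by_sample.setdefault(s, []).append(l)
    return mean([mean(v) for v in by_sample.values()])

# Hypothetical packed sequence: one masked token, a short "hard" sample (id 0)
# and a longer "easy" sample (id 1).
losses     = [9.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
sample_ids = [None, 0,   0,   1,   1,   1,   1,   1,   1]

print(token_mean_loss(losses, sample_ids))   # 1.75
print(sample_mean_loss(losses, sample_ids))  # 2.5
```

With token averaging, long samples dominate the packed loss. With per-sample averaging, a short sample with high loss counts as much as a long one, so the two reductions can report quite different values for the same batch.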

Which Operating Systems are you using?

[X] Linux
[ ] macOS
[ ] Windows

Python Version

3.10/docker

axolotl branch-commit

main/decb66e17013

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

NanoCode012 commented 5 months ago

Hey, is this behavior consistent? Does this happen with other datasets or on another retry? How's your eval loss?

smhd001 commented 5 months ago

It is consistent across reruns and across different subsets of this dataset. However, I currently can't test it with a completely different dataset. My eval loss seems normal.

NanoCode012 commented 4 months ago

You could try the datasets in the example configs for testing, though they're a bit small.
