artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Whenever I use QLoRA to train LLaMA/Llama 2 on an instruction-tuning dataset like Dolly or Alpaca I get a periodically oscillating training loss #228

Open ritabratamaiti opened 1 year ago

ritabratamaiti commented 1 year ago

[Training loss curve showing a periodic, sawtooth-like oscillation]

Is this behavior normal/acceptable? Why does it happen?

bqcao commented 1 year ago

I see a similar sawtooth-shaped loss on Alpaca data; an excerpt of my training output log is below:

```
{'loss': 1.5872, 'learning_rate': 1e-06, 'epoch': 0.01}
{'loss': 1.237, 'learning_rate': 1e-06, 'epoch': 0.02}
{'loss': 1.4684, 'learning_rate': 1e-06, 'epoch': 0.04}
{'loss': 2.1779, 'learning_rate': 1e-06, 'epoch': 0.05}
{'loss': 3.357, 'learning_rate': 1e-06, 'epoch': 0.06}
{'loss': 1.5047, 'learning_rate': 1e-06, 'epoch': 0.07}
{'loss': 1.2749, 'learning_rate': 1e-06, 'epoch': 0.08}
{'loss': 1.477, 'learning_rate': 1e-06, 'epoch': 0.1}
{'loss': 2.1822, 'learning_rate': 1e-06, 'epoch': 0.11}
{'loss': 3.2731, 'learning_rate': 1e-06, 'epoch': 0.12}
{'loss': 1.5442, 'learning_rate': 1e-06, 'epoch': 0.13}
{'loss': 1.2816, 'learning_rate': 1e-06, 'epoch': 0.14}
{'loss': 1.4423, 'learning_rate': 1e-06, 'epoch': 0.16}
{'loss': 2.1455, 'learning_rate': 1e-06, 'epoch': 0.17}
{'loss': 3.2909, 'learning_rate': 1e-06, 'epoch': 0.18}
{'loss': 1.6531, 'learning_rate': 1e-06, 'epoch': 0.19}
{'loss': 1.2675, 'learning_rate': 1e-06, 'epoch': 0.2}
```

Does that mean the loss is trending the wrong way as far as fine-tuning is concerned? Or which loss should be the key indicator when fine-tuning on Alpaca?

Thanks!

BTW, @ritabratamaiti, how did you get the above plot?

BugReporterZ commented 1 year ago

This might be due to the "group by length" option, try disabling it.

--group_by_length [GROUP_BY_LENGTH]
    Group sequences into batches with same length. Saves memory and speeds up training considerably. (default: True)
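
If you're building the training arguments in Python rather than going through qlora.py's command-line flags, a minimal sketch looks like this (everything here other than group_by_length is a placeholder, not the repo's defaults):

```python
from transformers import TrainingArguments

# Minimal sketch: group_by_length is the TrainingArguments field that controls
# length-grouped batching. Other values below are placeholders.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    group_by_length=False,  # disable length-grouped batching
)
```
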
vincentmin commented 1 year ago

@BugReporterZ Could you explain the reasoning for why group_by_length may be causing this issue?

BugReporterZ commented 1 year ago

It appears to group training examples into length-ordered chunks, and the longer examples at the start of each chunk show a higher loss. I also recall reading elsewhere that it can cause an "oscillating" training loss curve, which is consistent with what you're seeing. Maybe it was this comment by artidoro:

https://github.com/artidoro/qlora/issues/84#issuecomment-1572408347
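
To make that intuition concrete, here's a toy simulation (pure illustration, not code from this repo). It assumes per-example loss grows with sequence length and compares per-batch mean loss for a shuffled ordering versus an ordering sorted by length inside chunks of several batches, which is roughly how a length-grouped sampler behaves:

```python
import random
import statistics

random.seed(0)

# Toy dataset: assume each example's "loss" scales with its length plus noise.
lengths = [random.randint(16, 512) for _ in range(512)]
losses = [0.002 * n + random.gauss(0, 0.05) for n in lengths]

BATCH = 16
MEGABATCH = 8 * BATCH  # rough approximation: sort within chunks of several batches

def batch_means(order):
    """Mean loss of each consecutive batch for a given example ordering."""
    return [
        statistics.mean(losses[i] for i in order[b:b + BATCH])
        for b in range(0, len(order), BATCH)
    ]

# Random shuffling: batch means stay roughly flat.
shuffled = list(range(len(lengths)))
random.shuffle(shuffled)

# Length grouping: sort by length inside each megabatch, so consecutive batches
# go from long to short examples, then jump back up at the next megabatch.
grouped = []
for start in range(0, len(shuffled), MEGABATCH):
    chunk = shuffled[start:start + MEGABATCH]
    grouped.extend(sorted(chunk, key=lambda i: lengths[i], reverse=True))

print("random order  :", [round(m, 2) for m in batch_means(shuffled)[:16]])
print("length-grouped:", [round(m, 2) for m in batch_means(grouped)[:16]])
# The second row ramps down and resets every few batches: a sawtooth.
```
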

ritabratamaiti commented 1 year ago

@bqcao This is from Weights & Biases (wandb); I set it up to visualize training.
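
If it helps, the setup is roughly this (a minimal sketch; it assumes you've already run wandb login, and the run name is a placeholder):

```python
from transformers import TrainingArguments

# Sketch only: with report_to="wandb", Trainer streams train/loss to Weights & Biases,
# which produces plots like the one above.
training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",
    run_name="qlora-alpaca",  # hypothetical run name
    logging_steps=10,         # how often a train/loss point gets logged
)
```
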

@BugReporterZ I see! Thanks for the explanation. Has there been any work on incorporating QLoRA with SFTTrainer?

BugReporterZ commented 1 year ago

@ritabratamaiti I'm not aware of efforts in that regard, unfortunately.

vincentmin commented 1 year ago

@ritabratamaiti Yes, QLoRA is supported by SFTTrainer. You can use this example script and set load_in_4bit=True and use_peft=True: https://github.com/lvwerra/trl/blob/main/examples/scripts/sft_trainer.py

See this blog post for more details: https://huggingface.co/blog/4bit-transformers-bitsandbytes
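
Roughly, that script boils down to something like the sketch below. Treat it as an outline rather than the script itself: argument names differ between trl versions, and the model/dataset names here are just placeholders.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder model id

# 4-bit NF4 quantization, the QLoRA setup described in the bitsandbytes blog post.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapters are trained on top of the frozen 4-bit base model.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",  # column name depends on the dataset
)
trainer.train()
```
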

@BugReporterZ thanks for the explanation.

ritabratamaiti commented 1 year ago

Thanks @BugReporterZ and @vincentmin

bqcao commented 1 year ago

Thanks @BugReporterZ! Yes indeed, after disabling group_by_length I no longer see the sawtooth shape. Appreciate @ritabratamaiti and @vincentmin as well!

usmanxia commented 1 year ago

@BugReporterZ What is the impact on training if we disable group_by_length? Is it comparable to having it set to true, with the only gain being memory savings?

BugReporterZ commented 1 year ago

I haven't investigated that in detail. I have always left it enabled because the eval loss curve didn't seem to be affected. You can refer to the Transformers documentation for what it does (the same as what Artidoro relayed in the comment I linked earlier):

https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.group_by_length

group_by_length (bool, optional, defaults to False) — Whether or not to group together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). Only useful if applying dynamic padding.
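
To give a rough idea of the efficiency side (a toy calculation with made-up lengths, not a benchmark): with dynamic padding every batch is padded to its longest example, so batching similar lengths together wastes far fewer pad tokens.

```python
import random

random.seed(0)
lengths = [random.randint(16, 512) for _ in range(4096)]  # made-up sequence lengths
BATCH = 16

def pad_fraction(order):
    """Fraction of tokens that are padding when each batch is padded to its longest member."""
    pad = total = 0
    for b in range(0, len(order), BATCH):
        chunk = [lengths[i] for i in order[b:b + BATCH]]
        longest = max(chunk)
        total += longest * len(chunk)
        pad += sum(longest - n for n in chunk)
    return pad / total

shuffled = list(range(len(lengths)))
random.shuffle(shuffled)
sorted_by_len = sorted(range(len(lengths)), key=lengths.__getitem__)

print(f"random batches : {pad_fraction(shuffled):.0%} padding")
print(f"length-grouped : {pad_fraction(sorted_by_len):.0%} padding")
```
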

usmanxia commented 1 year ago

Got it, thank you