ContextualAI / HALOs

A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
https://arxiv.org/abs/2402.01306
Apache License 2.0

Is there a problem with training? #13

Closed. Pattaro closed this issue 6 months ago.

Pattaro commented 7 months ago

Initially, I used KTO for training, and the loss did not converge at all, as shown in the training result graph below.

[Screenshot: KTO training curves]

Later, based entirely on Llama 7B and the HH data, I used exactly the script you provided: python train.py loss=sft model=llama7b datasets=[hh] exp_name=llama7b_sft mode=train ++cache_dir=/data/models. The training result graphs are below.

[Screenshots: SFT training curves]

The only difference from your SFT setup is that I set use_flash_attention to false.

kawine commented 7 months ago

For SFT, here are the train/validation loss curves I have for Llama 7B on all the data we finetuned on. Because by default we only tune for 1 epoch, the drop in training loss is not going to be huge (the model is constantly seeing new data). I would look at the loss on the validation data though -- that should decrease monotonically, since it's always calculated on the same held-out data.

[Screenshot: SFT train/validation loss curves for Llama 7B]

For KTO, you seem to have ended the training early, but it looks like the rewards_train/margin is slowly going up? I would smooth out the data before visualizing it, or maybe use Weights & Biases. Here's what the smoothed reward_margin curves look like for me (averaged across 100 steps).

[Screenshot: smoothed rewards_train/margin curve, averaged over 100 steps]
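
If you're plotting from the raw logs yourself, here's a minimal smoothing sketch (not part of the repo; the variable names and window size are just illustrative):

```python
import numpy as np

def smooth(values, window=100):
    """Moving average over `window` consecutive logged values."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values  # too few points to smooth; return the raw series
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Hypothetical usage: `margins` holds the rewards_train/margin values pulled
# from your logs; plot smooth(margins, window=100) instead of the raw series.
```
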
Pattaro commented 7 months ago

Thank you for your reply! Following your suggestion, I smoothed the curves for each result. The SFT curve is indeed in line with what you describe, but the KTO curve still shows the loss increasing and the margin decreasing (averaged every 40 steps). Due to some constraints I only have 480 samples, and I am not terminating the training early.

[Screenshots: smoothed loss_train and rewards_train/margins curves]

kawine commented 7 months ago

A couple ideas:

  1. Do you have even numbers of desirable (i.e., positive) and undesirable (i.e., negative) data? If not, I would set desirable_weight and undesirable_weight such that (desirable_weight * num positive) / (undesirable_weight * num negative) is in the range [1, 1.33]. You can override the default setting for both weights by appending ++loss.desirable_weight=[some value] and ++loss.undesirable_weight=[some value] on the command line (see the example after this list).

  2. If that doesn't help, I would run KTO for 2 or 3 epochs by overriding with ++n_epochs=[some value] on the command line. Unfortunately, 480 samples is a pretty small dataset, so you might run into overfitting, but the training loss should at least go down.
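
For example, here's a sketch of what the combined overrides might look like; the weight values are purely illustrative (chosen so the ratio lands at 1.33 for a hypothetical 300 positive / 900 negative split), and everything before the ++ overrides should match whatever KTO command you're already running:

```
# Illustrative numbers only: with 300 desirable and 900 undesirable examples,
# (3.0 * 300) / (0.75 * 900) = 1.33, the top of the suggested range.
python train.py <your existing KTO arguments> \
    ++loss.desirable_weight=3.0 \
    ++loss.undesirable_weight=0.75 \
    ++n_epochs=2
```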

Pattaro commented 7 months ago

Thank you for your reply! Regarding suggestion 1, I set the number of positive samples to 241 and the number of negative samples to 241, a 1:1 ratio between the two. Regarding suggestion 2, I will give it a try.