Pattaro closed this issue 6 months ago.
For SFT, here are the train/validation loss curves I have for Llama 7B on all the data we finetuned on. Since by default we're only tuning for 1 epoch, the drop in training loss is not going to be huge (since the model is constantly seeing new data). I would look at the loss for the validation data though -- that should be going down monotonically since it's being calculated on the same held-out data.
For KTO, you seem to have ended the training early, but it looks like the rewards_train/margin is slowly going up? I would smooth the data before visualizing it, or maybe use Weights & Biases. Here's what the smoothed reward_margin curves look like for me (averaged across 100 steps).
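For reference, here is a minimal sketch of the kind of non-overlapping running average described above (the function name and window size are placeholders, not part of the repo):

```python
def smooth(values, window=100):
    """Average consecutive non-overlapping windows of `values`,
    dropping any ragged tail shorter than `window`."""
    out = []
    for i in range(0, len(values) - window + 1, window):
        out.append(sum(values[i:i + window]) / window)
    return out
```

Applying this to the logged rewards_train/margin values (e.g., with `window=100`) gives one point per 100 steps, which makes a slow upward trend much easier to see than the raw per-step curve.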
Thank you for your reply! Following your suggestion, I smoothed the curves for several runs. The SFT curve does match your description, but the KTO curve still shows the loss increasing and the margin decreasing (averaged every 40 steps). Due to some limitations I only have 480 samples, and I did not terminate the training early.
A couple of ideas:

1. Do you have even numbers of desirable (i.e., positive) and undesirable (i.e., negative) data? If not, I would set `desirable_weight` and `undesirable_weight` such that (`desirable_weight` * num positive) / (`undesirable_weight` * num negative) is in the range [1, 1.33]. You can override the default setting for both weights by appending `++loss.desirable_weight=[some value]` and `++loss.undesirable_weight=[some value]` on the command line.
2. If that doesn't help, I would run KTO for 2 or 3 epochs by overriding with `++n_epochs=[some value]` on the command line. Unfortunately, 480 samples is pretty small, so you might run into overfitting issues, but the training loss should at least go down.
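The weight-balancing rule above can be sketched as a small helper. This is an illustration, not part of the repo; it fixes one weight at 1.0 and scales the other so the ratio lands at the low end (exactly 1.0) of the suggested [1, 1.33] range:

```python
def kto_weights(num_pos, num_neg):
    """Return (desirable_weight, undesirable_weight) such that
    (desirable_weight * num_pos) / (undesirable_weight * num_neg) == 1.0,
    keeping one weight at 1.0 so loss magnitudes stay comparable."""
    if num_pos >= num_neg:
        # More positives than negatives: down-weight the desirable examples.
        return num_neg / num_pos, 1.0
    # More negatives than positives: down-weight the undesirable examples.
    return 1.0, num_pos / num_neg
```

The two returned values would then be passed via `++loss.desirable_weight=...` and `++loss.undesirable_weight=...` on the command line.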
Thank you for your reply! In response to your suggestion 1, I set the number of positive samples to 241 and the number of negative samples to 241, a 1:1 ratio. In response to your suggestion 2, I will give it a try.
Initially, I used KTO for training, and the loss did not converge at all, as shown in the following training result graph.
Later, using llama7b and the hh data, I ran exactly the script you provided: `python train.py loss=sft model=llama7b datasets=[hh] exp_name=llama7b_sft mode=train ++cache_dir=/data/models`, and the training result graph is as follows:
The only difference in the SFT training from yours is that I set `use_flash_attention` to false.