ContextualAI / HALOs

A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
https://arxiv.org/abs/2402.01306
Apache License 2.0

Request for details and assistance on PPO Experiments with SFT+PPO training #16

Open roshansridhar opened 5 months ago

roshansridhar commented 5 months ago

Hello Developers,

Firstly, I would like to thank you for the excellent work on this repository and for sharing the plots in other issues. I'm currently using your library to train a model with SFT+PPO, and I've successfully replicated the SFT experiment per the results shared in ContextualAI/HALOs/issues/13.

However, I'm seeing negligible improvement from the PPO stage of training. Could you share the details of your PPO experiments? I noted in a previous comment that the preference tuning was run significantly longer, so I increased the number of PPO epochs to 3 in my experiments. Is this adjustment in line with what was done in your experiments? Additionally, could you elaborate on how and when to decide which checkpoint to use for downstream tasks, especially in the PPO, DPO, and KTO scenarios?

Link to my plots: l7b_ppo_0419 [screenshot of training curves, 2024-04-22]

Here are the commands I used for my experiments:

Additional Query:

Thank you for any guidance or insights you can provide.

kawine commented 5 months ago

sorry for the late reply @roshansridhar. I don't have access to the wandb logs anymore, but my coauthor should -- @xwinxu can you paste the plots from the last llama7b+ppo run, as well as the json dump of the config?

However, I'm experiencing negligible improvement with the PPO part of the training.

your charts look pretty good to me. the policy loss is declining over time. the critic loss declines at first but is flat afterwards -- this is expected because the value network is pretty small (a 2-layer MLP), so its capacity is limited and the irreducible error will be fairly large. the mean reward is going up over time as well. the only suspicious part is the exploding loss/seqratio, but you can probably fix that by setting ++loss.cliprange to a value smaller than 0.5.
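e.g., roughly along these lines (illustrative only: the ++loss.cliprange override is the part that matters here; the train.py launcher, the model/dataset/exp_name arguments, and the 0.2 value are placeholder assumptions, not the exact command from any run):

```bash
# illustrative sketch: assumes the repo's Hydra-style train.py launcher.
# ++loss.cliprange is the override mentioned above; 0.2 is just an example value < 0.5,
# and the model/dataset/exp_name arguments are placeholders.
python train.py loss=ppo model=llama7b datasets=[shp,hh,oasst] \
    exp_name=l7b_ppo_cliprange02 \
    ++loss.cliprange=0.2
```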

so I adjusted the PPO epochs to 3 in my experiments

IIRC, we did PPO for 1 epoch as well, the same as all the other alignment methods.

could you elaborate on how and when to decide which checkpoint to use for downstream tasks, especially for PPO, DPO, and KTO scenarios?

sure. for KTO, we basically used the same hyperparameter settings recommended in the DPO paper, so no decision needed to be made there. for PPO, hyperparameters were chosen based on whether training was stable, rewards went up over time, and the GPT-4-judged winrate against the baseline went up over time.

When conducting SFT training, the library calculates train and validation losses using splits of the training dataset. If I use the same dataset for PPO, how can I ensure that I am not inadvertently retraining on the train split?

doing SFT on the positive examples in a preference/feedback dataset is fairly common practice, as done in the DPO paper. this isn't something to worry about IMO, but you can always just do PPO/DPO/KTO without SFT, or do SFT on a separate dataset. all you need to do is change the ++model.load_from field.
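e.g., roughly (illustrative only: ++model.load_from is the field named above; the train.py launcher, configs, and the checkpoint path are placeholder assumptions):

```bash
# illustrative sketch: start PPO from a checkpoint of your choosing (a base model or an
# SFT model trained on a different dataset) by overriding ++model.load_from, the field
# named in the comment above. script name, configs, and the path are placeholders.
python train.py loss=ppo model=llama7b datasets=[shp,hh,oasst] \
    exp_name=l7b_ppo_from_custom_checkpoint \
    ++model.load_from=/path/to/your/sft/or/base/checkpoint
```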

Based on your plots and results shared, am I correct in understanding that you had a batch size of 32 and conducted 200k steps of SFT training,

bs=32 sounds right. @xwinxu can you double-check how many SFT steps there were for llama7b on [shp,hh,oasst]?