CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

How to attribute different rewards to parts of the same rollout with PPO? #79

Open paulbricman opened 1 year ago

paulbricman commented 1 year ago

This is related to #69 (which is why I phrased it in a similar way), but still feels a bit different.

Let's say the model generates a sequence of three related sentences (or paragraphs, or tokens) after being prompted, i.e. the rollout. Is there a way to assign them different rewards individually, say based on different criteria, instead of just one single aggregate reward? Perhaps I have a fixed budget of reward that I want to distribute unevenly across the parts, so the individual values differ but their sum stays constant. In the limit of generality, this would mean being able to assign a specific reward value to each individual token/action in the rollout/trajectory. In this use case, the individual rewards can only be computed after the whole sequence of parts has been generated (i.e. you can't reward step 1 before generating step 3).
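To make this concrete, here is a rough sketch of the kind of per-part reward assignment I mean (pure illustration, not an existing trlx interface):

```python
# Pseudocode for per-part rewards: the rollout is split into parts, each part
# gets its own reward, and the rewards sum to a fixed budget.
parts = ["part one ...", "part two ...", "part three ..."]
raw_scores = [0.2, 0.5, 0.3]           # per-part scores from different criteria
reward_budget = 1.0                    # total reward mass to distribute
part_rewards = [reward_budget * s / sum(raw_scores) for s in raw_scores]
assert abs(sum(part_rewards) - reward_budget) < 1e-9  # mass is conserved
```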

Is this possible with trlx? Would it require a custom orchestrator, or is there a way to specify individual token rewards directly while keeping the standard structure? Is this even possible with PPO in the first place, or is there a fundamental misunderstanding on my part?

Thanks for building this!

LouisCastricato commented 1 year ago

The reward function returns a list of floats, not an aggregate float. Is there something more than this that you would want? Perhaps I am misunderstanding.
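For reference, a minimal per-sample reward function in roughly that shape might look like this (sketch only; the exact signature can differ between versions, and `samples` is assumed to be the list of generated strings):

```python
from typing import List

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    # One scalar per generated sample in the batch; toy length-based score here.
    return [float(len(sample.split())) for sample in samples]
```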

Dahoas commented 1 year ago

From the PPO perspective, each token receives a reward given by the KL divergence between the fine-tuned model and the reference model, plus, for the last token, the score provided by the user-supplied reward_fn. Perhaps the simplest option is to generalize our reward_fn to provide a score per token instead of a score per continuation. However, I caution that this could drastically slow down training.
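Purely as an illustration of what such a generalization might look like (this is not the current interface, and `tokenized_samples` is a hypothetical input):

```python
from typing import List

def per_token_reward_fn(tokenized_samples: List[List[str]]) -> List[List[float]]:
    # Hypothetical generalization (not the current trlx reward_fn signature):
    # return one score per token of each continuation instead of one per sample.
    # Toy criterion: longer tokens get larger rewards.
    return [[float(len(token)) for token in tokens] for tokens in tokenized_samples]
```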

An alternative, approximate approach may be to augment your prompt pipeline with newly generated continuations. For example, if you want 2 separate rewards for an input prompt p, you can first generate output s1 and compute reward_fn(s1), then update your prompt pipeline with p + s1, generate output s2 conditioned on p + s1, and compute reward_fn(s2). Does this seem reasonable?
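A rough sketch of that loop, where generate(), reward_fn() and prompt_pipeline are hypothetical stand-ins rather than actual trlx APIs:

```python
# Two-stage rollout sketch: score s1 on its own, then feed p + s1 back in
# as a new prompt and score s2 separately.
def two_stage_rollout(prompts, generate, reward_fn, prompt_pipeline):
    for p in prompts:
        s1 = generate(p)                 # first continuation
        r1 = reward_fn(p + s1)           # scored on its own
        prompt_pipeline.append(p + s1)   # continuation becomes a new prompt
        s2 = generate(p + s1)            # second continuation, conditioned on p + s1
        r2 = reward_fn(p + s1 + s2)      # scored separately
        yield (s1, r1), (s2, r2)
```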

LouisCastricato commented 1 year ago

Yeah, the latter suggestion is what I was getting at; I'm actually using essentially that solution for the ELO stuff I've been working on, and I think it works quite well.

paulbricman commented 1 year ago

Thanks again for a super quick reply!

> The reward function returns a list of floats, not an aggregate float. Is there something more than this that you would want? Perhaps I am misunderstanding.

@LouisCastricato, if I understand correctly, it returns a list of floats with one float per sample, no? I was hoping to be able to return a list of lists of floats, where the outer list would be indexed by samples in a batch and the inner lists by tokens within a sample.

> Perhaps the simplest option is to generalize our reward_fn to provide a score per token instead of a score per continuation. However, I caution that this could drastically slow down training.

@Dahoas, at least having an option to do this would be pretty cool! Is the reason for the efficiency hit having to compute the reward multiple times, or some other PPO-related thing?

Imagine a generated conversation between two people, where each says a number. Can I reward the tokens/actions involved in saying the larger number while penalizing those for the smaller one? Computing the reward sounds pretty trivial, but I'm not sure whether applying token-specific rewards would mess with training efficiency in some other way.
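Computing such a reward on its own could be as simple as the following sketch (illustrative only; how to attach the scores to the right token spans is the open part):

```python
import re

def number_game_rewards(conversation: str):
    # Toy reward: find the two numbers said in the conversation, reward the turn
    # containing the larger one and penalize the other. Mapping these scores onto
    # the corresponding token spans is exactly the open question above.
    a, b = (int(n) for n in re.findall(r"-?\d+", conversation)[:2])
    return [1.0, -1.0] if a > b else [-1.0, 1.0]
```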

> An alternative, approximate approach may be to augment your prompt pipeline with newly generated continuations. For example, if you want 2 separate rewards for an input prompt p, you can first generate output s1 and compute reward_fn(s1), then update your prompt pipeline with p + s1, generate output s2 conditioned on p + s1, and compute reward_fn(s2). Does this seem reasonable?

@LouisCastricato, that sounds reasonable, but I mentioned the following before:

> In this use case, the individual rewards can only be computed after the whole sequence of parts has been generated (i.e. you can't reward step 1 before generating step 3).

As in the "reward the (relatively) larger number" toy problem, I wouldn't be able to compute the reward for s1 before getting s2. This again sounds similar to the #69 request.

LouisCastricato commented 1 year ago

@dpaleka Tagging here