CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
MIT License

How to implement a conditional reward? #146

Closed: mukhal closed this issue 1 year ago

mukhal commented 1 year ago

I want my reward function to depend on the prompt used. Mainly, I want to fine-tune an LM for a conditional generation task, e.g., summarization. It seems that the reward function expects only a list of model outputs. How can I access the prompt used for each output sample? Any suggestions?

maxreciprocate commented 1 year ago

Hello! The reward function currently expects the whole sample = prompt + output as input, so in the case of summarization you could split each sample over "TL;DR" to recover the prompt, or, in the general case, you could match each sample against the prepared prompts by longest common prefix. However, the latter might be too complex, and interest has previously been expressed in updating the signature to reward_fn(samples, prompts, outputs) to make general usage easier, so I will add this shortly.
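For summarization, the workaround above might look like the following minimal sketch (the helper name, the "TL;DR:" marker format, and the length-ratio scoring rule are purely illustrative, not part of the trlx API):

```python
from typing import List, Tuple

# Illustrative helper, not part of the trlx API: recover (prompt, output)
# pairs from full samples by splitting on the summarization marker.
# Assumes every prompt ends with "TL;DR:".
def split_samples_on_tldr(samples: List[str]) -> List[Tuple[str, str]]:
    pairs = []
    for sample in samples:
        prompt, marker, output = sample.partition("TL;DR:")
        pairs.append((prompt + marker, output.strip()))
    return pairs

# Reward function with the current single-argument signature; the scoring
# rule below is a trivial placeholder (length ratio), just to show where a
# prompt-conditional metric would go.
def reward_fn(samples: List[str]) -> List[float]:
    rewards = []
    for prompt, output in split_samples_on_tldr(samples):
        ratio = len(output.split()) / max(len(prompt.split()), 1)
        rewards.append(-abs(ratio - 0.2))  # prefer summaries ~20% of prompt length
    return rewards
```

Once the signature is updated to reward_fn(samples, prompts, outputs), the splitting step becomes unnecessary, since the prompts are passed in directly.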

James4Ever0 commented 1 year ago

In addition to a parameter change that allows passing a prompt-dependent reward function, I'm more interested in how the reward function itself is implemented.

How can a reward function be trained efficiently from a fine-tuned autoregressive model like GPT-2 (since it has already seen a lot of data, and reusing its existing parameters is easy with LoRA via OpenDelta, in reference to #110), so that it perceives both prompts and answers and then produces a reward that faithfully matches the scores given by humans?
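For context, the approach commonly used elsewhere (not code from this repo) is to put a scalar head on the language model and fit it to human preference pairs with a pairwise ranking loss; a minimal sketch assuming a Hugging Face GPT-2 backbone, where the class name and the toy preference pair are illustrative:

```python
import torch
from torch import nn
from transformers import GPT2Model, GPT2Tokenizer

class GPT2RewardModel(nn.Module):
    """GPT-2 backbone with a scalar head that scores prompt + answer sequences."""
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(model_name)
        self.value_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Read the score at the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative preference pair: prompt + chosen answer vs. prompt + rejected answer.
chosen = tokenizer(["Q: 2+2? A: 4"], return_tensors="pt", padding=True)
rejected = tokenizer(["Q: 2+2? A: 5"], return_tensors="pt", padding=True)

r_chosen = model(chosen["input_ids"], chosen["attention_mask"])
r_rejected = model(rejected["input_ids"], rejected["attention_mask"])

# Pairwise ranking loss: push the chosen score above the rejected one.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
```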

What if the prompt is long because we want to do few-shot dialogs? Can the reward function handle arbitrarily long input on both the prompt and the answer? What if we want to talk to the bot and preserve previous context just like ChatGPT does, which requires passing the whole chat history as the prompt and the desired answer as the input?

I'm pretty sure such a reward function is missing from this repo. It seems crucial, yet proper implementations are rare compared to PPO and RLHF on language models (of which I have at least seen a few).

In my opinion, if this cannot be implemented with current models, we can just let a human be part of the reward function and score the answers according to the prompts (though tiresome for sure).

James4Ever0 commented 1 year ago

To handle the wide range of tasks that ChatGPT is capable of, the reward function is either composite (combining multiple metrics and models, fine-tuned according to human preference) or monolithic (necessarily huge, yet resource-efficient enough that it could be trained from the SFT model or as part of the RLHF model).

jon-tow commented 1 year ago

Hello, @James4Ever0 ! We do not plan on incorporating reward modeling into this repository. If you want to get a better idea of such fine-tuning (SFT + RMs) in practice, you should check out the amazing reward-modeling repo created by one of the authors of trlx 😄.

mukhal commented 1 year ago

@reciprocated is there a way now to use a reward function of the form reward_fn(samples, prompts, outputs)?
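For readers arriving later: assuming the signature change discussed above is available, a prompt-conditional reward function could be wired up roughly as below. The toy scoring rule is a placeholder, and the exact trlx.train arguments should be checked against the repo's current examples:

```python
from typing import List

import trlx

# Prompt-conditional reward: scores each output against its own prompt.
# The rule below is a toy placeholder; swap in a trained reward model.
# **kwargs absorbs any extra keyword arguments the trainer may pass.
def reward_fn(samples: List[str], prompts: List[str], outputs: List[str], **kwargs) -> List[float]:
    rewards = []
    for prompt, output in zip(prompts, outputs):
        # e.g. reward outputs that are much shorter than their prompts
        rewards.append(float(len(output.split()) < 0.3 * len(prompt.split())))
    return rewards

trainer = trlx.train(
    "gpt2",                                   # base model to fine-tune with PPO
    reward_fn=reward_fn,
    prompts=["Long article text ... TL;DR:"],  # training prompts (illustrative)
)
```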