CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning from Human Feedback (RLHF)
MIT License

Learnt Reward Modelling example #25

Closed cat-state closed 1 year ago

cat-state commented 2 years ago

Create an example showing reward modelling. This could use an artificially limited synthetic reward source, or the Anthropic HHH data (already on the Stability cluster). More ideas for tasks: https://github.com/CarperAI/trlx/issues/13#issuecomment-1273632021 (cc @haileyschoelkopf)
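For reference, the usual way to learn a reward model from preference data like the Anthropic HHH set is a pairwise Bradley-Terry objective: minimise `-log σ(r(chosen) - r(rejected))` over chosen/rejected response pairs. A minimal sketch of that idea, using a toy linear reward model over precomputed features rather than the trlx API (all names here are hypothetical, and a real example would fine-tune a transformer head instead):

```python
# Toy learned reward model: r(x) = w . x, trained with the pairwise
# Bradley-Terry loss -log sigmoid(r(chosen) - r(rejected)).
# Hypothetical sketch; not the trlx API.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_reward_model(chosen, rejected, lr=0.1, epochs=200):
    """chosen/rejected: (n, d) feature arrays for preferred/dispreferred responses."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=chosen.shape[1])
    for _ in range(epochs):
        diff = chosen - rejected                 # feature difference per pair
        margin = diff @ w                        # r(chosen) - r(rejected)
        # gradient of mean(-log sigmoid(margin)) w.r.t. w
        grad = -((1.0 - sigmoid(margin))[:, None] * diff).mean(axis=0)
        w -= lr * grad
    return w

# Synthetic preference data: the first feature decides which response is better.
rng = np.random.default_rng(1)
chosen = rng.normal(size=(64, 4)); chosen[:, 0] += 1.0
rejected = rng.normal(size=(64, 4)); rejected[:, 0] -= 1.0

w = train_reward_model(chosen, rejected)
acc = ((chosen - rejected) @ w > 0).mean()
print(f"pairwise accuracy on training pairs: {acc:.2f}")
```

The same loss carries over directly to a language-model reward head: replace the linear features with the final hidden state of the chosen/rejected completions and optimise with Adam.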

LouisCastricato commented 2 years ago

@jagilley is doing this with a prompt engineered reward model.

cat-state commented 2 years ago

> @jagilley is doing this with a prompt engineered reward model.

Ohh, I actually meant one that learns a reward model; I'll clarify the title.

LouisCastricato commented 2 years ago

Great. Folks at ScaleAI are doing this.

LouisCastricato commented 2 years ago

I sent them the issue. Daniel, happy to assign scale folks to this issue.