Closed by cat-state 1 year ago
@jagilley is doing this with a prompt-engineered reward model.
Ohh, I actually meant an example that learns a reward model. I'll clarify the title.
Great. Folks at ScaleAI are doing this.
I sent them the issue. Daniel, I'm happy to assign Scale folks to this issue.
Create an example showing reward modeling. This could use a synthetic reward source that is artificially limited, or the Anthropic HHH data (already on the Stability cluster). More ideas for tasks: https://github.com/CarperAI/trlx/issues/13#issuecomment-1273632021 (cc @haileyschoelkopf)
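
To make the synthetic option concrete, here is a rough sketch of what learning a reward model from a limited synthetic reward source could look like. It is not trlx code and uses no trlx APIs: it is plain Python with a toy linear reward model trained with a pairwise (Bradley-Terry style) logistic loss. The hidden `synthetic_reward` function, the two-feature inputs, and reading "artificially limited" as "only a small number of labeled preference pairs" are all assumptions made for illustration.

```python
import math
import random


def synthetic_reward(x):
    # Hypothetical ground-truth reward (an assumption for this sketch).
    # The learner never sees it directly, only preference labels derived from it.
    return 2.0 * x[0] - 1.0 * x[1]


def make_pairs(n, rng):
    # Build n (chosen, rejected) pairs labeled by the synthetic reward.
    # Keeping n small is one way to artificially limit the reward source.
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        if synthetic_reward(a) >= synthetic_reward(b):
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs


def reward(w, x):
    # Toy linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=50, lr=0.1):
    # Minimize -log(sigmoid(r(chosen) - r(rejected))) with plain SGD.
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            grad_scale = sigmoid(margin) - 1.0  # d(loss) / d(margin)
            for i in range(len(w)):
                w[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return w


def accuracy(w, pairs):
    # Fraction of pairs where the model ranks the chosen item higher.
    correct = sum(reward(w, c) > reward(w, r) for c, r in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    train_pairs = make_pairs(200, random.Random(0))   # limited labels
    test_pairs = make_pairs(1000, random.Random(1))   # held-out pairs
    weights = train(train_pairs)
    print("learned weights:", weights)
    print("held-out pairwise accuracy:", accuracy(weights, test_pairs))
```

A real example would swap the linear model for a language model with a scalar head and the toy pairs for chosen/rejected text pairs, but the pairwise loss would keep the same shape.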