🚀 The feature, motivation, and pitch

Collect together settings and commonly used reward models (RMs) and evaluations. These RMs can be used for training-time eval, but we would probably also want to use multiple-choice evals too.
Ideas for RMs:

- Our HH RMs
- Sentiment classifiers
- SteamSHP? (a sketch of querying it follows this list)
- OpenAI API GPT-3.5/4 (probably only usable at test time?)
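As a concrete example of using one of these RMs for eval, here is a minimal sketch of pairwise preference scoring with SteamSHP via `transformers`. The checkpoint name and the A/B prompt template follow the model card's suggested usage, but treat the exact formatting as an assumption to verify:

```python
# Minimal sketch: pairwise preference scoring with SteamSHP.
# Checkpoint name and prompt template follow the model card; verify before use.
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "stanfordnlp/SteamSHP-flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

def preferred_response(context: str, response_a: str, response_b: str) -> str:
    """Ask SteamSHP which of two responses it prefers; returns 'A' or 'B'."""
    prompt = (
        f"POST: {context}\n\n"
        f"RESPONSE A: {response_a}\n\n"
        f"RESPONSE B: {response_b}\n\n"
        "Which response is better? RESPONSE"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # The model emits a single token, the letter of the preferred response.
    output = model.generate(input_ids, max_new_tokens=1)
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]
```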
Multiple-choice evals:

- Anthropic's model-generated evals
- OpenAI's new evals
- lm-evaluation-harness-supported evals (would need to add the ability to run the model ourselves for that? A sketch of the harness API follows this list)
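For the harness specifically, the programmatic entry point could look like the sketch below. The task list and `model_args` are illustrative, and the exact API varies between harness versions (this matches the v0.3-style `simple_evaluate` interface):

```python
# Minimal sketch: running multiple-choice evals through lm-evaluation-harness.
# Task names and model_args are illustrative placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                # load a HuggingFace causal LM
    model_args="pretrained=gpt2",     # swap in the model under evaluation
    tasks=["hellaswag", "arc_easy"],  # any harness-supported MC tasks
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy metrics
```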
Ranking evals: provide a pre-ranked set of responses and have the model order them from most to least aligned; agreement with the reference ranking could then be scored, as in the sketch below.
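One way to score such a ranking eval is a rank-correlation statistic like Kendall's tau. A minimal sketch, assuming the model's ordering has already been elicited separately:

```python
# Minimal sketch: scoring a ranking eval with Kendall's tau.
# `reference_ranking` is the pre-ranked order (most to least aligned);
# `model_ranking` is the order the model under evaluation produced.
from scipy.stats import kendalltau

def ranking_score(reference_ranking: list[str], model_ranking: list[str]) -> float:
    """Kendall's tau between reference and model orderings.

    1.0 = identical order, -1.0 = fully reversed.
    """
    # Map each response to its position in the model's ordering.
    position = {resp: i for i, resp in enumerate(model_ranking)}
    reference_positions = list(range(len(reference_ranking)))
    model_positions = [position[resp] for resp in reference_ranking]
    tau, _p_value = kendalltau(reference_positions, model_positions)
    return tau

# Example: the model swaps the middle two responses.
ref = ["r1", "r2", "r3", "r4"]
hyp = ["r1", "r3", "r2", "r4"]
print(ranking_score(ref, hyp))  # ~0.667
```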