Closed: sanghyuk-choi closed this issue 4 months ago.
Hey @sanghyuk-choi, please open a pull request if it doesn't work with vLLM (our default tool for local models): https://github.com/allenai/reward-bench/blob/6a5e0c4315b84beccfbbd99ca4878fda8ae31d56/scripts/run_generative.py#L125
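For reference, a quick way to sanity-check that the model generates at all under vLLM, independent of the reward-bench wiring. The prompt below is a placeholder, not the OffsetBias judge template (see the model card for the actual fixed template):

```python
# Minimal sketch: run the generative judge locally with vLLM.
# The prompt here is illustrative only; the model expects its own
# fixed judge template in practice.
from vllm import LLM, SamplingParams

llm = LLM(model="NCSOFT/Llama-3-OffsetBias-8B", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = (
    "You are a fair judge. Compare the two responses to the instruction "
    "and state which one is better.\n\n"
    "Instruction: ...\nResponse A: ...\nResponse B: ..."
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```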
Thank you, @natolambert. I've just opened a pull request for this issue: https://github.com/allenai/reward-bench/pull/159. I would greatly appreciate it if you could add our two models to RewardBench.
Due to slight differences in the prompt templates and model dtypes since my first comment, the updated scores are as follows. NCSOFT/Llama-3-OffsetBias-RM-8B (reward model): {"Chat": 0.9720670391061452, "Chat Hard": 0.8070175438596491, "Safety": 0.890087328887329, "Reasoning": 0.9054468816500246}. NCSOFT/Llama-3-OffsetBias-8B (generative model): {"Chat": 0.9217877094972067, "Chat Hard": 0.8026315789473685, "Safety": 0.8609156897156898, "Reasoning": 0.7611856823266219}.
You can use our default chat template.
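For illustration, a minimal sketch of applying that chat template with transformers, assuming the checkpoint ships the default Llama-3 chat template (the conversation content below is made up):

```python
# Minimal sketch: format a prompt/response pair for the reward model
# with the tokenizer's default (Llama-3) chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NCSOFT/Llama-3-OffsetBias-RM-8B")

conversation = [
    {"role": "user", "content": "What is the capital of France?"},        # illustrative
    {"role": "assistant", "content": "The capital of France is Paris."},  # illustrative
]
text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(text)
```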
We have released two models: one is a reward model, and the other is a generative model with a fixed prompt template, similar to Prometheus. Details are on the Hugging Face pages and in our paper.
Our local results are: reward model {"Chat": 0.9776536312849162, "Chat Hard": 0.8201754385964912, "Safety": 0.8839476307476307, "Reasoning": 0.9244591313362799}; generative model {"Chat": 0.9413407821229051, "Chat Hard": 0.8004385964912281, "Safety": 0.8622770094770095, "Reasoning": 0.7627134828395264}.
The reward model can be evaluated without any additional code, but for the generative model, may I open a pull request to add support for our model?
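For reference, a minimal sketch of what "without any additional code" means for the reward model, assuming it loads as a standard sequence-classification head that outputs a scalar reward (the example conversations are made up; check the model card for the exact usage):

```python
# Minimal sketch: score a chosen/rejected pair with the reward model.
# Assumption: the checkpoint exposes a sequence-classification head
# with a single scalar output (num_labels == 1).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "NCSOFT/Llama-3-OffsetBias-RM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def score(conversation):
    """Return the scalar reward for a chat-formatted conversation."""
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0].item()

prompt = {"role": "user", "content": "Summarize the plot of Hamlet."}  # illustrative
chosen = {"role": "assistant", "content": "Hamlet seeks revenge for his father's murder..."}
rejected = {"role": "assistant", "content": "Hamlet is a comedy about pirates."}

print("chosen  :", score([prompt, chosen]))
print("rejected:", score([prompt, rejected]))
```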