allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

multi gpu inference with run_rm.py #95

Closed: SeungoneKim closed this issue 3 months ago

SeungoneKim commented 6 months ago

Hello Nathan,

Thank you for this valuable resource! I strongly believe we needed more standardized benchmarks to evaluate reward/evaluator models.

I think submit_eval_jobs.py (using AI2's Beaker) supports multi-GPU inference, but run_rm.py doesn't at the moment. I was wondering whether this is intended (correct me if I'm wrong)!

Best, Seungone

natolambert commented 6 months ago

Hey @SeungoneKim -- we just haven't needed it yet (the biggest classifiers are 34B). Happy to add it.

run_dpo.py works nicely with 2, 4, 6, or 8 GPUs; that's why it's included. Let me know if you want to open a PR :)
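
For context, here is a minimal sketch of what multi-GPU inference for a classifier-style reward model can look like with transformers' `device_map="auto"`. The checkpoint name and scoring helper are placeholders, not reward-bench's actual run_rm.py:

```python
# Sketch: shard a classifier-style reward model across all visible GPUs.
# The model name and scoring helper are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "example-org/reward-model-34b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",          # spreads layers across available GPUs (requires accelerate)
    torch_dtype=torch.bfloat16,
)
model.eval()

def score(text: str) -> float:
    """Scalar reward for a single formatted conversation string."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()
```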

SeungoneKim commented 6 months ago

Thanks for your response @natolambert!

I was trying to test generative reward modeling (with GPT-4, Prometheus, Auto-J), and it seems like run_dpo.py has slightly different functionality than what I need.

Considering that generative RMs need to generate CoT-style feedback before their scoring decision, I think it would be best to integrate vLLM and add a separate run_generative_rm.py script. Users could then add additional generative RMs by implementing the code that parses the reward from the model's output (see the sketch below).

If this makes sense to you, I'll open a pull request for it and try to keep the code style as close to run_rm.py as possible!
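
A rough sketch of the parsing idea described above, assuming vLLM's `LLM`/`SamplingParams` API; the judge prompt and the regex-based score parser are purely illustrative and each generative RM would supply its own:

```python
# Sketch of a generative RM pass: generate CoT-style feedback with vLLM,
# then parse a numeric score out of the completion. The prompt format and
# parsing rule are placeholders, not any particular model's real format.
import re
from vllm import LLM, SamplingParams

llm = LLM(model="example-org/judge-model", tensor_parallel_size=2)  # placeholder checkpoint
params = SamplingParams(temperature=0.0, max_tokens=512)

def parse_score(completion: str) -> float:
    """Placeholder parser: expects the completion to end with a line like 'Score: 4'."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", completion)
    return float(match.group(1)) if match else float("nan")

def judge(prompt: str, response: str) -> float:
    judge_prompt = (
        "Evaluate the response to the instruction. Explain your reasoning, "
        f"then end with 'Score: <1-5>'.\n\nInstruction: {prompt}\n\nResponse: {response}"
    )
    output = llm.generate([judge_prompt], params)[0].outputs[0].text
    return parse_score(output)
```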

natolambert commented 6 months ago

@SeungoneKim generative RMs (via API) are being added in #86, but adding the full local-generation setup is another can of worms. I agree with your path; I just worry a bit about complexity. It's probably worth having, though.

The API implementation should be closer to what you want to build off of.
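
For comparison, API-based pairwise judging boils down to something like the following (OpenAI client shown; the prompt wording and letter parsing are simplified illustrations, not the exact code in #86):

```python
# Sketch of API-based pairwise judging: ask the model which of two responses
# is better and record whether it picked the chosen one.
from openai import OpenAI

client = OpenAI()

def prefers_chosen(prompt: str, chosen: str, rejected: str) -> bool:
    messages = [{
        "role": "user",
        "content": (
            "Which response answers the instruction better? Reply with only 'A' or 'B'.\n\n"
            f"Instruction: {prompt}\n\nResponse A: {chosen}\n\nResponse B: {rejected}"
        ),
    }]
    reply = client.chat.completions.create(
        model="gpt-4-turbo", messages=messages, temperature=0
    )
    return reply.choices[0].message.content.strip().upper().startswith("A")
```

In practice one would also run the comparison with the A/B positions swapped to control for position bias.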

Here are preliminary results:

Claude results:
Haiku {'Chat': 0.9273743016759777, 'Chat Hard': 0.5197368421052632, 'Safety': 0.8210275184275184, 'Reasoning': 0.7060194658154636}
Sonnet {'Chat': 0.9343575418994413, 'Chat Hard': 0.5657894736842105, 'Safety': 0.8367826605826606, 'Reasoning': 0.6907005374583948}
Opus {'Chat': 0.946927374301676, 'Chat Hard': 0.6030701754385965, 'Safety': 0.8905447525447526, 'Reasoning': 0.7868223795492989}
(reminder) OpenAI results:
GPT-3.5 {'Chat': 0.9217877094972067, 'Chat Hard': 0.4451754385964912, 'Safety': 0.6229577395577396, 'Reasoning': 0.5912315163420091}
GPT-4 Turbo {'Chat': 0.952513966480447, 'Chat Hard': 0.743421052631579, 'Safety': 0.8719219375219376, 'Reasoning': 0.8692366453865881}