allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Add markdown / LaTeX tables #10

Closed. ljvmiranda921 closed this 8 months ago

ljvmiranda921 commented 8 months ago

How to use

# Prints the tables in markdown by default; use --render_latex to print LaTeX instead
python -m analysis.get_benchmark_results
# Save outputs as a CSV file in a given directory
python -m analysis.get_benchmark_results --output_dir test/
HERM - Overview
\begin{tabular}{lllllllllllllllllllll}
\toprule
model & average & alpacaeval-easy & alpacaeval-hard & alpacaeval-length & hep-cpp & hep-go & hep-java & hep-js & hep-python & hep-rust & llmbar-adver-GPTInst & llmbar-adver-GPTOut & llmbar-adver-manual & llmbar-adver-neighbor & llmbar-natural & mt-bench-easy & mt-bench-hard & mt-bench-med & refusals-dangerous & refusals-offensive \\
\midrule
openbmb/UltraRM-13b & 0.72 & 0.99 & 0.98 & 0.98 & 0.81 & 0.85 & 0.9 & 0.85 & 0.87 & 0.87 & 0.47 & 0.53 & 0.41 & 0.47 & 0.81 & 0.96 & 0.92 & 0.92 & 0.06 & 0.11 \\
berkeley-nest/Starling-RM-7B-alpha & 0.72 & 0.94 & 1.0 & 0.73 & 0.75 & 0.81 & 0.85 & 0.8 & 0.77 & 0.79 & 0.34 & 0.45 & 0.37 & 0.34 & 0.73 & 0.89 & 0.81 & 0.82 & 0.57 & 0.84 \\
stabilityai/stablelm-zephyr-3b & 0.64 & 0.81 & 0.9 & 0.98 &  &  &  &  &  &  & 0.21 & 0.4 & 0.48 & 0.76 & 0.73 & 0.93 & 0.73 & 0.85 & 0.13 & 0.46 \\
llm-blender/PairRM-hf & 0.63 & 0.91 & 0.96 & 0.69 & 0.61 & 0.71 & 0.65 & 0.6 & 0.63 & 0.66 & 0.3 & 0.62 & 0.5 & 0.43 & 0.78 & 0.93 & 0.76 & 0.88 & 0.15 & 0.11 \\
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 & 0.63 & 0.92 & 0.87 & 0.78 &  &  &  &  &  &  & 0.43 & 0.49 & 0.52 & 0.39 & 0.67 & 0.89 & 0.62 & 0.85 & 0.35 & 0.35 \\
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 & 0.6 & 0.97 & 0.95 & 0.94 &  &  &  &  &  &  & 0.23 & 0.4 & 0.26 & 0.24 & 0.69 & 0.86 & 0.62 & 0.85 & 0.13 & 0.72 \\
OpenAssistant/reward-model-deberta-v3-large-v2 & 0.6 & 0.96 & 0.68 & 0.99 & 0.66 & 0.69 & 0.45 & 0.97 & 0.5 & 0.62 & 0.1 & 0.0 & 0.0 & 0.29 & 0.88 & 1.0 & 0.43 & 1.0 & 0.54 & 0.68 \\
weqweasdas/hh_rlhf_rm_open_llama_3b & 0.54 & 0.78 & 0.89 & 0.71 &  &  &  &  &  &  & 0.28 & 0.45 & 0.24 & 0.39 & 0.69 & 0.82 & 0.57 & 0.78 & 0.22 & 0.21 \\
stanfordnlp/SteamSHP-flan-t5-xl & 0.5 & 0.91 & 0.95 & 0.69 & 0.46 & 0.46 & 0.46 & 0.55 & 0.43 & 0.5 & 0.25 & 0.38 & 0.39 & 0.22 & 0.65 & 0.79 & 0.68 & 0.62 & 0.01 & 0.01 \\
\bottomrule
\end{tabular}

HERM - Detailed
\begin{tabular}{lllllll}
\toprule
model & average & alpacaeval & mt-bench & llmbar & refusals & hep \\
\midrule
berkeley-nest/Starling-RM-7B-alpha & 0.74 & 0.89 & 0.84 & 0.45 & 0.7 & 0.8 \\
openbmb/UltraRM-13b & 0.68 & 0.98 & 0.93 & 0.54 & 0.08 & 0.86 \\
stabilityai/stablelm-zephyr-3b & 0.64 & 0.9 & 0.84 & 0.52 & 0.3 &  \\
OpenAssistant/reward-model-deberta-v3-large-v2 & 0.64 & 0.88 & 0.81 & 0.25 & 0.61 & 0.65 \\
OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 & 0.63 & 0.95 & 0.78 & 0.36 & 0.42 &  \\
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 & 0.62 & 0.86 & 0.79 & 0.5 & 0.35 &  \\
llm-blender/PairRM-hf & 0.6 & 0.85 & 0.86 & 0.53 & 0.13 & 0.64 \\
weqweasdas/hh_rlhf_rm_open_llama_3b & 0.54 & 0.79 & 0.72 & 0.41 & 0.22 &  \\
stanfordnlp/SteamSHP-flan-t5-xl & 0.48 & 0.85 & 0.7 & 0.38 & 0.01 & 0.48 \\
\bottomrule
\end{tabular}

Pref Sets - Overview
\begin{tabular}{lllllllllll}
\toprule
model & average & anthropic & anthropic_hhh & mtbench_gpt4 & mtbench_human & pku_better & pku_safer & shp & summarize & summarize_prompted \\
\midrule
llm-blender/PairRM-hf & 0.65 & 0.6 & 0.83 & 0.72 & 0.65 & 0.57 & 0.5 & 0.59 & 0.71 & 0.71 \\
OpenAssistant/reward-model-deberta-v3-large-v2 & 0.64 & 0.69 &  &  &  & 0.44 & 0.61 & 0.61 & 0.76 & 0.72 \\
stanfordnlp/SteamSHP-flan-t5-xl & 0.63 & 0.57 & 0.65 & 0.77 & 0.66 & 0.66 & 0.46 & 0.8 & 0.53 & 0.53 \\
OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 & 0.6 & 0.63 &  &  &  & 0.48 & 0.55 & 0.68 & 0.61 & 0.62 \\
\bottomrule
\end{tabular}
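For context, here is a minimal sketch of how this kind of markdown/LaTeX rendering can be wired up, assuming the benchmark results are collected into a pandas DataFrame and rendered with to_markdown / to_latex. The render_table helper, the dummy scores, and the herm-overview.csv filename below are illustrative assumptions, not the actual analysis.get_benchmark_results implementation.

import argparse
from pathlib import Path

import pandas as pd


def render_table(df: pd.DataFrame, render_latex: bool = False) -> str:
    """Render a results table as markdown (default) or a LaTeX tabular."""
    if render_latex:
        # LaTeX tabular output; depending on the pandas version this includes
        # booktabs rules (\toprule/\midrule/\bottomrule) like the tables above.
        return df.to_latex(index=False, na_rep="")
    return df.to_markdown(index=False)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--render_latex", action="store_true")
    parser.add_argument("--output_dir", type=Path, default=None)
    args = parser.parse_args()

    # Dummy scores standing in for the real per-subset benchmark results.
    df = pd.DataFrame(
        {
            "model": ["openbmb/UltraRM-13b", "llm-blender/PairRM-hf"],
            "average": [0.72, 0.63],
            "alpacaeval-easy": [0.99, 0.91],
        }
    )

    print(render_table(df, render_latex=args.render_latex))

    # Mirroring the --output_dir flag above: also save the raw scores as CSV.
    if args.output_dir is not None:
        args.output_dir.mkdir(parents=True, exist_ok=True)
        df.to_csv(args.output_dir / "herm-overview.csv", index=False)


if __name__ == "__main__":
    main()

Note that DataFrame.to_markdown requires the tabulate package to be installed.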
ljvmiranda921 commented 8 months ago

Merging this! Saw the table in Overleaf; let me clean that one up later today or tomorrow afternoon. Oof, I didn't see the comment. I'll update this PR to include instructions in the README.

natolambert commented 8 months ago

Yeah not a huge rush on the tables, it'll just make our life easier :)