[Get WB Score on each domain] How to get WB Score on Info Seek/Creative/Code & Debug etc

allenai / WildBench

Benchmarking LLMs with Challenging Tasks from Real Users

https://huggingface.co/spaces/allenai/WildBench

Apache License 2.0

Issue #18

Closed: ludybupt closed this issue 1 month ago

ludybupt commented 1 month ago

When running

bash evaluation/run_all_eval_batch.sh model_pretty_name
python src/openai_batch_eval/check_batch_status_with_model_name.py model_pretty_name
bash leaderboard/show_eval.sh

the WB_Elo score is printed, but the WB Score for the individual domains (Info Seek, Creative, Code & Debug, etc.) is not. We would like to evaluate the model privately. Is this free, or does it require payment? If payment is required, how do we pay?

ludybupt commented 1 month ago

When running the scripts locally:

bash evaluation/run_all_eval_batch.sh model_pretty_name
python src/openai_batch_eval/check_batch_status_with_model_name.py model_pretty_name
bash leaderboard/show_eval.sh

the WB_Elo score is printed, but the WB Score for Info Seek/Creative/Code & Debug is not shown the way it is on the leaderboard (https://huggingface.co/spaces/allenai/WildBench). Is this free, or does it require payment? If payment is required, how do we pay?

ludybupt commented 1 month ago

Solved, thanks.