Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

Compare jcommonsense qa prompts with question first vs last #113

Open kumapo opened 1 year ago

kumapo commented 1 year ago

As reported in this article, JCommonsenseQA prompts that put the question last result in better performance. As the results in the table below show, I reproduced the performance jump by changing only the position of the question in the prompt; a sketch of the two orderings follows.
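For illustration, here is a minimal sketch of the two orderings. The field labels and template strings are hypothetical and only show the structural difference; they are not the exact templates shipped with any prompt version.

```python
# Illustrative JCommonsenseQA-style prompt construction (placeholder text,
# not the actual templates used by the harness's prompt versions).
question = "(JCommonsenseQA question text)"
choices = ["(choice 0)", "(choice 1)", "(choice 2)", "(choice 3)", "(choice 4)"]
choice_block = "\n".join(f"- {c}" for c in choices)

# Question first: the question precedes the answer choices.
prompt_question_first = f"質問: {question}\n選択肢:\n{choice_block}\n回答:"

# Question last: the choices come first and the question sits immediately
# before the answer slot, so it is the last thing the model reads.
prompt_question_last = f"選択肢:\n{choice_block}\n質問: {question}\n回答:"
```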

Currently, however, the 0.3 and 0.6 prompt versions put the question last, while the others put it first. For a fair model comparison, all prompts should place the question in the same position.

What do you think about adding prompt versions that put the question last, or updating the current prompts to do so? If I missed anything in my experiments, please let me know.

| Model | Acc, question first (prompt ver.) | Acc, question last (prompt ver.) |
| --- | --- | --- |
| japanese-stablelm-base-alpha-7b | 0.5728 (v0.2.1) | 0.7954 (v0.2.2) |
| open-calm-3b | 0.3128 (v0.2.1) | 0.7453 (v0.2.2) |
| ELYZA-japanese-Llama-2-7b | 0.7516 (v0.2.1) | 0.7730 (v0.2.2) |
| llama2-7b-chat | 0.5952 (v0.3.2) | 0.5559 (v0.3) |
| japanese-stablelm-instruct-alpha-7b | 0.5898 (v0.3.2) | 0.8222 (v0.3) |
| rinna-japanese-gpt-neox-3.6b-instruction-ppo | 0.4406 (v0.4) | 0.5934 (v0.4.2) |
| rinna-bilingual-gpt-neox-4b-instruction-ppo | 0.4879 (v0.5) | 0.5237 (v0.5.2) |
| llama2-7b-chat | 0.6667 (v0.6.2) | 0.613 (v0.6) |
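For reference, a hedged sketch of how one might score the same model under two prompt versions through the harness's Python API. The task-name strings and few-shot count below are assumptions based on the `<task>-<dataset ver>-<prompt ver>` naming convention; check the actual task list before running.

```python
from lm_eval import evaluator

# Assumed task names; replace with entries from the harness's task registry.
for task in ["jcommonsenseqa-1.1-0.2.1", "jcommonsenseqa-1.1-0.2.2"]:
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=stabilityai/japanese-stablelm-base-alpha-7b",
        tasks=[task],
        num_fewshot=3,
    )
    # Print the per-task metrics (accuracy) for this prompt version.
    print(task, results["results"])
```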