Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

Compare jcommonsense qa prompts with question first vs last #113

Open kumapo opened 1 year ago

kumapo commented 1 year ago

As reported in this article, JCommonsenseQA prompts that put the question last result in better performance. As the results in the table below show, I reproduced the performance jump by changing only the position of the question in the prompt; a sketch of the two orderings follows.
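For illustration, here is a minimal sketch of the two orderings. The field labels and template strings are hypothetical and only show the structural difference; they are not the exact templates shipped with any prompt version.

```python
# Illustrative JCommonsenseQA-style prompt construction (placeholder text,
# not the actual templates used by the harness's prompt versions).
question = "(JCommonsenseQA question text)"
choices = ["(choice 0)", "(choice 1)", "(choice 2)", "(choice 3)", "(choice 4)"]
choice_block = "\n".join(f"- {c}" for c in choices)

# Question first: the question precedes the answer choices.
prompt_question_first = f"質問: {question}\n選択肢:\n{choice_block}\n回答:"

# Question last: the choices come first and the question sits immediately
# before the answer slot, so it is the last thing the model reads.
prompt_question_last = f"選択肢:\n{choice_block}\n質問: {question}\n回答:"
```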

Currently, however, the 0.3 and 0.6 prompt versions put the question last, while the others put it first. For a fair model comparison, all prompts should place the question in the same position.

What do you think about adding prompt versions that put the question last, or updating the current prompts to do so? If I missed anything in my experiments, please let me know.

| Model | Acc, question first (prompt ver.) | Acc, question last (prompt ver.) |
| --- | --- | --- |
| japanese-stablelm-base-alpha-7b | 0.5728 (v0.2.1) | 0.7954 (v0.2.2) |
| open-calm-3b | 0.3128 (v0.2.1) | 0.7453 (v0.2.2) |
| ELYZA-japanese-Llama-2-7b | 0.7516 (v0.2.1) | 0.7730 (v0.2.2) |
| llama2-7b-chat | 0.5952 (v0.3.2) | 0.5559 (v0.3) |
| japanese-stablelm-instruct-alpha-7b | 0.5898 (v0.3.2) | 0.8222 (v0.3) |
| rinna-japanese-gpt-neox-3.6b-instruction-ppo | 0.4406 (v0.4) | 0.5934 (v0.4.2) |
| rinna-bilingual-gpt-neox-4b-instruction-ppo | 0.4879 (v0.5) | 0.5237 (v0.5.2) |
| llama2-7b-chat | 0.6667 (v0.6.2) | 0.613 (v0.6) |
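For reference, a hedged sketch of how one might score the same model under two prompt versions through the harness's Python API. The task-name strings and few-shot count below are assumptions based on the `<task>-<dataset ver>-<prompt ver>` naming convention; check the actual task list before running.

```python
from lm_eval import evaluator

# Assumed task names; replace with entries from the harness's task registry.
for task in ["jcommonsenseqa-1.1-0.2.1", "jcommonsenseqa-1.1-0.2.2"]:
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=stabilityai/japanese-stablelm-base-alpha-7b",
        tasks=[task],
        num_fewshot=3,
    )
    # Print the per-task metrics (accuracy) for this prompt version.
    print(task, results["results"])
```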