Stability-AI / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

Add prompt version `0.2.1` for JCommonsenseQA #104

Closed · mkshing closed this 1 year ago

mkshing commented 1 year ago

Background

In principle, "base" models (trained purely for language modeling, without a specific prompt format) should be evaluated with prompt version 0.2. However, it was reported that 0.3 outperformed 0.2, which is odd.

So we compared 0.2 and 0.3 on several models (thank you @mrorii!), and found that 0.3 increased scores for all base models on both JCommonsenseQA and JNLI.

Summary

JCommonsenseQA is a question-answering task with 5 answer choices. In 0.2, the prompt looks like the example below (reference link):

質問と回答の選択肢を入力として受け取り、選択肢から回答を選択してください。なお、回答は選択肢の番号(例:0)でするものとします。

質問:街のことは?
選択肢:0.タウン,1.劇場,2.ホーム,3.ハウス,4.ニューヨークシティ
回答:

(English gloss: "Given a question and answer choices as input, select the answer from the choices. The answer should be given as the number of the choice (e.g., 0). Question: What is a 街 (town) called? Choices: 0. town, 1. theater, 2. home, 3. house, 4. New York City. Answer:")

The prompt encourages the model to answer with the choice "index" rather than the text itself, but the gold targets are actually the choice texts. So I assume models were confused by this mismatch. (code)
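To make the mismatch concrete, here is a minimal, illustrative sketch (not the harness's actual implementation) of how a 0.2-style prompt could be assembled and what string ends up being scored. The field names `question`, `choice0`–`choice4`, and `label` follow the JGLUE JCommonsenseQA schema; the helper functions themselves are hypothetical.

```python
# Illustrative sketch only: build a 0.2-style JCommonsenseQA prompt and show
# the index/text mismatch in the scored target.

def build_prompt_v02(doc: dict) -> str:
    """Assemble a 0.2-style prompt from a JCommonsenseQA example.

    `doc` is assumed to use the JGLUE field names:
    question, choice0 ... choice4, label.
    """
    instruction = (
        "質問と回答の選択肢を入力として受け取り、選択肢から回答を選択してください。"
        "なお、回答は選択肢の番号(例:0)でするものとします。"
    )
    choices = [doc[f"choice{i}"] for i in range(5)]
    numbered = ",".join(f"{i}.{c}" for i, c in enumerate(choices))
    return f"{instruction}\n\n質問:{doc['question']}\n選択肢:{numbered}\n回答:"


def gold_target_v02(doc: dict) -> str:
    # The mismatch: the instruction asks for the *number* of the choice,
    # but the string that is actually scored is the choice *text*.
    return doc[f"choice{doc['label']}"]


example = {
    "question": "街のことは?",
    "choice0": "タウン", "choice1": "劇場", "choice2": "ホーム",
    "choice3": "ハウス", "choice4": "ニューヨークシティ",
    "label": 0,
}
print(build_prompt_v02(example))
print("scored continuation:", gold_target_v02(example))  # "タウン", not "0"
```

The instruction asks for a number, yet the continuation whose likelihood is compared across candidates is the choice text, which is exactly the gap described above.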

Solution

| Model | # of shots | Prompt version | acc |
| --- | --- | --- | --- |
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.2 | 31.64 |
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.3 | 38.96 |
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.2.1 (NEW!) | 45.49 |
| matsuo-lab/weblab-10b | 0 | 0.2 | 23.32 |
| matsuo-lab/weblab-10b | 0 | 0.3 | 42.27 |
| matsuo-lab/weblab-10b | 0 | 0.2.1 (NEW!) | 25.47 |
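For context, here is a sketch of how one of the rows above might be reproduced through the harness's Python API. The task identifier `jcommonsenseqa-1.1-0.2.1` is an assumption based on the fork's `<task>-<dataset version>-<prompt version>` naming convention and should be checked against the task registry; model arguments may also differ.

```python
# Sketch only: evaluate one model/prompt-version combination from the table.
# The task name assumes the "<task>-<dataset version>-<prompt version>"
# convention (e.g. jcommonsenseqa-1.1-0.2.1); verify the exact identifier.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=elyza/ELYZA-japanese-Llama-2-7b",
    tasks=["jcommonsenseqa-1.1-0.2.1"],
    num_fewshot=0,
    device="cuda",
)
print(results["results"])
```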
mkshing commented 1 year ago

Although 0.2.1 did not outperform 0.3 on every model, we confirmed that 0.2.1 is clearly better than 0.2, at the very least, for base models. So I will merge this PR.