Background
In principle, "base" models (trained with plain language modeling and no specific prompt format) should be evaluated with prompt version 0.2. However, we received reports that 0.3 outperformed 0.2, which is weird.
So, we compared 0.2 and 0.3 for some models (thank you @mrorii !), and found that using 0.3 increased the scores of all base models on JCommonsenseQA and JNLI.
Summary
JCommonsenseQA is a question answering task with 5 choices. In 0.2, the prompt looks like the one below. (reference link)
The prompt encourages the model to answer with the "index" of a choice rather than the choice text itself, but the targets are actually the texts. So, I assume this gap confused the models somehow. (code)
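To make the gap concrete, here is a minimal illustrative sketch in Python of a 0.2-style prompt next to what is actually scored. The wording, the example item, and the variable names are my own illustration, not the actual template or code from the repository.

```python
# Illustrative sketch only: not the harness's actual template or code.
# A JCommonsenseQA-style item: a question with 5 candidate answers.
doc = {
    "question": "電子機器で使用される最も主要な電子回路基板の事をなんと言う？",
    "choices": ["掲示板", "パソコン", "マザーボード", "ハードディスク", "まな板"],
}

# Rough shape of the 0.2 prompt: the instruction asks for the *number* of a choice.
prompt_v02 = (
    "質問と選択肢を入力として受け取り、選択肢の番号（例：0）で回答してください。\n\n"
    f"質問：{doc['question']}\n"
    "選択肢：" + "，".join(f"{i}.{c}" for i, c in enumerate(doc["choices"])) + "\n"
    "回答："
)

# What the evaluation actually scores are the choice *texts* as continuations:
#   prediction = argmax_i  logP(choices[i] | prompt_v02)
# A base model that follows the instruction puts probability on digits ("0"-"4"),
# which is exactly the gap described above.
targets = doc["choices"]
print(prompt_v02)
```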
Solution
Introduced a new prompt version 0.2.1 for base models, which outperformed both 0.2 and 0.3. The comparison was run with `hf-causal-experimental` and the `jcommonsenseqa-1.1-{prompt version}` tasks.
Although 0.2.1 did not outperform 0.3 on every score, we confirmed that 0.2.1 is way better than 0.2 and 0.3, at least for base models. So, I will merge this PR.
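For completeness, a minimal sketch of the direction 0.2.1 takes, assuming the instruction is aligned with the text targets (that is my reading of the fix, not the exact 0.2.1 wording; the helper name below is hypothetical).

```python
# Illustrative sketch only: an assumed shape of a 0.2.1-style prompt, in which the
# instruction asks for the answer text so that it matches the scored targets.
def build_prompt_v021(question: str, choices: list[str]) -> str:
    """Hypothetical helper: list the choices and ask for the answer text itself."""
    return (
        "質問と選択肢を入力として受け取り、選択肢の中から回答をそのまま出力してください。\n\n"
        f"質問：{question}\n"
        "選択肢：" + "，".join(choices) + "\n"
        "回答："
    )

# With this framing, the gold continuation (one of the choice texts) is exactly
# what the instruction asks the model to produce, so base models are no longer
# pulled toward answering with an index.
```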