EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Add task variants replicating Llama 1 / 2 evaluation numbers #1078

Open · haileyschoelkopf opened 9 months ago

In some cases, the Llama 1 paper's results (and Llama 2's, whose eval setups sometimes differ) are not reproducible with our implementations because Meta used custom, undisclosed prompts or prepended task descriptions.

However, for some tasks, like TriviaQA, we have successfully found matching setups / reverse-engineered the prompts. Where we have done this, we should add documentation and variants of the tasks for ease of use (see the sketch below).
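For context, tasks in this harness are defined via YAML configs, so a Llama-replication variant could simply be a new config that prepends the recovered task description via the `description` field. Below is a minimal sketch; the variant name `triviaqa_llama`, the description string, and the prompt template are illustrative assumptions, not the actual reverse-engineered setup:

```yaml
# Hypothetical TriviaQA variant config for lm-evaluation-harness
# (v0.4-style YAML). Values below are illustrative, not Meta's prompt.
task: triviaqa_llama            # hypothetical variant name
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
# `description` is prepended to the prompt; a variant would put the
# reverse-engineered task description here (exact wording assumed).
description: "Answer these questions:\n\n"
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```

Once such a config is registered, it could be run like any other task, e.g. `lm_eval --model hf --model_args pretrained=<model> --tasks triviaqa_llama` (assuming the v0.4 CLI).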

afcruzs commented 6 months ago

@haileyschoelkopf Is this documented anywhere yet? Even informally (e.g., a branch, a Google doc, etc.).