OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
GNU General Public License v3.0

questions on table 2 #3

Closed · freshbirdDD closed this 12 months ago

freshbirdDD commented 1 year ago

How can llama-7b-2k get only 3.63% accuracy on the Coursera task? Coursera is a multiple-choice task, so random guessing alone should already give about 25% accuracy.

ChenxinAn-fdu commented 1 year ago

Hi!! Thank you for your question. Because Coursera questions can have multiple correct answers (e.g., A, B, and C may all be correct), we cannot simply take the first capital letter in the generated text as the final answer. The Llama-7b-2k model has not been fine-tuned on instructions, so it may struggle to follow instructions such as "select multiple correct options", and its generated results tend to score worse than randomly guessing a single option. For other tasks with a single correct option, we can take the first capital letter, and the results are better. We will explain this in the next version. Thank you so much.
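For illustration, here is a minimal sketch of the two scoring styles described above. This is not the repository's actual evaluation code; the function names, the A-D option range, and the exact-set matching rule for the multi-answer case are assumptions made for the example.

```python
# Illustrative sketch (not L-Eval's evaluation code) of why a model that
# ignores the "select multiple correct options" instruction can score below
# the 25% random-guess baseline on a multi-answer task, while first-capital
# matching still works for single-answer tasks.
import re

def first_capital(pred: str) -> str:
    """Single-answer scoring: take the first option letter A-D found in the output."""
    m = re.search(r"[A-D]", pred)
    return m.group(0) if m else ""

def option_set(pred: str) -> set:
    """Multi-answer scoring (assumed here): the predicted option set must match exactly."""
    return set(re.findall(r"[A-D]", pred))

# Single-answer case: first-capital matching is forgiving of verbose output.
assert first_capital("The answer is B because ...") == "B"

# Multi-answer case: gold is {A, B, C}; a model that only mentions one option
# (or adds extra ones) gets zero credit under exact-set matching, so accuracy
# can fall well below the 25% single-guess baseline.
gold = {"A", "B", "C"}
pred = option_set("I think the answer is A.")  # -> {"A"}
print(pred == gold)  # False: no partial credit
```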

freshbirdDD commented 1 year ago

@ChenxinAn-fdu Thank you for your answer. Another question: since you said "the Llama-7b-2k model hasn't been fine-tuned on instructions", does that mean L-Eval is not suitable for "raw" models that haven't been fine-tuned on instructions?

ChenxinAn-fdu commented 1 year ago

Based on our experiments, raw models (like Llama and Llama2) significantly lag behind their instruction-fine-tuned (IFT) versions (Vicuna, Llama2-chat). My suggestion is to further fine-tune your model on ShareGPT or Alpaca after pretraining 🤔.
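As a rough illustration of that suggestion, the sketch below formats Alpaca-style records into instruction-following training text before supervised fine-tuning. It assumes the Hugging Face `datasets` library and the community `tatsu-lab/alpaca` dataset, and uses the standard Alpaca prompt template; none of this is prescribed by L-Eval.

```python
# Illustrative sketch (not part of L-Eval): turn Alpaca records into
# instruction-following training text for supervised fine-tuning.
# Assumes the Hugging Face `datasets` library and the `tatsu-lab/alpaca` dataset.
from datasets import load_dataset

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "{input_block}### Response:\n{output}"
)

def format_example(example):
    # Alpaca records have `instruction`, an optional `input`, and `output` fields.
    input_block = f"### Input:\n{example['input']}\n\n" if example["input"] else ""
    return {
        "text": ALPACA_TEMPLATE.format(
            instruction=example["instruction"],
            input_block=input_block,
            output=example["output"],
        )
    }

ds = load_dataset("tatsu-lab/alpaca", split="train")
ds = ds.map(format_example)
print(ds[0]["text"][:200])  # feed the `text` column to your SFT trainer of choice
```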