OpenLMLab / LEval

[ACL'24 Oral] Data and code for L-Eval, a comprehensive evaluation benchmark for long-context language models
GNU General Public License v3.0

Update README.md #5

Closed · tonysy closed this 11 months ago

tonysy commented 11 months ago

Hi, thanks for the great work! This is the OpenCompass team. We have supported L-Eval in OpenCompass.

We believe this benchmark will significantly help the exploration of long-context capabilities in LLMs.

ChenxinAn-fdu commented 11 months ago

Woww, that is cool!!

But after checking L-Eval in OpenCompass, I found that some crucial features of L-Eval seem to have been omitted:

For closed-ended tasks:

* Coursera usually has multiple correct options, but the code in OpenCompass only considers the first capital letter in the predicted answer (see the sketch below).
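To make this concrete, here is a minimal sketch (the helper names are hypothetical, not the actual L-Eval or OpenCompass code) contrasting a first-capital heuristic with extracting the full set of predicted options:

```python
import re

def first_capital(prediction: str) -> set:
    """Heuristic that keeps only the first option-like capital letter."""
    match = re.search(r"[A-D]", prediction)
    return {match.group()} if match else set()

def all_options(prediction: str) -> set:
    """Collect every standalone option letter (A-D) in the output."""
    return set(re.findall(r"\b[A-D]\b", prediction))

gold = set("BD")                    # two correct options, as in many Coursera questions
pred = "The correct choices are B and D."

print(first_capital(pred) == gold)  # False: only "B" is kept, so a correct answer is marked wrong
print(all_options(pred) == gold)    # True: exact match over the full predicted option set
```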

Ensuring fair comparison for open-ended tasks:

Evaluating open-ended tasks for LLMs is somewhat harder. As we all know, n-gram metrics such as ROUGE and F1 are seriously biased: LLMs usually generate the answer with a long CoT, so if the ground truth is only one word, the F1 score drops sharply even though the answer is correct (a quick worked example follows this list).

* Since LLMs are tested in a zero-shot setting, unlike the previous supervised setting, they usually cannot fit the target length distribution. Comparing generated content at different granularities is hard and unfair, so we suggest adding a length instruction like `we need a {len(ground truth)} summary` to reduce the length bias.

* In L-Eval, we mainly depend on GPT-4 evaluation for open-ended tasks. Since we cannot feed the long input to the evaluator, the evaluation mainly depends on the reference answer. We have also put some effort into prompt design based on experiments.
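Here is the worked example, using a standard SQuAD-style token-level F1 (the helper below is illustrative and may differ from the exact scoring code in L-Eval):

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 (lowercased, punctuation stripped)."""
    def tokenize(text: str) -> list:
        return re.findall(r"[a-z0-9]+", text.lower())
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Paris"
concise_answer = "Paris"
cot_answer = ("Let us think step by step. The lecture mainly discusses the capital "
              "of France. Based on that context, the final answer is Paris.")

print(token_f1(concise_answer, reference))         # 1.0
print(round(token_f1(cot_answer, reference), 2))   # 0.08: the answer is correct but heavily penalized
```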

Thank you again for your great work! I am glad to merge this PR if these features are added to OpenCompass~

philipwangOvO commented 11 months ago


Hi, thanks for the great work.

We have recently refactored L-Eval in OpenCompass, and a length instruction and GPT evaluation are now supported for open-ended tasks.
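Roughly, the two pieces look like the sketch below (the prompt wording and helper names are illustrative only, not the exact code in OpenCompass):

```python
def add_length_instruction(question: str, reference: str) -> str:
    """Append a length hint so zero-shot models roughly match the target length."""
    target_len = len(reference.split())
    return f"{question}\nWe need a {target_len}-word answer."

# The judge never sees the long input document, so the reference answer
# carries the ground truth for the GPT-based evaluation.
JUDGE_TEMPLATE = """You are judging an answer to a question about a long document.
Compare the candidate answer against the reference answer only.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Give a score from 1 to 5 and a one-sentence justification."""

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the template; the result is sent to GPT-4 (or another strong LLM) as the judge."""
    return JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
```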

For closed-ended tasks like Coursera, the code in OpenCompass now considers not only the first capital letter but also any additional options following it in the predicted answer.

Thank you again for your great work and valuable comments on our code! Please let us know if you have further questions~

ChenxinAn-fdu commented 11 months ago

Thank you so much for considering our advice! We are preparing a major update for L-Eval, including two new datasets annotated from scratch. I have added the content about OpenCompass from this PR to the new README file.