OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
GNU General Public License v3.0

Problems with the sci_fi evaluation #11

Closed sheryc closed 8 months ago

sheryc commented 8 months ago

Hi, thank you for the great work! I've found some problems with the sci_fi dataset.

  1. For the loyalty and factuality tests, the instruction contains the question number (e.g., "2.\tThere exists a planet in Sirius star system that has gravity similar to Earth and signs of life but without a breathable atmosphere, True or False? Answer this question based on the world described in the document."). The "2.\t" prefix is sometimes "Question2:" instead, in which case the prompt ends up containing two "Question"s. Should the question number be removed from the prompt? (A possible cleanup is sketched after this list.)
  2. For the factuality test, I found that the output is always a large portion of text instead of just True or False. It seems the input prompt for the factuality test doesn't contain the "Please directly give answer without any additonal output or explanation" part. It would be more suitable if the prompts for the loyalty and factuality tests were unified.
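To illustrate the first issue, here is a minimal sketch of how the question-number prefix could be stripped before the instruction is inserted into the prompt. The helper name and regex are my own, not part of the L-Eval codebase:

```python
import re

# Hypothetical helper (not in the L-Eval codebase): strip a leading question
# number such as "2.\t" or "Question2:" from a sci_fi instruction so the
# surrounding prompt template does not end up with two "Question"s.
def strip_question_number(instruction: str) -> str:
    return re.sub(r"^\s*(?:Question\s*\d+\s*:|\d+\.)\s*", "", instruction)

# Both prefix variants reduce to the same cleaned instruction.
strip_question_number("2.\tThere exists a planet in Sirius star system ..., True or False?")
strip_question_number("Question2: There exists a planet in Sirius star system ..., True or False?")
```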

Also, there's a typo in the prompt "Please directly give answer without any additonal output or explanation." in https://github.com/OpenLMLab/LEval/blob/745aa8c5f0e3ef37010f9d7634f20ec30f017c01/Baselines/llama2-chat-test.py#L121. additonal -> additional.

ChenxinAn-fdu commented 8 months ago

Thanks so much for your valuable suggestions!!! I will fix this bug as soon as possible.

ChenxinAn-fdu commented 8 months ago

I have updated the code to fix the typo and unify the loyalty and factuality tests! Thank you again!
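For reference, a minimal sketch of what a unified prompt could look like; the names and template below are illustrative, not the actual ones used in Baselines/llama2-chat-test.py:

```python
# Illustrative only; the real template lives in Baselines/llama2-chat-test.py.
ANSWER_ONLY = "Please directly give answer without any additional output or explanation."

def build_sci_fi_prompt(document: str, question: str) -> str:
    # Same template for both the loyalty and the factuality questions,
    # so the model answers with just "True" or "False".
    return f"{document}\n\nQuestion: {question} {ANSWER_ONLY}\nAnswer:"
```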

As for the first issue, changing the prompt would affect the results, and we are currently unable to re-run these baselines and update the results in the paper, so I will keep the prompt unchanged in this repo. In my experience, this does not have a significant impact on the results. If it does in your case, you can also fix it yourself and re-run the baselines to compare with your model.