THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License

Evaluation mechanism update #82

Open cizhenshi opened 3 weeks ago

cizhenshi commented 3 weeks ago

Currently, many evaluations of long-context models reference LongBench results. However, n-gram-based metrics do not truly reflect response quality. Many papers have instead adopted GPT-4o as a judge for scoring. Could you provide an official version of the GPT-4o scoring code, so that 4o-based scores are standardized across evaluations and the results become more comparable?
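The requested judge could be sketched roughly as below. This is a hypothetical illustration, not the official LongBench protocol: the prompt wording, the 1-10 scale, and the use of `gpt-4o` at temperature 0 are all assumptions, and the network call is only shown, not executed here.

```python
# Sketch of an LLM-as-a-judge scorer for LongBench-style QA outputs.
# Hypothetical: prompt template, score scale, and model name are assumptions.
import re

JUDGE_TEMPLATE = (
    "You are evaluating an answer to a long-context question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Rate the model answer from 1 (wrong) to 10 (fully correct).\n"
    "Reply with only the number."
)

def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Fill the judging template with one example's fields."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )

def parse_score(judge_reply: str, lo: int = 1, hi: int = 10):
    """Extract the first integer in the judge's reply, clamped to [lo, hi].

    Returns None if the reply contains no number, so callers can
    retry or discard malformed judgments instead of miscounting them.
    """
    m = re.search(r"\d+", judge_reply)
    if m is None:
        return None
    return max(lo, min(hi, int(m.group())))

def judge_with_gpt4o(client, question: str, reference: str, prediction: str):
    """One judging call via an OpenAI-SDK-style client (requires network)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging for reproducibility
        messages=[
            {
                "role": "user",
                "content": build_judge_prompt(question, reference, prediction),
            }
        ],
    )
    return parse_score(resp.choices[0].message.content)
```

Fixing the template and parsing in one shared script is what would make scores comparable across papers; averaging `parse_score` over a dataset then yields the benchmark-level judge score.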

bys0318 commented 3 weeks ago

Great suggestion! I will update the code to support LLM-as-a-judge evaluation in the next few days.