KwanWaiChung / M4LE

Code for M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
MIT License

Why does the dureader test set use the Acc metric instead of Rouge? #7

Closed (jarheadjoe closed 2 months ago)

jarheadjoe commented 10 months ago

When I evaluate dureader with the Acc metric, gpt-3.5-16k's score is very low in my own test. [image] This is inconsistent with the results in the appendix of your paper.

KwanWaiChung commented 2 months ago

Thank you for your suggestion. We have changed the dureader metric from Acc to rouge in the updated version.
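
For readers wondering why the two metrics can diverge so much on free-form answers, here is a minimal sketch. It is not the M4LE evaluation code: the function names `exact_match`, `lcs_length`, and `rouge_l_f1` are hypothetical, the Acc metric is assumed to behave like a strict string match, and ROUGE-L is computed at the character level on the assumption that this suits Chinese text such as dureader.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Assumed stand-in for Acc: 1.0 only if the strings match exactly."""
    return float(prediction.strip() == reference.strip())


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Character-level ROUGE-L F1 (illustrative, not the official scorer)."""
    pred, ref = list(prediction.strip()), list(reference.strip())
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the metric was changed to rouge"
    prediction = "the metric was changed to rouge in the update"
    # A longer but largely correct answer gets no credit from exact match,
    # while ROUGE-L still rewards the overlapping subsequence.
    print("exact match:", exact_match(prediction, reference))
    print("rouge-l f1 :", rouge_l_f1(prediction, reference))
```

Under these assumptions, any generated answer that differs from the reference by even one character scores zero on the strict metric, which may explain a very low Acc on a generative task, whereas an overlap-based score gives partial credit.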