OpenBMB / InfiniteBench

Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
MIT License

Mismatch for longbook_qa_eng #21

Open xuandif-cmu opened 3 weeks ago

xuandif-cmu commented 3 weeks ago

Are the GPT-4 results evaluated on a different version of longbook_qa_eng? The 'ground_truth' fields in results/gpt4/preds_longbook_qa_eng.jsonl don't seem to match the 'ground_truth' fields in results/chatglm3/preds_longbook_qa_eng.jsonl.

tuantuanzhang commented 3 weeks ago

We have revised the En.QA task, so those two models were evaluated on different versions of the task.