Open xuandif-cmu opened 3 months ago
Are the GPT4 results evaluated on a different set of longbook_qa_eng? The 'ground_truth' fields in results/gpt4/preds_longbook_qa_eng.jsonl don't seem to match the ground_truth fields in results/chatglm3/preds_longbook_qa_eng.jsonl.
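For reference, a minimal sketch of the comparison that surfaces the mismatch, assuming each line of the two JSONL files is a JSON object with a 'ground_truth' field and that the files are aligned by line order:

```python
import json

def load_ground_truths(path):
    """Read the 'ground_truth' field from each line of a JSONL predictions file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["ground_truth"] for line in f if line.strip()]

gpt4 = load_ground_truths("results/gpt4/preds_longbook_qa_eng.jsonl")
chatglm3 = load_ground_truths("results/chatglm3/preds_longbook_qa_eng.jsonl")

# Count positions where the reference answers differ between the two runs.
mismatches = sum(a != b for a, b in zip(gpt4, chatglm3))
print(f"{mismatches} of {min(len(gpt4), len(chatglm3))} ground_truth entries differ")
```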
We have revised the En.QA task, and those two models were evaluated on different versions of the task.