THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License

Llama2-7B-chat-4k test results don't match the leaderboard #66

Closed: PengWenChen closed this issue 2 months ago

PengWenChen commented 3 months ago

Reopening issue #55. Hi @bys0318 (@slatter666), I tried running Llama2-7B-chat-4k (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and my results also differ from the scores on your leaderboard, by quite a large margin. My execution environment has no internet access, so the only difference from your pred.py is that I load the data, model, and tokenizer from local paths. Could you tell me why the scores differ so much? Thanks.

My scores (using the original seed 42 from pred.py): { "narrativeqa": 14.57, "qasper": 6.6, "multifieldqa_en": 3.65, "multifieldqa_zh": 4.29, "hotpotqa": 4.27, "2wikimqa": 5.67, "musique": 1.3, "dureader": 15.71, "gov_report": 24.53, "qmsum": 16.13, "multi_news": 2.41, "vcsum": 0.03, "trec": 68.0, "triviaqa": 88.59, "samsum": 41.38, "lsht": 19.75, "passage_count": 0.5, "passage_retrieval_en": 3.0, "passage_retrieval_zh": 0.0, "lcc": 66.64, "repobench-p": 60.06 }

Results copied from the GitHub leaderboard: { "narrativeqa": 18.7, "qasper": 19.2, "multifieldqa_en": 36.8, "multifieldqa_zh": 11.9, "hotpotqa": 25.4, "2wikimqa": 32.8, "musique": 9.4, "dureader": 5.2, "gov_report": 27.3, "qmsum": 20.8, "multi_news": 25.8, "vcsum": 0.2, "trec": 61.5, "triviaqa": 77.8, "samsum": 40.7, "lsht": 19.8, "passage_count": 2.1, "passage_retrieval_en": 9.8, "passage_retrieval_zh": 0.5, "lcc": 52.4, "repobench-p": 43.8 }

My data was downloaded from the link provided in your README: https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip. I load the data with data = [json.loads(line) for line in open(path, "r", encoding="utf-8")], and the tokenizer and model with tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) and model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).
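For reference, here is a self-contained sketch of the offline loading described above. The local paths and the seed helper are illustrative assumptions; the actual seeding in pred.py may differ in its details.

```python
import json
import random

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def seed_everything(seed: int) -> None:
    # Fix all RNGs so repeated runs produce comparable scores.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(42)  # same seed as the default in pred.py

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "/local/models/Llama-2-7b-chat-hf"  # hypothetical local path
data_path = "/local/data/narrativeqa.jsonl"      # hypothetical local path

# Each LongBench file is JSON Lines: one sample per line.
with open(data_path, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device)
model.eval()
```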

BlackieMia commented 3 months ago

I reproduced similar results to yours as well.

PengWenChen commented 2 months ago

Sorry, I just found out that I had accidentally used the Llama2-7B base model instead of the Llama2-7B-chat model. The scores I get with the chat version are: { "narrativeqa": 18.82, "qasper": 23.65, "multifieldqa_en": 36.52, "multifieldqa_zh": 10.59, "hotpotqa": 26.4, "2wikimqa": 31.85, "musique": 7.76, "dureader": 5.2, "gov_report": 26.56, "qmsum": 21.28, "multi_news": 26.3, "vcsum": 0.18, "trec": 65.0, "triviaqa": 83.17, "samsum": 41.0, "lsht": 18.75, "passage_count": 1.57, "passage_retrieval_en": 7.5, "passage_retrieval_zh": 9.5, "lcc": 59.04, "repobench-p": 52.91 } I think these are pretty close to the numbers on the leaderboard.
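To make "pretty close" concrete, a small sketch like the following prints the per-task gap between the chat-model run above and the leaderboard numbers quoted earlier. The dictionaries are copies of the scores posted in this thread (only a few tasks shown; the rest can be added the same way).

```python
# Per-task difference between a local run and the leaderboard.
my_scores = {
    "narrativeqa": 18.82,
    "qasper": 23.65,
    "multifieldqa_en": 36.52,
    "hotpotqa": 26.4,
    "triviaqa": 83.17,
}
leaderboard = {
    "narrativeqa": 18.7,
    "qasper": 19.2,
    "multifieldqa_en": 36.8,
    "hotpotqa": 25.4,
    "triviaqa": 77.8,
}

for task in my_scores:
    diff = my_scores[task] - leaderboard[task]
    print(f"{task:18s} mine={my_scores[task]:6.2f} "
          f"leaderboard={leaderboard[task]:6.2f} diff={diff:+6.2f}")

# Mean absolute gap over the listed tasks.
avg_gap = sum(abs(my_scores[t] - leaderboard[t]) for t in my_scores) / len(my_scores)
print(f"mean absolute gap: {avg_gap:.2f}")
```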