QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Cannot reproduce Qwen1.5-7B base model's reported score 62.5 on gsm8k #1321

Open StevenLau6 opened 1 month ago

StevenLau6 commented 1 month ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

I used eval/evaluate_gsm8k.py to evaluate the Qwen1.5-7B base model downloaded from Hugging Face. The results show that Qwen1.5-7B base got Acc: 0.4457922668 on gsm8k, which is much lower than the reported score of 62.5 (https://huggingface.co/Qwen/Qwen2-7B). In contrast, Qwen1.5-1.8B base got Acc: 0.382865807, which is close to its reported score of 38.4 (https://huggingface.co/Qwen/Qwen2-1.5B).

Another strange thing is that the Qwen1.5-7B-Chat model got 60.3 on gsm8k (https://huggingface.co/Qwen/Qwen2-7B-Instruct), which is lower than the base model's reported score.

I would like to know whether there is a typo in the reported score, or whether the base model was fine-tuned on the gsm8k training set before being evaluated on the test set.
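For context, here is a minimal sketch of how gsm8k accuracy is usually scored: extract the final number from each model completion and compare it with the number after `####` in the gold answer. The exact extraction logic in eval/evaluate_gsm8k.py may differ, so treat this as an approximation of the scoring, not the script's implementation:

```python
import re

def extract_last_number(text):
    """Return the last number in text as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def gsm8k_gold(answer):
    """gsm8k gold answers end with '#### <number>'."""
    return float(answer.split("####")[-1].strip().replace(",", ""))

def gsm8k_accuracy(completions, gold_answers):
    correct = 0
    for completion, gold in zip(completions, gold_answers):
        pred = extract_last_number(completion)
        if pred is not None and pred == gsm8k_gold(gold):
            correct += 1
    return correct / len(gold_answers)
```

If the extraction regex or stop criterion differs between runs, the measured accuracy can shift by several points even with identical generations, which is worth ruling out before suspecting the checkpoint itself.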

期望行为 | Expected Behavior

Reproduce the Qwen1.5-7B base model's reported score of 62.5 on gsm8k.

复现方法 | Steps To Reproduce

I downloaded the gsm8k test set from https://github.com/openai/grade-school-math/tree/master/grade_school_math/data and verified that its content is the same as the Hugging Face parquet files at https://huggingface.co/datasets/openai/gsm8k/tree/main/main
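As a sanity check on that comparison, here is a sketch of one way to verify two copies of the test set agree on the fields that matter. The `question`/`answer` field names follow the gsm8k schema; whitespace is normalized before comparing, since trailing newlines differ between distributions:

```python
def normalize(example):
    # Only the question and answer fields affect evaluation.
    return (example["question"].strip(), example["answer"].strip())

def same_dataset(records_a, records_b):
    """True if both record lists hold the same (question, answer) pairs in order."""
    return (len(records_a) == len(records_b)
            and all(normalize(a) == normalize(b)
                    for a, b in zip(records_a, records_b)))
```

Comparing normalized fields rather than raw file bytes avoids false mismatches from line-ending or JSON key-ordering differences.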

The few-shot prompt (from https://github.com/QwenLM/Qwen/blob/main/eval/gsm8k_prompt.txt) is correctly added.
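For illustration, the full model input is typically the few-shot prompt followed by the test question in the same format, leaving the answer cue for the model to complete. The exact template below is an assumption and should be checked against gsm8k_prompt.txt and the script:

```python
def build_input(fewshot_prompt, question):
    # Append the test question after the few-shot examples, leaving the
    # "Answer:" cue for the model to complete (template assumed, not
    # taken from evaluate_gsm8k.py).
    return f"{fewshot_prompt}\n\nQuestion: {question}\nAnswer:"
```

A mismatch between the few-shot format and the test-question format (e.g. a missing blank line or a different cue word) is a common cause of large base-model score drops on gsm8k.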

I only modified these three lines:

- `sent = tokenizer.tokenizer.decode(tokens[raw_text_len:])` → `sent = tokenizer.decode(tokens[raw_text_len:])`
- `input_ids = tokenizer.tokenizer.encode(input_txt)` → `input_ids = tokenizer.encode(input_txt)`
- `dataset = load_from_disk(args.sample_input_file)` → `data_files = {'train': args.sample_input_file+'train.json', 'test': args.sample_input_file+'test.json'}` followed by `dataset = load_dataset('json', data_files=data_files)`
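One thing worth double-checking in the modified loading code: string concatenation like `args.sample_input_file+'train.json'` silently produces a wrong path whenever the directory argument lacks a trailing slash. A small sketch using `os.path.join` (function name here is hypothetical) avoids that failure mode:

```python
import os.path

def gsm8k_data_files(data_dir):
    """Build the data_files mapping for load_dataset('json', ...) robustly,
    regardless of whether data_dir ends with a slash."""
    return {
        "train": os.path.join(data_dir, "train.json"),
        "test": os.path.join(data_dir, "test.json"),
    }
```

If the concatenated path were wrong, `load_dataset` would fail loudly rather than score low, so this is unlikely to explain the gap, but it keeps the reproduction script portable.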

运行环境 | Environment

My environment: ubuntu 18.04
Tesla V100-SXM2-32GB * 8

python                    3.10.14
torch                     2.2.0
transformers              4.41.2
CUDA: 12.1

备注 | Anything else?

No response

TanateT commented 1 month ago

I also encountered the same problem. What do I need to modify?

github-actions[bot] commented 3 days ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread. 此问题由于长期未有新进展而被系统自动标记为不活跃。如果您认为它仍有待解决,请在此帖下方留言以补充信息。