THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

MMLU benchmark cannot be reproduced with the current code #217

Open chunniunai220ml opened 1 week ago

chunniunai220ml commented 1 week ago

System Info / 系統信息

vLLM Version: 0.5.0.post1

[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pytorch-triton-rocm==2.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.2.1+cu118
[pip3] torchvision==0.18.0
[pip3] transformers==4.40.0
[pip3] triton==2.3.0

[conda] intel-extension-for-pytorch 2.2.0 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.19.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pytorch-triton-rocm 2.2.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu118 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi

Python 3.10, Tesla V100

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

[image: screenshot of the evaluation run, attached in the original issue]

Following the latest code, OpenAI's https://github.com/openai/simple-evals, and https://github.com/THUDM/GLM-4/blob/main/basic_demo/README.md, I only get MMLU = 45.4, far below the reported 72.4 for GLM-4-9B-Chat. Due to hardware limits I ran on a single GPU with model_dtype=fp16.

Testing on an A100 with 4 GPUs: MMLU = 45.7, model_dtype=bf16.
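Not my exact script, but for context the vLLM call path I am trying to reproduce boils down to something like the sketch below (offline API used here for brevity; the model path, dtype, max_model_len and sampling values are placeholders):

```python
# Simplified sketch of the vLLM generation path (placeholders, not the exact script).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "THUDM/glm-4-9b-chat"  # assumed model path

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
llm = LLM(
    model=MODEL,
    trust_remote_code=True,
    dtype="float16",      # fp16 on V100; "bfloat16" on A100
    max_model_len=8192,
)

def answer(query: str) -> str:
    # Build the chat prompt exactly as the chat template defines it.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": query}],
        add_generation_prompt=True,
        tokenize=False,
    )
    params = SamplingParams(temperature=0.6, top_p=0.8, top_k=1, max_tokens=1024)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```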

Expected behavior / 期待表现

Please share the evaluation details needed to reproduce the reported accuracy, or provide the reproduction code directly.

zRzRzRzRzRzRzR commented 1 week ago

Our tests were run with a bf16 deployment, evaluated with simple-evals, and no extra system prompt was added.

Could you try loading the complete model in BF16 on a single GPU and setting a longer sequence length? From the screenshot it looks like the answers were cut off before finishing.

chunniunai220ml commented 1 week ago

A100, BF16, seqlen=8192: MMLU = 68.24. The text format produced by the server-side process_message differs from what HF produces; does that matter, and which one should be taken as the reference?

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

zRzRzRzRzRzRzR commented 1 week ago

Use the HF result as the reference. The server does not tokenize because the text is passed on to vLLM. For the HF way of calling the model, see the trans_cli_demo file.
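For reference, a minimal HF-style sketch in the spirit of trans_cli_demo (not the demo code itself; the model path and generation values are illustrative):

```python
# Minimal transformers sketch (illustrative; see trans_cli_demo in the repo for the real script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "THUDM/glm-4-9b-chat"  # assumed model path
query = "..."                  # one MMLU question as formatted by simple-evals

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
).eval()

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Keep only the newly generated tokens, then decode.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```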

chunniunai220ml commented 1 week ago

params_dict = {
    "n": 1,
    "best_of": 1,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": -1,
    "repetition_penalty": repetition_penalty,
    "use_beam_search": False,
    "length_penalty": 1,
    "early_stopping": False,
    "stop_token_ids": [151329, 151336, 151338],
    "ignore_eos": False,
    "max_tokens": max_new_tokens,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

Should all of these server-side parameters be left unchanged to reproduce the score? For the client request I followed request.py; is there anything in the client-side parameters I need to pay attention to?
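For context, my reading of these settings as a sketch: they line up one-to-one with vLLM's sampling options, so presumably the server just unpacks the dict (vLLM 0.5.x assumed; the request-dependent values below are placeholders):

```python
# Sketch only: the server-side params_dict appears to map directly onto
# vLLM's SamplingParams (vLLM 0.5.x assumed; request-dependent values are placeholders).
from vllm import SamplingParams

sampling_params = SamplingParams(
    n=1,
    best_of=1,
    presence_penalty=1.0,
    frequency_penalty=0.0,
    temperature=0.6,         # placeholder: taken from the client request
    top_p=0.8,               # placeholder: taken from the client request
    top_k=-1,
    repetition_penalty=1.0,  # placeholder: taken from the client request
    use_beam_search=False,
    length_penalty=1,
    early_stopping=False,
    stop_token_ids=[151329, 151336, 151338],
    ignore_eos=False,
    max_tokens=2500,         # placeholder: taken from the client request
    logprobs=None,
    prompt_logprobs=None,
    skip_special_tokens=True,
)
```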

zRzRzRzRzRzRzR commented 1 week ago

repetition_penalty is 1; the others should be fine as they are. However, our server uses vLLM, while the benchmark scores we report ourselves were run with transformers.

chunniunai220ml commented 1 week ago

So you mean the reported benchmark numbers were produced with transformers, and I should follow the trans_cli_demo file. Do you also have vLLM test results? I would like to see whether I can align with them. With repetition_penalty=1 and seqlen=8192 I got MMLU = 0.7124, which is fairly close to your reported benchmark. When printing process_message I noticed that tools adds extra information; does that affect the benchmark? I copied request.py, where tools is:

self.tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the users location.",
                    },
                },
                "required": ["location", "format"],
            },
        }
    },
]

Also, what is an appropriate value for "top_p": top_p? server.py uses "top_k": -1, while the HF example uses gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}.

zRzRzRzRzRzRzR commented 1 week ago

top_k is 1 and top_p is 0.8. Do not put anything into tools; it affects how the prompt is constructed.

chunniunai220ml commented 1 week ago

{
    'n': 1,
    'best_of': 1,
    'presence_penalty': 1.0,
    'frequency_penalty': 0.0,
    'temperature': 0.6,
    'top_p': 0.8,
    'top_k': 1,
    'repetition_penalty': 1.0,
    'use_beam_search': False,
    'length_penalty': 1,
    'early_stopping': False,
    'stop_token_ids': [151329, 151336, 151338],
    'ignore_eos': False,
    'max_tokens': 2500,
    'logprobs': None,
    'prompt_logprobs': None,
    'skip_special_tokens': True,
}

With this configuration, based on your information, I got MMLU = 0.7232, close to the reported 72.4. Is a gap of this size normal, or is there anything else I could still adjust? tools is now [].

zRzRzRzRzRzRzR commented 1 week ago

That accuracy is acceptable. As long as tools carries no content, the prompts we construct never involve function calling, so it will not affect the score.

chunniunai220ml commented 1 week ago

One more question: OpenAI's simple-evals only takes 2500 samples from the mmlu.csv it provides. Is that also what your reported evaluation uses? My results now differ between test runs.

zRzRzRzRzRzRzR commented 4 days ago

Do you mean this? https://github.com/openai/simple-evals/blob/294cb1fb18f7aed4e21dc567350b0761a9e6f699/mmlu_eval.py

simple-evals comes with its own prompt, and we did not add any other prompt text. How large is the variance in your scores?

chunniunai220ml commented 4 days ago

https://github.com/openai/simple-evals/blob/294cb1fb18f7aed4e21dc567350b0761a9e6f699/mmlu_eval.py

Yes. Do you mean this line?

prompt_messages = [
    sampler._pack_message(content=format_multichoice_question(row), role="user")
]

I checked, and the input does not have any extra prompt added:

def _pack_message(self, role: str, content: Any):
    return {"role": str(role), "content": content}
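For completeness, my understanding of the sampler side as a sketch: the packed user message is forwarded as-is to the local OpenAI-compatible server, with no system prompt and no tools (the endpoint, model name and sampling values are placeholders, and simple-evals' actual SamplerBase interface may differ):

```python
# Sketch of a minimal simple-evals-style sampler talking to a local
# OpenAI-compatible GLM-4 server. Endpoint, model name and sampling values
# are placeholders; simple-evals' actual SamplerBase interface may differ.
from typing import Any
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

class LocalChatSampler:
    def _pack_message(self, role: str, content: Any):
        return {"role": str(role), "content": content}

    def __call__(self, message_list):
        # Only the packed user message is sent: no system prompt, no tools.
        resp = client.chat.completions.create(
            model="glm-4",
            messages=message_list,
            temperature=0.6,
            top_p=0.8,
            max_tokens=2500,
        )
        return resp.choices[0].message.content
```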