QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

Qwen performance testing across fp32, bf16, and int4 precisions #1017

Closed MuyeMikeZhang closed 6 months ago

MuyeMikeZhang commented 8 months ago

In my tests, fp32 was actually the fastest and int4 the slowest. Is that expected?

jklj077 commented 8 months ago

It depends on the GPU model, batch size, inference framework, software versions, and model size.

Under normal conditions, on an A100 with the vLLM framework (batch size 1) and a 7B model, throughput at full load should rank int4 > bf16 > fp32.

On a GPU that supports bf16, getting lower throughput with bf16 than with fp32 is abnormal. For the int4 model, throughput will be very low if the proper kernels are missing (for example, because of an incorrect installation).
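For reference, here is a minimal throughput-comparison sketch using the standard Hugging Face transformers API (not the exact benchmark discussed in this thread); the checkpoint name Qwen/Qwen-7B-Chat, the prompt, and the token counts are illustrative assumptions:

```python
# Hedged sketch: compare generation throughput of a Qwen checkpoint at two precisions.
# Model name, prompt, and token counts are illustrative assumptions, not the
# benchmark used in this issue.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

def benchmark(torch_dtype, new_tokens=256):
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        torch_dtype=torch_dtype,
        device_map="auto",
        trust_remote_code=True,
    ).eval()
    inputs = tokenizer("Introduce yourself.", return_tensors="pt").to(model.device)
    # Warm-up run so one-time kernel/setup costs do not skew the timing.
    model.generate(**inputs, max_new_tokens=8, do_sample=False)
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{torch_dtype}: {generated / elapsed:.1f} tokens/s")
    del model
    torch.cuda.empty_cache()

benchmark(torch.bfloat16)  # expected to be faster on an A100
benchmark(torch.float32)   # expected to be slower; if not, suspect the environment
```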

MuyeMikeZhang commented 8 months ago

Without vLLM, running the official Qwen inference code directly on a single A100-80G, my tests showed that, contrary to expectation, higher precision gave higher throughput (fp32 was fastest), and that enabling flash_attn actually made inference slower than running without it. Is something wrong here?
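A common cause of flash_attn slowing things down is a partial installation in which the optional rotary and layer_norm CUDA extensions are missing, so fast attention gets mixed with slow fallbacks. Below is a minimal import-check sketch; the module paths are assumptions based on the flash-attention repository layout referenced in the warnings later in this thread:

```python
# Hedged sketch: check which flash-attention components actually import.
# The module paths are assumptions based on the flash-attention repo
# (csrc/rotary and csrc/layer_norm build optional extensions); adjust
# them for your installed version.
import importlib

for name in (
    "flash_attn",                 # core attention kernels
    "flash_attn.layers.rotary",   # fused rotary embedding (csrc/rotary)
    "flash_attn.ops.rms_norm",    # fused RMSNorm (csrc/layer_norm)
):
    try:
        importlib.import_module(name)
        print(f"OK   {name}")
    except ImportError as exc:
        print(f"FAIL {name}: {exc}")
```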

jklj077 commented 6 months ago

Please try the provided Docker image to rule out environment issues. It is abnormal for fp32 to be faster than bf16.

xiaotukuaipao12318 commented 5 months ago

Your device does NOT support faster inference with fp16, please switch to fp32 which is likely to be faster
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention

Could anyone tell me how to run Qwen inference in fp32?
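For the fp32 question, a minimal sketch using the generic transformers route (forcing torch.float32) is shown below; the checkpoint name is an illustrative assumption, and note that fp32 needs roughly twice the GPU memory of bf16/fp16:

```python
# Hedged sketch: load and run a Qwen chat checkpoint in full fp32 precision
# via the standard transformers API. The checkpoint name is an illustrative
# assumption; fp32 weights for a 7B model alone take roughly 28 GB of memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float32,   # force full precision instead of bf16/fp16
    device_map="auto",
    trust_remote_code=True,
).eval()

# The Qwen chat models expose a chat() helper via the remote code.
response, _ = model.chat(tokenizer, "你好", history=None)
print(response)
```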