It depends on the GPU model, batch size, framework, software versions, and model size.
Under normal conditions (A100, the vLLM framework with batch size 1, a 7B model), full-load throughput should rank int4 > bf16 > fp32.
On a GPU that supports bf16, it is abnormal for bf16 throughput to be lower than fp32; and an int4 model without the proper kernels (for example, because they were not installed correctly) will have very low throughput.
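As a rough way to check that ordering yourself, here is a minimal single-request throughput sketch against vLLM's Python API. The model name, prompt, and max_tokens are illustrative placeholders; rerun it with `dtype="float32"`, or point `model` at an Int4 checkpoint, to compare:

```python
# Minimal single-request (batch size 1) throughput check with vLLM.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat", dtype="bfloat16", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Briefly explain bf16 versus fp32."], params)
elapsed = time.perf_counter() - start

# Count only generated tokens, not the prompt.
generated = sum(len(c.token_ids) for c in outputs[0].outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```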
Without vLLM, running the official Qwen inference code directly on a single A100-80G, my tests show that, counterintuitively, higher precision gives higher throughput, and generation with flash_attn enabled is slower than without it. Is something wrong here?
Please try the provided Docker image to rule out environment issues. fp32 being faster than bf16 is abnormal.
I get these warnings:

```
Your device does NOT support faster inference with fp16, please switch to fp32 which is likely to be faster
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
```

How do I run Qwen inference in fp32?
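For the fp32 question: with stock transformers you can force fp32 by passing `torch_dtype=torch.float32` when loading. A minimal sketch follows; the model name is an example, and the Qwen remote code also appears to accept its own bf16/fp16/fp32 keyword flags, but `torch_dtype` works without them. The flash_attn warnings only mean the optional rotary and layer_norm extensions from the linked repository were not installed.

```python
# Sketch: force fp32 inference for Qwen via stock transformers.
# Model name is an example; adjust to your local checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float32,  # keep weights and compute in fp32
    trust_remote_code=True,
).eval().cuda()

inputs = tokenizer("Hello, please introduce yourself.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```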
My tests show that, counterintuitively, fp32 is the fastest and int4 the slowest. Is that really the case?
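One way to verify that report is to time the same greedy generation once per dtype on the same GPU. A rough sketch, with the model name and token counts as placeholders; the warm-up run keeps CUDA startup cost out of the measurement:

```python
# Rough per-dtype throughput comparison for the same prompt and settings.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "Qwen/Qwen-7B-Chat"  # placeholder; use your local checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME, trust_remote_code=True)
prompt = tokenizer("Write a short paragraph about GPUs.", return_tensors="pt")

for dtype in (torch.float32, torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained(
        NAME, torch_dtype=dtype, trust_remote_code=True
    ).eval().cuda()
    inputs = {k: v.cuda() for k, v in prompt.items()}
    model.generate(**inputs, max_new_tokens=8, do_sample=False)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{dtype}: {new_tokens / (time.perf_counter() - start):.1f} tokens/s")
    del model
    torch.cuda.empty_cache()
```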