It depends on the GPU model, batch size, framework, software versions, and model size.
Under normal conditions (A100, the vLLM framework with batch size 1, a 7B model), full-load throughput should rank int4 > bf16 > fp32.
On a GPU that supports bf16, it is abnormal for bf16 throughput to be lower than fp32; and an int4 model without the proper kernels (for example, because they were not installed correctly) will have very low throughput.
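As a rough way to check that ordering yourself, here is a minimal single-request throughput sketch against vLLM's Python API. The model name, prompt, and max_tokens are illustrative placeholders; rerun it with `dtype="float32"`, or point `model` at an Int4 checkpoint, to compare:

```python
# Minimal single-request (batch size 1) throughput check with vLLM.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat", dtype="bfloat16", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Briefly explain bf16 versus fp32."], params)
elapsed = time.perf_counter() - start

# Count only generated tokens, not the prompt.
generated = sum(len(c.token_ids) for c in outputs[0].outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```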
Without vLLM, running the official Qwen inference code directly on a single A100-80G, my tests show that, counterintuitively, higher precision gives higher throughput, and generation with flash_attn enabled is slower than without it. Is something wrong here?
Please try the provided Docker image to rule out environment issues. fp32 being faster than bf16 is abnormal.
I get these warnings:

```
Your device does NOT support faster inference with fp16, please switch to fp32 which is likely to be faster
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
```

How do I run Qwen inference in fp32?
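For the fp32 question: with stock transformers you can force fp32 by passing `torch_dtype=torch.float32` when loading. A minimal sketch follows; the model name is an example, and the Qwen remote code also appears to accept its own bf16/fp16/fp32 keyword flags, but `torch_dtype` works without them. The flash_attn warnings only mean the optional rotary and layer_norm extensions from the linked repository were not installed.

```python
# Sketch: force fp32 inference for Qwen via stock transformers.
# Model name is an example; adjust to your local checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float32,  # keep weights and compute in fp32
    trust_remote_code=True,
).eval().cuda()

inputs = tokenizer("Hello, please introduce yourself.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```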
My tests show that, counterintuitively, fp32 is the fastest and int4 the slowest. Is that really the case?
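One way to verify that report is to time the same greedy generation once per dtype on the same GPU. A rough sketch, with the model name and token counts as placeholders; the warm-up run keeps CUDA startup cost out of the measurement:

```python
# Rough per-dtype throughput comparison for the same prompt and settings.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "Qwen/Qwen-7B-Chat"  # placeholder; use your local checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME, trust_remote_code=True)
prompt = tokenizer("Write a short paragraph about GPUs.", return_tensors="pt")

for dtype in (torch.float32, torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained(
        NAME, torch_dtype=dtype, trust_remote_code=True
    ).eval().cuda()
    inputs = {k: v.cuda() for k, v in prompt.items()}
    model.generate(**inputs, max_new_tokens=8, do_sample=False)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{dtype}: {new_tokens / (time.perf_counter() - start):.1f} tokens/s")
    del model
    torch.cuda.empty_cache()
```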