WisdomShell / codeshell

A series of code large language models developed by PKU-KCL
http://se.pku.edu.cn/kcl

TGI-based model server cannot accelerate inference on V100 #45

Open BruceMa29 opened 10 months ago

BruceMa29 commented 10 months ago

@ruixie Hello. Since the V100 cannot use FlashAttention, does that mean there is no way to accelerate inference for the model server on a V100 either?

ruixie commented 10 months ago

Hello. A large part of TGI's inference acceleration depends on FlashAttention, so the speedup on a V100 will indeed be affected.

BruceMa29 commented 10 months ago

@ruixie OK. Since FlashAttention won't run on the V100, there is no way to get that acceleration. Throughput with native transformers in load testing is very low.
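
For readers who want to check their own hardware before deploying, here is a minimal sketch of the compatibility check discussed in this thread. It assumes the FlashAttention-2 kernels need compute capability 8.0 (Ampere) or newer, while the V100 is Volta at 7.0. The threshold constant and both helper names are illustrative and are not part of TGI or CodeShell; the actual requirement depends on the flash-attn version installed.

```python
# Assumed minimum compute capability for FlashAttention-2 kernels (Ampere, sm_80).
# This threshold is an assumption; verify it against the flash-attn version in use.
FLASH_ATTN2_MIN_CAPABILITY = (8, 0)


def supports_flash_attention2(capability):
    """Return True if a (major, minor) compute capability meets the assumed minimum."""
    return tuple(capability) >= FLASH_ATTN2_MIN_CAPABILITY


def detect_capability(device_index=0):
    """Return the (major, minor) capability of a CUDA device, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    elif supports_flash_attention2(cap):
        print(f"Compute capability {cap}: FlashAttention-2 should be usable.")
    else:
        print(f"Compute capability {cap}: expect a fallback attention path.")
```

On a V100 the check reports capability (7, 0), which falls below the assumed threshold and matches the behaviour described above.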