InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Docs] Does the w8a8-triton implementation in lmdeploy have benchmark results for the inference speedup on real LLMs (e.g. llama2, qwen2)? #2567

Open brisker opened 1 month ago

brisker commented 1 month ago

📚 The doc issue

Does the w8a8-triton implementation in lmdeploy have any benchmark results showing an actual inference-speed improvement on real LLMs (e.g. llama2, qwen2)?


grimoire commented 1 month ago

A100 llama3-8b tp1

fp16

--------------------------------------------------
concurrency: 256
elapsed_time: 152.330s

first token latency(s)(min, max, ave): 0.099, 4.162, 0.664
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.034, 0.231, 0.41]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4022.103 token/s
token throughput (prompt + completion token): 8464.964 token/s
RPS (request per second): 19.694 req/s
RPM (request per minute): 1181.649 req/min
--------------------------------------------------

w8a8

--------------------------------------------------
concurrency: 256
elapsed_time: 138.981s

first token latency(s)(min, max, ave): 0.045, 4.061, 0.625
per-token latency(s) percentile(50, 75, 95, 99): [0.036, 0.039, 0.161, 0.273]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4408.406 token/s
token throughput (prompt + completion token): 9277.984 token/s
RPS (request per second): 21.586 req/s
RPM (request per minute): 1295.140 req/min
--------------------------------------------------
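For reference, the summary lines above appear to be simple ratios over the wall-clock elapsed time. Below is a minimal sketch of that bookkeeping (my own reconstruction, not lmdeploy's profiler code; the RequestRecord fields are assumptions):

```python
# A hedged reconstruction of how the summary metrics above relate to raw
# per-request records. Not lmdeploy's benchmark script; field names and the
# "ratio over total elapsed time" assumption are mine.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    prompt_tokens: int
    completion_tokens: int
    first_token_latency: float  # seconds from request start to first token

def summarize(records: list[RequestRecord], elapsed_time: float) -> dict:
    prompt = sum(r.prompt_tokens for r in records)
    completion = sum(r.completion_tokens for r in records)
    ttfts = [r.first_token_latency for r in records]
    return {
        "number of prompt tokens": prompt,
        "number of completion tokens": completion,
        "token throughput (completion token)": completion / elapsed_time,
        "token throughput (prompt + completion token)": (prompt + completion) / elapsed_time,
        "RPS (request per second)": len(records) / elapsed_time,
        "RPM (request per minute)": len(records) / elapsed_time * 60,
        "first token latency (min, max, ave)": (min(ttfts), max(ttfts), sum(ttfts) / len(ttfts)),
    }
```

Plugging in the fp16 run (612685 completion tokens, roughly 3000 finished requests, 152.330 s elapsed) reproduces about 4022 token/s and 19.7 req/s, consistent with the figures above.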
brisker commented 1 month ago

@grimoire Strangely, when I benchmark your kernel code directly (https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/pytorch/kernels/cuda/w8a8_triton_kernels.py), w8a8 actually comes out far slower than fp16. I'm using an A100-80G GPU with Triton 2.2.0. Is this Triton version itself just slow?

forward:
          M  int8_dynamic_triton_op  float_torch
0       1.0                0.032768     0.040960
1      16.0                0.032768     0.040960
2      32.0                0.032768     0.043008
3      64.0                0.033792     0.041984
4     128.0                0.033792     0.047104
5     256.0                0.038912     0.061440
6    1024.0                0.101376     0.161792
7    2048.0                0.164864     0.279552
8    3072.0                0.256000     0.436224
9    4096.0                0.321536     0.586752
10   5120.0                0.386048     0.739328
11   6144.0                0.476160     0.908288
12   7168.0                0.545792     1.058816
13   8192.0                0.619520     1.189888
14   9216.0                0.707584     1.354752
15  10240.0                0.783360     1.512448
16  11264.0                0.870400     1.658880
17  12288.0                0.943104     1.793024
18  13312.0                1.020928     1.966080
19  14336.0                1.107968     2.117632
20  15360.0                1.180672     2.269184
21  16384.0                1.252352     2.408448
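For comparison, here is a hedged sketch of an equivalent timing loop. It is not the script that produced the table above; it only shows how triton.testing.do_bench can time an int8 dynamic-quant matmul path against a plain fp16 torch.mm baseline. The int8 implementation is passed in as a callable because the exact entry points of w8a8_triton_kernels.py are not restated here, and the K/N sizes are placeholders.

```python
# Minimal microbenchmark sketch, assuming CUDA + Triton are available.
import torch
import triton

def bench_pair(int8_matmul_fn, M: int, K: int = 4096, N: int = 4096):
    """Return (int8_ms, fp16_ms) for one problem size.

    int8_matmul_fn(a, b) should quantize its fp16 inputs and run the w8a8
    kernel under test (e.g. the ops defined in w8a8_triton_kernels.py).
    """
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)

    # do_bench handles warmup/repeat and reports time in milliseconds.
    fp16_ms = triton.testing.do_bench(lambda: torch.mm(a, b))
    int8_ms = triton.testing.do_bench(lambda: int8_matmul_fn(a, b))
    return int8_ms, fp16_ms

if __name__ == "__main__":
    # Replace this stub with the real quant + w8a8 matmul path before timing.
    def int8_stub(a, b):
        return torch.mm(a, b)

    for m in (1, 256, 4096, 16384):
        int8_ms, fp16_ms = bench_pair(int8_stub, m)
        print(f"M={m:6d}  int8={int8_ms:.6f} ms  fp16={fp16_ms:.6f} ms")
```

Note that do_bench flushes the L2 cache between iterations, so very small M mostly measures launch overhead; whether the table above was produced under the same conditions would matter for a direct comparison.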