brisker opened 1 month ago
A100 llama3-8b tp1
fp16
--------------------------------------------------
concurrency: 256
elapsed_time: 152.330s
first token latency(s)(min, max, ave): 0.099, 4.162, 0.664
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.034, 0.231, 0.41]
number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4022.103 token/s
token throughput (prompt + completion token): 8464.964 token/s
RPS (request per second): 19.694 req/s
RPM (request per minute): 1181.649 req/min
--------------------------------------------------
w8a8
--------------------------------------------------
concurrency: 256
elapsed_time: 138.981s
first token latency(s)(min, max, ave): 0.045, 4.061, 0.625
per-token latency(s) percentile(50, 75, 95, 99): [0.036, 0.039, 0.161, 0.273]
number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4408.406 token/s
token throughput (prompt + completion token): 9277.984 token/s
RPS (request per second): 21.586 req/s
RPM (request per minute): 1295.140 req/min
--------------------------------------------------
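For reference, the derived throughput and rate figures in these logs follow directly from the raw counts. The short Python sketch below reproduces them for the fp16 run; note that the request count of 3000 is not printed above, it is inferred from RPS × elapsed_time, so treat it as an assumption:

```python
# Reproduce the derived metrics of the fp16 run from its raw counts.
# num_requests is NOT printed in the log above; 3000 is inferred from
# RPS * elapsed_time and should be treated as an assumption.
elapsed_time = 152.330            # seconds
prompt_tokens = 676_779
completion_tokens = 612_685
num_requests = 3000               # assumed: 19.694 req/s * 152.330 s ≈ 3000

completion_tps = completion_tokens / elapsed_time                # ≈ 4022.1 token/s
total_tps = (prompt_tokens + completion_tokens) / elapsed_time   # ≈ 8464.96 token/s
rps = num_requests / elapsed_time                                # ≈ 19.694 req/s
rpm = rps * 60                                                   # ≈ 1181.6 req/min

print(f"completion token throughput: {completion_tps:.3f} token/s")
print(f"prompt + completion throughput: {total_tps:.3f} token/s")
print(f"RPS: {rps:.3f} req/s, RPM: {rpm:.3f} req/min")
```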
@grimoire Oddly, when I benchmark your kernel code directly (https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/pytorch/kernels/cuda/w8a8_triton_kernels.py), w8a8 comes out far slower than fp16. I am using an A100-80G GPU with Triton 2.2.0. Is this Triton version itself simply that slow?
forward:
| M | int8_dynamic_triton_op | float_torch |
|---|---|---|
| 1.0 | 0.032768 | 0.040960 |
| 16.0 | 0.032768 | 0.040960 |
| 32.0 | 0.032768 | 0.043008 |
| 64.0 | 0.033792 | 0.041984 |
| 128.0 | 0.033792 | 0.047104 |
| 256.0 | 0.038912 | 0.061440 |
| 1024.0 | 0.101376 | 0.161792 |
| 2048.0 | 0.164864 | 0.279552 |
| 3072.0 | 0.256000 | 0.436224 |
| 4096.0 | 0.321536 | 0.586752 |
| 5120.0 | 0.386048 | 0.739328 |
| 6144.0 | 0.476160 | 0.908288 |
| 7168.0 | 0.545792 | 1.058816 |
| 8192.0 | 0.619520 | 1.189888 |
| 9216.0 | 0.707584 | 1.354752 |
| 10240.0 | 0.783360 | 1.512448 |
| 11264.0 | 0.870400 | 1.658880 |
| 12288.0 | 0.943104 | 1.793024 |
| 13312.0 | 1.020928 | 1.966080 |
| 14336.0 | 1.107968 | 2.117632 |
| 15360.0 | 1.180672 | 2.269184 |
| 16384.0 | 1.252352 | 2.408448 |
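To sanity-check kernel-level numbers like these outside of lmdeploy, one can time an int8 GEMM against an fp16 GEMM with Triton's benchmarking helper. The sketch below is only an approximation of the comparison above: it uses torch._int_mm as a stand-in for lmdeploy's dynamic-quant matmul kernel (so it omits the per-token quantization step), and K = N = 4096 is an assumed hidden size, not something stated in this issue:

```python
# Hedged sketch: time a plain int8 GEMM against an fp16 GEMM over a similar sweep of M.
# This is NOT lmdeploy's w8a8 kernel (no per-token dynamic quantization here);
# torch._int_mm is used as a stand-in, and K = N = 4096 is an assumption.
import torch
import triton

K, N = 4096, 4096
# Skip very small M: some PyTorch versions reject M <= 16 in torch._int_mm.
Ms = [32, 64, 128, 256, 1024, 2048, 4096, 8192, 16384]

for M in Ms:
    a_fp16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b_fp16 = torch.randn(K, N, device="cuda", dtype=torch.float16)
    a_int8 = torch.randint(-127, 128, (M, K), device="cuda", dtype=torch.int8)
    b_int8 = torch.randint(-127, 128, (K, N), device="cuda", dtype=torch.int8)

    # triton.testing.do_bench reports the kernel runtime in milliseconds.
    t_fp16 = triton.testing.do_bench(lambda: a_fp16 @ b_fp16)
    t_int8 = triton.testing.do_bench(lambda: torch._int_mm(a_int8, b_int8))
    print(f"M={M:6d}  int8: {t_int8:.4f} ms  fp16: {t_fp16:.4f} ms")
```

On an A100 the int8 GEMM would normally be expected to come out faster at large M; if an end-to-end w8a8 run is still slower, the overhead may sit in the surrounding quantize/dequantize work or the Triton version rather than in the GEMM itself.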
📚 The doc issue
Does the w8a8 Triton implementation in lmdeploy have any benchmark results showing an actual inference speedup on real LLMs (e.g. llama2, qwen2)?
Suggest a potential alternative/fix
Does the w8a8 Triton implementation in lmdeploy have any benchmark results showing an actual inference speedup on real LLMs (e.g. llama2, qwen2)?