NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Quantization] Long latency for generating first token #1565

Open youki-sada opened 1 month ago

youki-sada commented 1 month ago

Environment

Question

We measured the execution speed of generating the first token and of the tokens that follow it. Compared with fp16, the int8 and int4 first-token latency is about 25% longer. Is this due to the cost of casting the int4/int8 weights to fp16? It is slower than I expected.

Latency [ms/step]
Output token #    fp16      int8      int4
1st               1981.4    2687.7    2609.2
2nd~last          34.6      23.3      18.0
Avg. total        4922.4    4668.2    4139.2
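
A minimal sketch of how first-token and per-token latency can be separated with a streaming generator; `generate_stream` is a hypothetical callable standing in for whatever streaming generation entry point is used, not a specific TensorRT-LLM API:

```python
import time

def measure_latency(generate_stream, prompt, max_new_tokens=128):
    """Split generation into time-to-first-token (prefill) and average
    per-token decode latency. `generate_stream` is assumed to yield one
    token per iteration; it is a hypothetical placeholder here."""
    stamps = []
    start = time.perf_counter()
    for _token in generate_stream(prompt, max_new_tokens):
        stamps.append(time.perf_counter())
    ttft_ms = (stamps[0] - start) * 1e3
    decode_gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    per_token_ms = sum(decode_gaps) / len(decode_gaps) * 1e3
    return ttft_ms, per_token_ms
```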
byshiue commented 1 month ago

Yes, it is caused by casting. For the long-context (first-token) case, it is expected that int8/int4 will be slower than fp16.
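
A rough way to see why this is expected (illustrative numbers, not taken from this issue): in prefill a weight matrix is reused across every prompt token, so the GEMM is compute-bound and the int4-to-fp16 dequantization is pure extra work, whereas in decode each weight is read roughly once per generated token, so the GEMM is memory-bound and smaller weights directly cut DRAM traffic. A minimal roofline-style sketch, assuming a LLaMA-7B-like hidden size and generic GPU peaks:

```python
# Illustrative roofline estimate for one [d, d] projection GEMM.
# All shapes and hardware peaks are assumptions, not measurements.
d           = 4096      # hidden size (LLaMA-7B-like)
prompt_len  = 1024      # tokens processed together in prefill
peak_flops  = 300e12    # assumed fp16 tensor-core peak, FLOP/s
peak_bw     = 2e12      # assumed DRAM bandwidth, B/s

def gemm_time(tokens, bytes_per_weight):
    flops   = 2 * tokens * d * d                  # multiply-accumulates
    w_bytes = d * d * bytes_per_weight            # weight traffic only
    t_compute, t_memory = flops / peak_flops, w_bytes / peak_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound

for fmt, bpw in [("fp16", 2.0), ("int4", 0.5)]:
    t_pre, b_pre = gemm_time(prompt_len, bpw)
    t_dec, b_dec = gemm_time(1, bpw)
    print(f"{fmt}: prefill ~{t_pre*1e6:.0f} us ({b_pre}-bound), "
          f"decode ~{t_dec*1e6:.0f} us ({b_dec}-bound)")
```

In this toy model int4 and fp16 hit the same compute ceiling in prefill, so the only first-token difference left is the dequantization overhead, while int4 still wins in the memory-bound decode steps.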

youki-sada commented 1 month ago

@byshiue Thank you for your reply. When generating the 1st output token, the TC utilization (TENSO) of LLaMA with int4 weight quantization is lower than with fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.

1st generation (int4 weight)

$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.996        0.274        0.622        0.661
GPU 0     0.995        0.284        0.571        0.669
GPU 0     0.996        0.274        0.623        0.658
GPU 0     0.995        0.284        0.572        0.667
GPU 0     0.986        0.277        0.614        0.651

2nd token generation (int4 weight)

$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.912        0.600        0.175        0.775
GPU 0     0.910        0.601        0.174        0.775
GPU 0     0.912        0.603        0.172        0.776
GPU 0     0.916        0.607        0.173        0.777
GPU 0     0.916        0.607        0.171        0.777

1st token generation (fp16 weight)

#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.991        0.347        0.656        0.399
GPU 0     0.992        0.357        0.623        0.404
GPU 0     0.992        0.339        0.667        0.400
GPU 0     0.990        0.345        0.659        0.397
GPU 0     0.991        0.355        0.629        0.403

2nd token generation (fp16 weight)

#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.929        0.312        0.063        0.820
GPU 0     0.929        0.315        0.063        0.821
GPU 0     0.930        0.316        0.063        0.822
GPU 0     0.929        0.316        0.063        0.821
GPU 0     0.928        0.318        0.063        0.820
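
For readers decoding the field IDs above: 1002/1003/1004/1005 are DCGM's SM activity (SMACT), SM occupancy (SMOCC), tensor-pipe activity (TENSO), and DRAM activity (DRAMA) profiling metrics. A small helper for averaging such a captured log, assuming the column layout shown above:

```python
# Average the SMACT/SMOCC/TENSO/DRAMA columns from a captured
# `dcgmi dmon -e 1002,1003,1004,1005` log with the layout shown above.
def average_dcgm_log(text):
    cols = ("SMACT", "SMOCC", "TENSO", "DRAMA")
    sums, n = [0.0, 0.0, 0.0, 0.0], 0
    for line in text.splitlines():
        parts = line.split()
        # data rows look like: "GPU 0  0.996  0.274  0.622  0.661"
        if len(parts) == 6 and parts[0] == "GPU":
            sums = [s + float(v) for s, v in zip(sums, parts[2:])]
            n += 1
    return {c: s / n for c, s in zip(cols, sums)} if n else {}
```

Averaging the samples above gives roughly TENSO 0.60 / DRAMA 0.66 for the int4 first token versus TENSO 0.65 / DRAMA 0.40 for fp16, which is the gap being discussed.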
byshiue commented 1 month ago

When generating the 1st output token, the TC utilization (TENSO) of LLaMA with int4 weight quantization is lower than with fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.

I don't get the point about the "general CNN models". What do you mean by it?

Also, what do you mean by "utilize weight reuse"?

youki-sada commented 1 month ago

what do you mean by "utilize weight reuse"?

I meant that computational intensity is high during first-token inference. Thus, I assumed DRAMA for the int4 first-token inference would be reduced and TENSO would be around 65%, as with fp16. But there is only about a 10% reduction in memory bandwidth, and I don't think that can be explained by casting cost alone. Maybe I need to check the CUDA implementation for further discussion.
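
A rough sketch of why the 4x weight-size reduction may not show up 1:1 in DRAM traffic during prefill (assumed LLaMA-7B-like shapes; ignores KV cache, attention, and L2 reuse): the [tokens, d] activations moved around each GEMM are comparable in size to the weights themselves, so total traffic shrinks much less than 4x.

```python
# Rough per-projection DRAM traffic during prefill: the [d, d] weights are
# read once, but the [tokens, d] fp16 activations are also read and written.
# Assumed shapes; ignores KV cache, attention, and cache reuse.
d, tokens, act_bytes = 4096, 1024, 2
for fmt, w_bytes in [("fp16", 2.0), ("int4", 0.5)]:
    weights     = d * d * w_bytes             # one [d, d] projection matrix
    activations = 2 * tokens * d * act_bytes  # read input + write output
    total_mb = (weights + activations) / 1e6
    print(f"{fmt}: weights {weights/1e6:5.1f} MB + activations "
          f"{activations/1e6:5.1f} MB = {total_mb:5.1f} MB per projection")
```

On top of that, DRAMA is a utilization rate rather than a byte count, so a longer int4 prefill (due to dequantization) further blurs a direct comparison of the two runs.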

Void1024 commented 1 month ago

Thus, I assumed DRAMA for the int4 first-token inference would be reduced and TENSO would be around 65%, as with fp16. But there is only about a 10% reduction in memory bandwidth, and I don't think that can be explained by casting cost alone.

Hi, I'm a little confused. Do you mean that, compared with subsequent inference, the first token with w4a16 should show a memory-throughput drop larger than 10%? Or do you think the compute throughput of the first token is lower than 65%, which is not in line with your expectations?