NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Quantization] Long latency for generating first token #1565

Open youki-sada opened 1 month ago

youki-sada commented 1 month ago

Environment

Question

We measured the execution speed of generating the first token and of the tokens that follow it. Compared with fp16, the int8 and int4 first-token latency is about 25% longer. Is this due to the cost of casting the int4/int8 weights to fp16? It is slower than I expected.

Latency [ms/step]
Output token #    fp16      int8      int4
1st               1981.4    2687.7    2609.2
2nd~last          34.6      23.3      18.0
Avg. total        4922.4    4668.2    4139.2
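
A minimal sketch of how first-token and per-token latency can be separated with a streaming generator; `generate_stream` is a hypothetical callable standing in for whatever streaming generation entry point is used, not a specific TensorRT-LLM API:

```python
import time

def measure_latency(generate_stream, prompt, max_new_tokens=128):
    """Split generation into time-to-first-token (prefill) and average
    per-token decode latency. `generate_stream` is assumed to yield one
    token per iteration; it is a hypothetical placeholder here."""
    stamps = []
    start = time.perf_counter()
    for _token in generate_stream(prompt, max_new_tokens):
        stamps.append(time.perf_counter())
    ttft_ms = (stamps[0] - start) * 1e3
    decode_gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    per_token_ms = sum(decode_gaps) / len(decode_gaps) * 1e3
    return ttft_ms, per_token_ms
```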
byshiue commented 1 month ago

Yes, it is caused by casting. For the long-context (first-token) case, it is expected that int8/int4 will be slower than fp16.
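
A rough way to see why this is expected (illustrative numbers, not taken from this issue): in prefill a weight matrix is reused across every prompt token, so the GEMM is compute-bound and the int4-to-fp16 dequantization is pure extra work, whereas in decode each weight is read roughly once per generated token, so the GEMM is memory-bound and smaller weights directly cut DRAM traffic. A minimal roofline-style sketch, assuming a LLaMA-7B-like hidden size and generic GPU peaks:

```python
# Illustrative roofline estimate for one [d, d] projection GEMM.
# All shapes and hardware peaks are assumptions, not measurements.
d           = 4096      # hidden size (LLaMA-7B-like)
prompt_len  = 1024      # tokens processed together in prefill
peak_flops  = 300e12    # assumed fp16 tensor-core peak, FLOP/s
peak_bw     = 2e12      # assumed DRAM bandwidth, B/s

def gemm_time(tokens, bytes_per_weight):
    flops   = 2 * tokens * d * d                  # multiply-accumulates
    w_bytes = d * d * bytes_per_weight            # weight traffic only
    t_compute, t_memory = flops / peak_flops, w_bytes / peak_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound

for fmt, bpw in [("fp16", 2.0), ("int4", 0.5)]:
    t_pre, b_pre = gemm_time(prompt_len, bpw)
    t_dec, b_dec = gemm_time(1, bpw)
    print(f"{fmt}: prefill ~{t_pre*1e6:.0f} us ({b_pre}-bound), "
          f"decode ~{t_dec*1e6:.0f} us ({b_dec}-bound)")
```

In this toy model int4 and fp16 hit the same compute ceiling in prefill, so the only first-token difference left is the dequantization overhead, while int4 still wins in the memory-bound decode steps.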

youki-sada commented 1 month ago

@byshiue Thank you for your reply. When generating the 1st output token, the TC utilization (TENSO) of LLaMA with int4 weight quantization is lower than with fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.

1st generation (int4 weight)

$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.996        0.274        0.622        0.661
GPU 0     0.995        0.284        0.571        0.669
GPU 0     0.996        0.274        0.623        0.658
GPU 0     0.995        0.284        0.572        0.667
GPU 0     0.986        0.277        0.614        0.651

2nd token generation (int4 weight)

$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.912        0.600        0.175        0.775
GPU 0     0.910        0.601        0.174        0.775
GPU 0     0.912        0.603        0.172        0.776
GPU 0     0.916        0.607        0.173        0.777
GPU 0     0.916        0.607        0.171        0.777

1st token generation (fp16 weight)

#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.991        0.347        0.656        0.399
GPU 0     0.992        0.357        0.623        0.404
GPU 0     0.992        0.339        0.667        0.400
GPU 0     0.990        0.345        0.659        0.397
GPU 0     0.991        0.355        0.629        0.403

2nd token generation (fp16 weight)

#Entity   SMACT        SMOCC        TENSO        DRAMA
ID
GPU 0     0.929        0.312        0.063        0.820
GPU 0     0.929        0.315        0.063        0.821
GPU 0     0.930        0.316        0.063        0.822
GPU 0     0.929        0.316        0.063        0.821
GPU 0     0.928        0.318        0.063        0.820
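
For readers decoding the field IDs above: 1002/1003/1004/1005 are DCGM's SM activity (SMACT), SM occupancy (SMOCC), tensor-pipe activity (TENSO), and DRAM activity (DRAMA) profiling metrics. A small helper for averaging such a captured log, assuming the column layout shown above:

```python
# Average the SMACT/SMOCC/TENSO/DRAMA columns from a captured
# `dcgmi dmon -e 1002,1003,1004,1005` log with the layout shown above.
def average_dcgm_log(text):
    cols = ("SMACT", "SMOCC", "TENSO", "DRAMA")
    sums, n = [0.0, 0.0, 0.0, 0.0], 0
    for line in text.splitlines():
        parts = line.split()
        # data rows look like: "GPU 0  0.996  0.274  0.622  0.661"
        if len(parts) == 6 and parts[0] == "GPU":
            sums = [s + float(v) for s, v in zip(sums, parts[2:])]
            n += 1
    return {c: s / n for c, s in zip(cols, sums)} if n else {}
```

Averaging the samples above gives roughly TENSO 0.60 / DRAMA 0.66 for the int4 first token versus TENSO 0.65 / DRAMA 0.40 for fp16, which is the gap being discussed.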
byshiue commented 1 month ago

When generating the 1st output token, the TC utilization (TENSO) of LLaMA with int4 weight quantization is lower than with fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.

I don't get the point about the "general CNN models". What do you mean by it?

Also, what do you mean by "utilize weight reuse"?

youki-sada commented 1 month ago

what do you mean by "utilize weight reuse"?

I meant that computational intensity is high during first-token inference. Thus, I assumed DRAMA for the int4 first-token inference would be reduced and TENSO would be around 65%, as with fp16. But there is only about a 10% reduction in memory bandwidth, and I don't think that can be explained by casting cost alone. Maybe I need to check the CUDA implementation for further discussion.
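
A rough sketch of why the 4x weight-size reduction may not show up 1:1 in DRAM traffic during prefill (assumed LLaMA-7B-like shapes; ignores KV cache, attention, and L2 reuse): the [tokens, d] activations moved around each GEMM are comparable in size to the weights themselves, so total traffic shrinks much less than 4x.

```python
# Rough per-projection DRAM traffic during prefill: the [d, d] weights are
# read once, but the [tokens, d] fp16 activations are also read and written.
# Assumed shapes; ignores KV cache, attention, and cache reuse.
d, tokens, act_bytes = 4096, 1024, 2
for fmt, w_bytes in [("fp16", 2.0), ("int4", 0.5)]:
    weights     = d * d * w_bytes             # one [d, d] projection matrix
    activations = 2 * tokens * d * act_bytes  # read input + write output
    total_mb = (weights + activations) / 1e6
    print(f"{fmt}: weights {weights/1e6:5.1f} MB + activations "
          f"{activations/1e6:5.1f} MB = {total_mb:5.1f} MB per projection")
```

On top of that, DRAMA is a utilization rate rather than a byte count, so a longer int4 prefill (due to dequantization) further blurs a direct comparison of the two runs.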

Void1024 commented 1 month ago

Thus, I assumed DRAMA for the int4 first-token inference would be reduced and TENSO would be around 65%, as with fp16. But there is only about a 10% reduction in memory bandwidth, and I don't think that can be explained by casting cost alone.

Hi, I'm a little confused. Do you mean that, compared with subsequent inference, the first token with w4a16 should show a memory-throughput drop larger than 10%? Or do you think the compute throughput of the first token is lower than 65%, which is not in line with your expectations?