youki-sada opened this issue 1 month ago (Open)
Yes, it is caused by casting. For the long-context case, it is expected that int8/int4 would be slower than fp16.
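For illustration, here is a minimal PyTorch-style sketch of what a weight-only-quantized GEMM has to do; it is not TensorRT-LLM's actual kernel, and the shapes and per-channel scaling scheme are assumptions, but it shows where the casting cost comes from:

```python
import torch

def weight_only_gemm(x, w_q, scales):
    """Illustrative weight-only-quantized GEMM (not TensorRT-LLM's real kernel).

    x      : (tokens, in_features) fp16 activations
    w_q    : (out_features, in_features) int8 weights (int4 would be packed two per byte)
    scales : (out_features, 1) per-output-channel fp16 scales (an assumed scheme)
    """
    # The "casting" step: int8/int4 weights are dequantized to the activation dtype
    # before the multiply.  A real kernel fuses this cast into the GEMM, but the
    # per-element work is still there.  For decode (1 token) the GEMM is
    # memory-bound, so reading 2-4x fewer weight bytes wins; for prefill
    # (full context) the GEMM is compute-bound and the extra work shows up
    # as first-token overhead.
    w = w_q.to(x.dtype) * scales.to(x.dtype)
    return x @ w.t()

if torch.cuda.is_available():
    # Toy shapes: one 4096x4096 projection, prefill-like (2048 tokens) vs decode-like (1 token).
    w_q = torch.randint(-128, 127, (4096, 4096), dtype=torch.int8, device="cuda")
    scales = torch.rand(4096, 1, dtype=torch.float16, device="cuda")
    prefill_out = weight_only_gemm(
        torch.randn(2048, 4096, dtype=torch.float16, device="cuda"), w_q, scales)
    decode_out = weight_only_gemm(
        torch.randn(1, 4096, dtype=torch.float16, device="cuda"), w_q, scales)
```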
@byshiue Thank you for your reply. For generating the 1st output token, Tensor Core utilization (TENSO) of LLaMA with int4 weight-only quantization is lower than for fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.
$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity SMACT SMOCC TENSO DRAMA
ID
GPU 0 0.996 0.274 0.622 0.661
GPU 0 0.995 0.284 0.571 0.669
GPU 0 0.996 0.274 0.623 0.658
GPU 0 0.995 0.284 0.572 0.667
GPU 0 0.986 0.277 0.614 0.651
$ dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 200
#Entity SMACT SMOCC TENSO DRAMA
ID
GPU 0 0.912 0.600 0.175 0.775
GPU 0 0.910 0.601 0.174 0.775
GPU 0 0.912 0.603 0.172 0.776
GPU 0 0.916 0.607 0.173 0.777
GPU 0 0.916 0.607 0.171 0.777
#Entity SMACT SMOCC TENSO DRAMA
ID
GPU 0 0.991 0.347 0.656 0.399
GPU 0 0.992 0.357 0.623 0.404
GPU 0 0.992 0.339 0.667 0.400
GPU 0 0.990 0.345 0.659 0.397
GPU 0 0.991 0.355 0.629 0.403
#Entity SMACT SMOCC TENSO DRAMA
ID
GPU 0 0.929 0.312 0.063 0.820
GPU 0 0.929 0.315 0.063 0.821
GPU 0 0.930 0.316 0.063 0.822
GPU 0 0.929 0.316 0.063 0.821
GPU 0 0.928 0.318 0.063 0.820
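For reference, the DCGM field IDs requested above (1002, 1003, 1004, 1005) should correspond to SM activity (SMACT), SM occupancy (SMOCC), tensor-pipe activity (TENSO), and DRAM activity (DRAMA), each reported as a 0-1 fraction of the sampling interval. A small sketch for averaging a saved `dcgmi dmon` capture per run; the log file name and row layout are assumptions:

```python
# Average SMACT/SMOCC/TENSO/DRAMA over a saved `dcgmi dmon` capture
# (assumes the output above was saved with something like `... | tee dmon.log`).
from statistics import mean

samples = []
with open("dmon.log") as f:
    for line in f:
        parts = line.split()
        # Data rows look like: "GPU 0 0.996 0.274 0.622 0.661"
        if len(parts) >= 6 and parts[0] == "GPU":
            try:
                samples.append([float(v) for v in parts[2:6]])
            except ValueError:
                continue  # skip rows with N/A or other non-numeric fields

for name, vals in zip(["SMACT", "SMOCC", "TENSO", "DRAMA"], zip(*samples)):
    print(f"{name}: {mean(vals):.3f}")
```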
> For generating the 1st output token, Tensor Core utilization (TENSO) of LLaMA with int4 weight-only quantization is lower than for fp16, and also lower than for general CNN models. I assume the int4/int8 implementation should utilize weight reuse, so the casting cost should be much lower.
I don't get the point about the "general CNN models". What does it mean?
Also, what's the meaning of "weight reuse" here?
> What's the meaning of "weight reuse" here?
I meant that computational intensity is high in first-token inference. Thus, I assume the DRAMA of the int4 first-token inference should be reduced, and TENSO should be around 65%, the same as for fp16. But there is only a 10% reduction in memory bandwidth, and I think that cannot be explained by casting cost alone. Maybe I need to check the CUDA implementation for further discussion.
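To make that reasoning concrete, here is a rough back-of-the-envelope sketch of the arithmetic intensity of a weight-only linear layer in prefill vs decode; the 4096x4096 projection size and 2048-token context are my own illustrative assumptions, not figures from this issue:

```python
# Roofline-style estimate for one 4096x4096 linear layer (assumed size).
# FLOPs = 2 * tokens * d_in * d_out; weight traffic = d_in * d_out * bytes_per_weight.
def arithmetic_intensity(tokens, d_in=4096, d_out=4096, bytes_per_weight=2.0):
    flops = 2 * tokens * d_in * d_out
    # Count weight traffic only; for these shapes the activations are small
    # relative to the weights (a simplification).
    bytes_moved = d_in * d_out * bytes_per_weight
    return flops / bytes_moved

for label, bpw in [("fp16", 2.0), ("int4", 0.5)]:
    print(label,
          "prefill(2048 tok):", arithmetic_intensity(2048, bytes_per_weight=bpw),
          "decode(1 tok):", arithmetic_intensity(1, bytes_per_weight=bpw))

# Prefill intensity is thousands of FLOPs per weight byte in both formats, i.e.
# compute-bound, which is why one would expect int4 prefill to cut DRAMA sharply
# while TENSO stays near the fp16 level.  Decode is only a few FLOPs per byte,
# i.e. memory-bound, which is where int4's smaller weight footprint pays off.
```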
> Thus, I assume the DRAMA of the int4 first-token inference should be reduced, and TENSO should be around 65%, the same as for fp16. But there is only a 10% reduction in memory bandwidth, and I think that cannot be explained by casting cost alone.
Hi, I'm a little confused. Do you mean that, compared to subsequent inference, the memory throughput drop for w4a16's first token should be larger than 10%? Or do you think the compute throughput of the first token is lower than 65%, which is not in line with expectations?
Environment
Model
Script
examples/multimodal/run.py
Who can help?
@Tracin
Question
We measured the execution speed for generating the first token and for the tokens coming after it. Compared with fp16 latency, int8 and int4 latency are about 25% longer for the first token. Is this due to the casting time from int4/int8 to fp16? It is slower than I expected.
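For what it's worth, a minimal timing sketch for separating first-token (prefill) latency from average per-token (decode) latency when comparing the fp16 and int8/int4 engines. `prefill_fn` and `decode_fn` are placeholders for however your runner exposes those steps, not functions from examples/multimodal/run.py:

```python
import time
import torch

# No-op sync on CPU-only machines; on GPU, make sure queued kernels finish
# before reading the clock.
sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)

def measure_token_latencies(prefill_fn, decode_fn, num_new_tokens=32):
    """Time first-token (prefill) vs average subsequent-token (decode) latency.

    prefill_fn() runs the full-context forward pass and returns whatever state
    decode_fn needs; decode_fn(state) generates one more token.
    """
    sync()
    t0 = time.perf_counter()
    state = prefill_fn()
    sync()
    first_token_ms = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    for _ in range(num_new_tokens):
        state = decode_fn(state)
    sync()
    per_token_ms = (time.perf_counter() - t0) * 1e3 / num_new_tokens
    return first_token_ms, per_token_ms
```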