SNU-ARC / any-precision-llm

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

No speedup from quantization #3

Closed jiwonsong-dev closed 4 months ago

jiwonsong-dev commented 4 months ago

I ran demo.py with the LLaMA-2-7B model to check throughput improvement across precisions. The 8-bit model is about 10% faster than the FP16 baseline, but every precision from 3-bit to 8-bit shows the same throughput. Tested on A6000 and A100 GPUs.

How can I resolve this problem?

[screenshot: measured throughput for each precision]
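For reference, the per-precision measurement boils down to a loop like the sketch below. Here `model` and `input_ids` are assumed to be already loaded, and the `set_precision` switch plus the `generate` signature are illustrative assumptions, not the verified API of demo.py.

```python
import time
import torch

@torch.inference_mode()
def tokens_per_second(model, input_ids, new_tokens=128):
    # Wall-clock throughput of one greedy generation run.
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.perf_counter() - start)

for bits in range(3, 9):
    model.set_precision(bits)  # hypothetical per-bit-width switch
    print(f"{bits}-bit: {tokens_per_second(model, input_ids):.1f} tok/s")
```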

ilil96 commented 4 months ago

The current demo may not demonstrate a speedup if the CPU is not fast enough: the software stack running on the CPU becomes the bottleneck, so even with the GPU kernel speedup you may see no end-to-end improvement. With faster CPU cores, you might observe the speedup.
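One way to verify this on your own machine, independent of this repo: enqueue a batch of decode steps without synchronizing, then compare how long the CPU takes to launch the work against how long the GPU takes to finish it. A minimal sketch, assuming `step_fn` runs one forward pass and does not synchronize internally:

```python
import time
import torch

def cpu_or_gpu_bound(step_fn, iters=100):
    # Kernel launches are asynchronous, so the loop below measures only
    # how fast the CPU can submit work; the final synchronize measures
    # how long the GPU needs to drain what was submitted.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        step_fn()
    t_launch = time.perf_counter()   # CPU finished submitting kernels
    torch.cuda.synchronize()
    t_done = time.perf_counter()     # GPU finished executing them
    print(f"CPU launch: {(t_launch - t0) * 1e3 / iters:.2f} ms/step, "
          f"GPU tail: {(t_done - t_launch) * 1e3 / iters:.2f} ms/step")
```

If the GPU tail is near zero, the GPU was idle waiting for launches, and a faster low-bit kernel cannot raise end-to-end throughput.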

Please note that this demo is just a proof-of-concept implementation. To fully translate the kernel speedup into an end-to-end speedup, the kernels need to be integrated with a high-performance inference engine such as TensorRT-LLM, as reported in our paper. We are working on finalizing the code for the TensorRT-LLM end-to-end integration, but it may take some time to release.
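Until that integration lands, one generic way to shrink CPU launch overhead (a standard PyTorch technique, not part of this repo) is to capture the decode step with CUDA graphs, so each step costs a single launch. A rough sketch with a hypothetical `model` and fixed input shapes:

```python
import torch

# `model` is hypothetical; CUDA graphs require static shapes and buffers.
static_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda")

# Warm up on a side stream, as required before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = model(static_ids)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_logits = model(static_ids)

# Per step: copy the next token into the static buffer, replay the graph.
# The whole captured step is launched with a single CPU-side call.
def decode_step(next_ids):
    static_ids.copy_(next_ids)
    graph.replay()
    return static_logits
```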

jiwonsong-dev commented 4 months ago

Thank you for the response. Waiting for TensorRT-LLM integration!

jiwonsong-dev commented 2 months ago

Hello. Is the TensorRT-LLM integration still in progress?

ilil96 commented 2 months ago

> Is the TensorRT-LLM integration still in progress?

It is currently delayed. I apologize for not being able to provide an exact timeline at this moment.