Closed · jiwonsong-dev closed this issue 4 months ago
The current demo may not show a speedup if the CPU is not fast enough: the software stack running on the CPU becomes the bottleneck, so the GPU kernel speedup does not translate into an end-to-end speedup. With faster CPU cores, you may observe the speedup.
Please note that this demo is just a proof-of-concept implementation. To fully translate the kernel speedup into an end-to-end speedup, it needs to be integrated with a high-performance inference engine like TensorRT-LLM, as reported in our paper. We are working on finalizing the code for TensorRT-LLM end-to-end integration, but it may take some time to release.
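One way to check whether a run is CPU-bound is to compare wall-clock time per decoding step against the GPU busy time reported by the PyTorch profiler. A minimal sketch along these lines, where `decode_step` is a placeholder for whatever the demo executes per generated token:

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def check_cpu_bound(decode_step, n_steps=50):
    # Warm up so one-time setup cost does not skew the numbers.
    for _ in range(5):
        decode_step()
    torch.cuda.synchronize()

    # Wall-clock time per step (includes all CPU-side work).
    t0 = time.perf_counter()
    for _ in range(n_steps):
        decode_step()
    torch.cuda.synchronize()
    wall_ms = (time.perf_counter() - t0) * 1e3 / n_steps

    # GPU busy time per step, summed from the profiler's CUDA totals.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(n_steps):
            decode_step()
        torch.cuda.synchronize()
    gpu_ms = sum(e.self_cuda_time_total for e in prof.key_averages()) / n_steps / 1e3

    print(f"wall: {wall_ms:.2f} ms/step, GPU busy: {gpu_ms:.2f} ms/step")
    # If the GPU busy time is far below the wall time, CPU-side work
    # (Python dispatch, kernel launches) dominates, and a faster GPU
    # kernel cannot improve end-to-end throughput.
```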
Thank you for the response. Waiting for TensorRT-LLM integration!
Hello. Is the TensorRT-LLM integration still ongoing?
It is currently delayed. I apologize for not being able to provide an exact timeline at this moment.
I ran demo.py with the LLaMA-2-7B model to check the throughput improvement across precisions. The 8-bit model is about 10% faster than the FP16 baseline, but all precisions from 3-bit to 8-bit show the same throughput. Tested on A6000 and A100 GPUs.
How can I resolve this problem?
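For reference, one way to separate the kernel speedup from the end-to-end number is to time a single quantized linear layer in isolation with CUDA events. A sketch below; `qlinear`, `fp16_linear`, and the input shape are illustrative placeholders, not names from the repo:

```python
import torch

def time_layer(layer, x, iters=200):
    # Time one forward pass with CUDA events, independent of the
    # Python-side generation loop.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(20):  # warm-up
        layer(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

# Hypothetical comparison of one quantized linear layer against its
# FP16 counterpart; `qlinear` / `fp16_linear` are placeholders.
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
# print(time_layer(qlinear, x), time_layer(fp16_linear, x))
```

If the quantized layer is faster in isolation but end-to-end throughput is flat across bit-widths, that points to the CPU-side bottleneck described above rather than a problem with the kernels.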