About the speedup

bytedance / decoupleQ

A quantization algorithm for LLM

Apache License 2.0

94 stars 5 forks

Open hikq123 opened 3 months ago

hikq123 commented 3 months ago

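Great work! Do you have any measured inference speedup on GPU? For example, what is the inference speed for Llama2-13B with an input length of 512 and an output length of 512? Thanks!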

MyPandaShaoxiang commented 3 months ago

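With the current interface, w2 is about 2x faster than fp16. We have not run a specific evaluation on Llama2-13b. We are in the process of releasing the algosearch version, which should raise the speedup over fp16 to around 3x.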

hikq123 commented 3 months ago

Do you mean a 2~3x improvement compared with TensorRT-LLM's fp16?

MyPandaShaoxiang commented 3 months ago

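"Do you mean a 2~3x improvement compared with TensorRT-LLM's fp16?"

The comparison is against torch fp16 performance. The fp16 performance of torch and TRT-LLM is similar.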

hikq123 commented 3 months ago

Looking forward to your performance results on TensorRT-LLM.